Extracting human protein interactions from MEDLINE using a full-sentence parser
Nikolai Daraselia, Anton Yuryev, Sergei Egorov, Svetlana Novichkova, Alexander Nikitin and Ilya Mazo
Presented by: Lily Hsu
March 30, 2004

Background
§ A number of available databases cover different aspects of protein function
§ These databases rarely store more than a few thousand of the best-known protein relationships
§ They are also not up to date
§ Extraction approaches range from statistical methods to Natural Language Processing (NLP) techniques

Ways to extract protein relations
§ Detect co-occurrence of protein names
§ Matching of pre-specified patterns or rules
§ Shallow parsing technique
§ Full-sentence parsing
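
The first approach in the list, co-occurrence detection, is simple enough to sketch in a few lines. The example below is only an illustration: the protein dictionary and the function name `cooccurring_pairs` are assumptions made here, not part of any system described in the talk.

```python
import itertools
import re

# Hypothetical toy dictionary; a real system would use a curated protein lexicon.
PROTEIN_NAMES = {"p53", "MDM2", "AKT1"}

def cooccurring_pairs(sentence, names=PROTEIN_NAMES):
    """Return pairs of known protein names that appear in the same sentence."""
    tokens = set(re.findall(r"[A-Za-z0-9]+", sentence))
    found = sorted(tokens & names)
    # Every unordered pair of names found in the sentence counts as a candidate relation.
    return list(itertools.combinations(found, 2))

print(cooccurring_pairs("MDM2 binds p53 and inhibits its activity."))
```

Co-occurrence reports that two names share a sentence but says nothing about the type or direction of the relation, which is the gap that pattern matching and parsing try to close.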

Proposed way of efficient extraction
§ Natural Language Processing (NLP)
  – Deals with domain-independent sentence structure decomposition
§ Information Extraction (2 steps)
  – Construction of the sentence argument structure using general-purpose, domain-independent parsers
  – Domain-specific frame extraction

MedScan
§ Extracts functional associations between proteins, cell processes, and small molecules
§ Recognizes the types of regulatory mechanisms involved and the effects of regulation

3 Major Components
§ Preprocessor: biomedical concept identification system
  – Reads the XML format of a MEDLINE record and splits the abstract into individual sentences
  – Selects sentences containing at least one protein name
§ NLP Component: syntactic and semantic processing of the text
  – Processes sentences from abstracts and produces a set of semantic structures representing the meaning of each sentence
  – 2 Steps

3 Major Components – 2 Steps
§ Syntactic parser constructs a set of alternative syntactic structures of an input sentence
§ Semantic processor transforms each of them into a corresponding semantic tree
§ Information Extraction
  – Utilizes a set of user-extendible tree-matching rules to validate and convert the structure of the semantic tree into normalized ontological knowledge
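
Of the three components, the preprocessor is the easiest to illustrate. The sketch below assumes a heavily simplified record (real MEDLINE XML carries many more elements), a toy protein dictionary, and a naive punctuation-based sentence splitter; the helper names `split_sentences` and `select_sentences` are made up for this example and are not MedScan's.

```python
import re
import xml.etree.ElementTree as ET

# Simplified, hypothetical record used only for illustration.
RECORD = """<MedlineCitation>
  <PMID>0000000</PMID>
  <Article><Abstract><AbstractText>MDM2 binds p53. The assay was repeated twice. AKT1 phosphorylates MDM2.</AbstractText></Abstract></Article>
</MedlineCitation>"""

PROTEIN_NAMES = {"p53", "MDM2", "AKT1"}  # toy dictionary, assumed for illustration

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def select_sentences(record_xml, names=PROTEIN_NAMES):
    """Keep only sentences that mention at least one known protein name."""
    root = ET.fromstring(record_xml)
    selected = []
    for node in root.iter("AbstractText"):
        for sentence in split_sentences(node.text or ""):
            tokens = set(re.findall(r"[A-Za-z0-9]+", sentence))
            if tokens & names:
                selected.append(sentence)
    return selected

for sentence in select_sentences(RECORD):
    print(sentence)
```

Only the selected sentences would be handed on to the parser, which keeps the expensive full-sentence analysis away from text that cannot yield a protein relation.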

3 Major Components
§ Information Extraction
  – Ontology: a collection of concepts representing domain-specific entities
    § Entity: protein or compound
    § Control: functions of proteins and compounds
  – Knowledge Base: assigns meanings (senses) to words and includes biological terms and the various relations between them
    § Template Lexemes: concepts described by the frames in the ontology
    § Connecting Lexemes: express relations between entities through ontological links

3 Major Components
§ Information Extraction
  – Ontological Interpretation: semantic tree-driven process performed recursively in a top-down manner (Fig 3)
    § Template Lexeme:
      – Find senses in the knowledge base that correspond to the parent and child lexemes
      – The appropriate slot of the parental sense is identified by comparing the label of the processed semantic vertex to the thematic slot labels
    § Connecting Lexeme:
      – A link is formed between the two child arguments of the lexeme
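
The recursive, top-down interpretation described on this slide can be sketched roughly as follows. Everything concrete here is an assumption made for illustration: the `Vertex` class, the layout of `KNOWLEDGE_BASE`, the slot labels `agent` and `patient`, and the dictionary shape of the resulting frames are not taken from the paper.

```python
from dataclasses import dataclass, field

@dataclass
class Vertex:
    """Node of a hypothetical semantic tree: a lexeme plus its role label."""
    lexeme: str
    label: str = "root"
    children: list = field(default_factory=list)

# Toy knowledge base mapping lexemes to senses; all entries are illustrative.
KNOWLEDGE_BASE = {
    "inhibit": {"kind": "template", "concept": "Inhibition", "slots": {"agent", "patient"}},
    "MDM2": {"kind": "template", "concept": "Protein:MDM2", "slots": set()},
    "p53": {"kind": "template", "concept": "Protein:p53", "slots": set()},
    "with": {"kind": "connecting", "concept": "Association", "slots": set()},
}

def interpret(vertex, kb=KNOWLEDGE_BASE):
    """Convert a semantic tree into a frame tree, top-down and recursively."""
    sense = kb.get(vertex.lexeme)
    if sense is None:
        return None  # no sense found for this lexeme
    interpreted = [(c.label, interpret(c, kb)) for c in vertex.children]
    interpreted = [(label, f) for label, f in interpreted if f is not None]
    if sense["kind"] == "connecting":
        # A connecting lexeme links exactly two child arguments.
        if len(interpreted) != 2:
            return None
        (_, left), (_, right) = interpreted
        return {"concept": sense["concept"], "link": (left, right)}
    frame = {"concept": sense["concept"], "slots": {}}
    for label, child_frame in interpreted:
        # Fill a slot only when the vertex label matches a thematic slot label.
        if label in sense["slots"]:
            frame["slots"][label] = child_frame
    return frame

tree = Vertex("inhibit", children=[Vertex("MDM2", "agent"), Vertex("p53", "patient")])
print(interpret(tree))
```

A child whose label matches no thematic slot is simply dropped in this sketch; how the real system treats such cases is not stated on the slide.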

3 Major Components
§ Information Extraction
  – Converter: converts the constructed frame tree into a generalized conceptual graph and records it in an XML-based format
    § Transforms a frame tree into a set of functional links between proteins, cellular processes, cellular components and small molecules by categorizing them into either Control or Entity groups
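
As a rough picture of what the converter does, the sketch below flattens a frame tree into functional links. The nested-dictionary frame layout, the `group` field, and the slot names `agent` and `patient` are hypothetical choices for this example; the XML serialization is omitted because its schema is not given in the talk.

```python
# Hypothetical frame tree; field names are assumptions made for illustration.
FRAME = {
    "concept": "Inhibition",
    "group": "Control",
    "slots": {
        "agent": {"concept": "MDM2", "group": "Entity", "slots": {}},
        "patient": {"concept": "p53", "group": "Entity", "slots": {}},
    },
}

def to_links(frame):
    """Flatten a frame tree into (control, source entity, target entity) triples."""
    links = []
    if frame["group"] == "Control":
        agent = frame["slots"].get("agent")
        patient = frame["slots"].get("patient")
        both_present = agent is not None and patient is not None
        if both_present and agent["group"] == "Entity" and patient["group"] == "Entity":
            links.append((frame["concept"], agent["concept"], patient["concept"]))
    # Recurse so that Controls nested inside slots are also reported.
    for child in frame["slots"].values():
        links.extend(to_links(child))
    return links

print(to_links(FRAME))
```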

Results
§ ~3.5 million MEDLINE abstracts
§ Preprocessor selected 3.4 million sentences containing at least one human protein
§ NLP component parsed 1.2 million of them and submitted them for extraction
§ 3601 total interactions, corresponding to 2976 distinct protein-protein interactions, were extracted
§ 91% precision, 21% recall
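
A few derived figures help put these numbers in context. The slide reports only precision and recall; the F-measure below is not stated in the source and is computed here purely as an illustration of how the two can be combined. The function name `f_measure` is an assumption of this sketch.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (beta=1 gives F1)."""
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

precision, recall = 0.91, 0.21          # values reported on the slide
f1 = f_measure(precision, recall)

parsed_fraction = 1.2e6 / 3.4e6          # parsed sentences / selected sentences
mentions_per_interaction = 3601 / 2976   # total extractions / distinct interactions

print(f"F1: {f1:.3f}")
print(f"Fraction of selected sentences parsed: {parsed_fraction:.3f}")
print(f"Extractions per distinct interaction: {mentions_per_interaction:.2f}")
```

With these inputs F1 comes out at about 0.34, far below the 91% precision, because the harmonic mean is dominated by the low recall. Roughly a third of the selected sentences were parsed, which sets a ceiling on recall before the extraction rules are even applied.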

What now? Concerns?
§ How do we find a balance between recall and precision?
§ Where does their data come from, and how did they do their comparisons?
§ Other problems???