Extracting human protein interactions from MEDLINE using a full-sentence parser Nikolai Daraselia, Anton Yuryev, Sergei Egorov, Svetlana Novichkova, Alexander Nikitin and Ilya Mazo Presented by: Lily Hsu March 30, 2004
Background § A number of available databases cover different aspects of protein function § These databases rarely store more than a few thousand of the best-known protein relationships § They are also not up to date § Approaches to automated extraction range from statistical methods to Natural Language Processing (NLP) techniques
Ways to extract protein relations § Detecting co-occurrence of protein names § Matching pre-specified patterns or rules § Shallow parsing § Full-sentence parsing
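To make the simplest of these approaches concrete, here is a minimal sketch of sentence-level co-occurrence detection. It is an illustration only, not code from the paper: the protein dictionary `PROTEIN_NAMES` and the function `cooccurring_pairs` are hypothetical names, and the tokenizer is a plain regular expression.

```python
import re
from itertools import combinations

# Hypothetical protein dictionary (assumption); a real system would use a curated lexicon.
PROTEIN_NAMES = {"TP53", "MDM2", "AKT1"}

def cooccurring_pairs(sentence, names=PROTEIN_NAMES):
    """Return every pair of known protein names that appear in the same sentence."""
    tokens = set(re.findall(r"[A-Za-z0-9]+", sentence))
    found = sorted(tokens & names)
    return list(combinations(found, 2))

pairs = cooccurring_pairs("MDM2 binds TP53 and promotes its degradation.")
print(pairs)  # [('MDM2', 'TP53')]
```

Co-occurrence reports a pair whenever two names share a sentence, so it cannot tell the direction or type of the relation and would also report a pair for a negated statement. That limitation is the motivation for the pattern-based and parsing-based approaches further down the list.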
Proposed way of efficient extraction § Natural Language Processing (NLP) – Deals with domain-independent sentence structure decomposition § Information Extraction (2 steps) – Construction of the sentence argument structure using general-purpose domain-independent parsers – Domain-specific frame extraction
MedScan § Extracts functional associations between proteins, cell processes, and small molecules; recognizes the types of regulatory mechanisms involved and the effects of regulation
3 Major Components § Preprocessor: biomedical concept identification system – Reads the XML format of a MEDLINE record and splits the abstract into individual sentences – Selects sentences containing at least one protein name § NLP Component: syntactic and semantic processing of the text – Processes sentences from abstracts and produces a set of semantic structures representing the meaning of each sentence – 2 Steps
3 Major Components § NLP Component (continued) – 2 Steps: (1) the syntactic parser constructs a set of alternative syntactic structures of an input sentence; (2) the semantic processor transforms each of them into a corresponding semantic tree § Information Extraction – Uses a set of user-extensible tree-matching rules to validate the semantic tree and convert its structure into normalized ontological knowledge
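The preprocessor stage (sentence splitting, then keeping only sentences that mention a protein) can be sketched as follows. This is a simplified illustration under stated assumptions, not MedScan's implementation: the abstract text is assumed to have already been pulled out of the MEDLINE XML record, the sentence splitter is naive, and `PROTEIN_NAMES`, `split_sentences` and `select_sentences` are hypothetical names.

```python
import re

# Hypothetical protein dictionary (assumption), standing in for a concept identification lexicon.
PROTEIN_NAMES = {"TP53", "MDM2"}

def split_sentences(abstract):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", abstract)
    return [p.strip() for p in parts if p.strip()]

def select_sentences(abstract, names=PROTEIN_NAMES):
    """Keep only the sentences that mention at least one known protein name."""
    kept = []
    for sentence in split_sentences(abstract):
        tokens = set(re.findall(r"[A-Za-z0-9]+", sentence))
        if tokens & names:
            kept.append(sentence)
    return kept

abstract = "We studied apoptosis. MDM2 inhibits TP53. Results are discussed."
print(select_sentences(abstract))  # ['MDM2 inhibits TP53.']
```

Filtering before parsing keeps the expensive full-sentence parser away from sentences that cannot yield a protein relation.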
3 Major Components § Information Extraction – Ontology: a collection of concepts representing domain-specific entities § Entity: protein or compound § Control: functions of proteins and compounds – Knowledge Base: assigns meanings (senses) to words and includes biological terms and the various relations between them § Template Lexemes: concepts described by the frames in the ontology § Connecting Lexemes: express relations between entities through the ontological links
3 Major Components § Information Extraction – Ontological Interpretation: semantic tree-driven process performed recursively in a top-down manner (Fig 3) § Template Lexeme: – Find senses in the knowledge base that correspond to the parent and child lexemes – The appropriate slot of the parental sense is identified by comparing the label of the processed semantic vertex to the thematic slot labels § Connecting Lexeme: – A link is formed between the two child arguments of the lexeme
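The template-lexeme case of this recursive, top-down interpretation might look roughly like the sketch below. Everything in it is an assumption made for illustration: the `Node` class, the toy `KNOWLEDGE_BASE`, the sense names, and the slot labels `agent` and `patient` are hypothetical, and connecting lexemes are left out.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    lexeme: str                  # word at this semantic-tree vertex
    label: str = "root"          # thematic label linking the vertex to its parent
    children: list = field(default_factory=list)

# Hypothetical knowledge base (assumption): lexeme -> (sense, thematic slot labels).
KNOWLEDGE_BASE = {
    "inhibit": ("Regulation.negative", {"agent", "patient"}),
    "MDM2": ("Protein.MDM2", set()),
    "TP53": ("Protein.TP53", set()),
}

def interpret(node):
    """Look up the sense of a vertex, then recursively fill each slot of that
    sense whose label matches the label of a child vertex."""
    sense, slot_labels = KNOWLEDGE_BASE[node.lexeme]
    frame = {"sense": sense, "slots": {}}
    for child in node.children:
        if child.label in slot_labels and child.lexeme in KNOWLEDGE_BASE:
            frame["slots"][child.label] = interpret(child)
    return frame

tree = Node("inhibit", children=[Node("MDM2", "agent"), Node("TP53", "patient")])
frame = interpret(tree)
print(frame["slots"]["agent"]["sense"])  # Protein.MDM2
```

Children whose label matches no slot, or whose lexeme has no sense in the knowledge base, are skipped in this sketch; how the real system handles such cases is not covered in these slides.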
3 Major Components § Information Extraction – Converter: converts the constructed frame tree into a generalized conceptual graph and records it in an XML-based format § Transforms a frame tree into a set of functional links between proteins, cellular processes, cellular components and small molecules by categorizing them into either Control or Entity groups
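A rough sketch of what such a conversion could look like is given below: a nested frame is flattened into links, and each sense is assigned to the Entity or Control group. The frame layout, the `Protein.`/`Compound.` naming scheme, and the functions `categorize` and `frame_to_links` are all hypothetical, and the XML serialization step is omitted.

```python
# Hypothetical naming scheme (assumption): entity senses start with these prefixes.
ENTITY_PREFIXES = ("Protein.", "Compound.")

def categorize(sense):
    """Assign a sense to the Entity group or the Control group."""
    return "Entity" if sense.startswith(ENTITY_PREFIXES) else "Control"

def frame_to_links(frame):
    """Flatten a nested frame into (agent, control, patient) links."""
    links = []
    slots = frame.get("slots", {})
    if "agent" in slots and "patient" in slots:
        links.append((slots["agent"]["sense"], frame["sense"], slots["patient"]["sense"]))
    for child in slots.values():
        links.extend(frame_to_links(child))  # recurse into nested frames
    return links

# Example frame (assumption): a regulation with two protein arguments.
frame = {
    "sense": "Regulation.negative",
    "slots": {
        "agent": {"sense": "Protein.MDM2", "slots": {}},
        "patient": {"sense": "Protein.TP53", "slots": {}},
    },
}
print(frame_to_links(frame))  # [('Protein.MDM2', 'Regulation.negative', 'Protein.TP53')]
print(categorize("Regulation.negative"))  # Control
```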
Results § ~3.5 million MEDLINE abstracts § The preprocessor selected 3.4 million sentences containing at least one human protein § The NLP component parsed 1.2 million of them and submitted them for extraction § 3601 total interactions, corresponding to 2976 distinct protein-protein interactions, were extracted § 91% precision, 21% recall
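To put the reported figures in perspective, the short calculation below derives two numbers that are not stated on the slide: the F1 score (harmonic mean of precision and recall) and the fraction of selected sentences that the parser handled. Both are simple arithmetic on the reported values; the function name `f1_score` is illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall.
    precision = TP / (TP + FP), recall = TP / (TP + FN)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

precision, recall = 0.91, 0.21        # values reported on the slide
f1 = f1_score(precision, recall)      # derived here, not reported in the slides
parsed_fraction = 1.2e6 / 3.4e6       # parsed sentences / selected sentences

print(f"F1 = {f1:.2f}")                            # F1 = 0.34
print(f"parsed fraction = {parsed_fraction:.2f}")  # parsed fraction = 0.35
```

The low F1 despite high precision shows how strongly the 21% recall dominates the combined score, and roughly two thirds of the selected sentences never reach the extraction step at all.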
What now? Concerns? § How do we find a balance between recall and precision? § Where does their data come from, and how did they do their comparisons? § Other problems?