Evaluation of different benchmark sets and evaluation methods for automatic extraction of chemical entities from text and image
Marc Zimmermann, Martin Hofmann-Apitius
Bonn-Aachen Center for Information Technology, University of Bonn
5th Meeting on U.S. Government Chemical Databases and Open Chemistry, August 2011
Bonn-Aachen International Center for Information Technology
Chemical Structure Reconstruction
SCAIView: gene + protein index (Lucene + semantic entities)
• Link out to biological reference databases
• Entities handled by ProMiner
• Semantic tagging
Beautiful Artwork But Wrong Molecule
Automatic Binning of Images
• Database
• Curation
• Trash
Challenge
• Problem
– Predict the quality of the reconstruction result without a reference molecule
• Solution
– Machine learning
• Expected results
– Quality of new reconstructions estimated by trained models
The Evaluation Concept of SCAI and InfoChem
[Workflow diagram with steps numbered 1 to 5. Labels: Manual abstraction of chemical names; Comparison (quantitative); NER; PDF to text; N2S; ICANNOTATOR; Automatic chemical verification; Page segmentation; Image classifier; Chemical recognition, chemoCR (Fraunhofer); Manual abstraction of structures from images; Comparison (quantitative), “SimilarityMCD”; Database]
Chemistry to be defined: Examples from Patents (I)
Chemistry to be defined: Examples from Patents (II)
Quality Measure: Graph Matching
• SimilarityMCD (Minimal Chemical Distance)
– Module from InfoChem
• Graph matching on
– Reconstruction result of chemoCR
– The reference molecule
• Results in
– Numerical value in [0, 1]
[Figure: example results labeled “OK” and “bad”]
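The idea of scoring a reconstruction against a reference with a value in [0, 1] can be illustrated with a much cruder stand-in. The sketch below is not SimilarityMCD and performs no graph matching: it only compares multisets of bonds (element pair plus bond order) with a Tanimoto-style ratio. The function names and the toy molecules are assumptions made for illustration.

```python
from collections import Counter

def bond_multiset(bonds):
    """Count bonds as (element, element, order), with the element pair sorted."""
    return Counter((*sorted((a, b)), order) for a, b, order in bonds)

def bond_overlap(reconstruction, reference):
    """Shared bonds divided by all bonds (multiset intersection over union)."""
    rec, ref = bond_multiset(reconstruction), bond_multiset(reference)
    union = sum((rec | ref).values())
    if union == 0:
        return 1.0  # two empty molecules are treated as identical
    return sum((rec & ref).values()) / union

# Toy example (heavy atoms only): the reference has a C-O single bond,
# the reconstruction wrongly reports a double bond.
reference = [("C", "C", 1), ("C", "O", 1)]
reconstruction = [("C", "C", 1), ("O", "C", 2)]
score = bond_overlap(reconstruction, reference)
print(f"overlap score: {score:.2f}")
```

Because the score ignores connectivity, two different molecules with the same bond inventory would score 1.0; a real graph-matching measure does not have that weakness.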
Chemical Error Classification Scheme
• MISSED
  – BOND_MISSED
    • COMPLETE_BOND_MISSED
    • ORDER_BOND_MISSED
    • CHIRAL_BOND_MISSED
  – SYMBOL_MISSED
    • ATOM_SYMBOL_MISSED
    • ISOTOPE_SYMBOL_MISSED
    • CHARGE_SYMBOL_MISSED
    • RADICAL_SYMBOL_MISSED
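One possible way to hold this scheme in code is a nested mapping plus a parent index, so that leaf labels can be rolled up to their category. This is a minimal sketch: only the label names come from the slide, while the helper names and the list of observed errors are hypothetical.

```python
from collections import Counter

# Label names are taken from the slide; the nesting mirrors its indentation.
ERROR_SCHEME = {
    "MISSED": {
        "BOND_MISSED": [
            "COMPLETE_BOND_MISSED",
            "ORDER_BOND_MISSED",
            "CHIRAL_BOND_MISSED",
        ],
        "SYMBOL_MISSED": [
            "ATOM_SYMBOL_MISSED",
            "ISOTOPE_SYMBOL_MISSED",
            "CHARGE_SYMBOL_MISSED",
            "RADICAL_SYMBOL_MISSED",
        ],
    },
}

def build_parent_index(scheme):
    """Map every label to its parent label (None for top-level labels)."""
    parents = {}
    for top, groups in scheme.items():
        parents[top] = None
        for group, leaves in groups.items():
            parents[group] = top
            for leaf in leaves:
                parents[leaf] = group
    return parents

PARENT = build_parent_index(ERROR_SCHEME)

def path_to(label):
    """Return the labels from the top-level class down to the given label."""
    path = []
    while label is not None:
        path.append(label)
        label = PARENT[label]
    return path[::-1]

def count_by_group(labels):
    """Roll leaf-level error labels up to their direct parent category."""
    return Counter(PARENT[label] for label in labels)

# Hypothetical errors observed for one reconstruction.
observed = ["ORDER_BOND_MISSED", "CHARGE_SYMBOL_MISSED", "ORDER_BOND_MISSED"]
print(path_to("ORDER_BOND_MISSED"))
print(count_by_group(observed))
```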
Mapping of Reaction Schemes with Spatial Constraints
[Figure: reconstruction and reference]
Mining of Chemical Names
• Chemical names should be found in the text
• Synonyms and spelling variations in different databases
– Sodium lauryl sulfate (DB00815, DrugBank): 230 brand names and 26 synonyms
• Several text mining techniques developed
Compounds Sharing a Synonym: “Livesan”
• Entry DB00436, Bendroflumethiazide, from DrugBank
• Entry C07586, Procetofen, from KEGG Compound
Task: Generating a Dictionary
– Built from rather reliable data sources
– Recognizes different chemical names referring to the same structure and maps them to a unique identifier (UID)
• (-)-Epiafzelechin
• epi-Afzelechin
UID 1
• 5-(1-cycloheptenyl)-5-ethyl-1,3-diazinane-2,4,6-trione
• Heptabarbital
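The dictionary task can be sketched as a lookup table from normalized names to identifiers. This is a minimal illustration, not the authors' implementation: the class, the normalization rule (lowercasing and whitespace collapsing), and the identifier "UID 2" are assumptions; only "UID 1" and the names appear on the slide.

```python
def normalize(name):
    """Lowercase and collapse whitespace so trivial spelling variants match."""
    return " ".join(name.lower().split())

class ChemicalDictionary:
    """Maps chemical names to the identifiers of the structures they denote."""

    def __init__(self):
        self._name_to_uids = {}

    def add(self, uid, names):
        for name in names:
            self._name_to_uids.setdefault(normalize(name), set()).add(uid)

    def lookup(self, name):
        """Return all identifiers registered for a name (empty if unknown)."""
        return sorted(self._name_to_uids.get(normalize(name), set()))

dictionary = ChemicalDictionary()
dictionary.add("UID 1", ["(-)-Epiafzelechin", "epi-Afzelechin"])
# "UID 2" is a hypothetical identifier; the slide does not label this group.
dictionary.add("UID 2", [
    "5-(1-cycloheptenyl)-5-ethyl-1,3-diazinane-2,4,6-trione",
    "Heptabarbital",
])

print(dictionary.lookup("EPI-afzelechin"))
print(dictionary.lookup("heptabarbital"))
```

Returning a list rather than a single identifier keeps ambiguous names visible, which matters when one synonym is shared by different compounds.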
Different Mapping Approaches
• Synonym based
• Interlink based
• Structure based
Example:
– C02265: D-Phenylalanine; D-alpha-Amino-beta-phenylpropionic acid
– DB02556: D-Phenylalanine; (2R)-2-amino-3-phenylpropanoic acid
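The synonym-based approach can be sketched as grouping entries that share at least one name. The example below uses a small union-find structure and case-insensitive exact matching, both of which are assumptions for illustration; the two phenylalanine entries and the DB00815 entry name come from the slides.

```python
from collections import defaultdict

def group_by_shared_synonym(entries):
    """Group entry IDs that are connected through at least one shared name."""
    parent = {entry_id: entry_id for entry_id in entries}

    def find(x):
        # Follow parent links to the group representative (with path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_seen = {}  # normalized name -> first entry that used it
    for entry_id, names in entries.items():
        for name in names:
            key = name.strip().lower()
            if key in first_seen:
                parent[find(entry_id)] = find(first_seen[key])
            else:
                first_seen[key] = entry_id

    groups = defaultdict(set)
    for entry_id in entries:
        groups[find(entry_id)].add(entry_id)
    return sorted(sorted(group) for group in groups.values())

entries = {
    "C02265": ["D-Phenylalanine", "D-alpha-Amino-beta-phenylpropionic acid"],
    "DB02556": ["D-Phenylalanine", "(2R)-2-amino-3-phenylpropanoic acid"],
    "DB00815": ["Sodium lauryl sulfate"],
}
groups = group_by_shared_synonym(entries)
print(groups)
```

The same rule would also merge DB00436 and C07586 through the shared synonym “Livesan”, which is exactly the failure mode the synonym slide warns about.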
Interlink Based Approach
• Non-unified approach towards parametric isomers
• Links structurally different compounds: D02592 from KEGG Drug and DB01234 from DrugBank
Problem: Merging Data Sources to UID
Identity problem (“parametric isomers”):
– Stereochemistry
– Tautomerism
– Charges
– Isotopes
– Mixtures
– Polymers
– Aromaticity
– Markush structures
Workflow
Importing SDF into SQL Schema
• Sources: KEGG COMPOUND (2), KEGG DRUG (2), DrugBank (1)
• Input formats: SDF files, Drugcard files
1. http://www.drugbank.ca/ (last accessed August 2010)
2. http://kegg.jp/ (last accessed August 2010)
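An import step of this kind might look roughly like the sketch below, which splits an SD file into records, keeps the mol block as text, and stores the data items in SQLite. The table layout, the tag names ENTRY_ID and NAME, and the sample record (an empty mol block) are all assumptions for illustration; the actual schema used in the workflow is not given on the slide.

```python
import sqlite3

def parse_sdf(text):
    """Split SD file text into (molblock, data_fields) tuples."""
    records = []
    for chunk in text.split("$$$$"):
        lines = chunk.strip("\n").split("\n")
        if not any(line.strip() for line in lines):
            continue  # skip the empty tail after the last delimiter
        molblock, fields, tag, in_mol = [], {}, None, True
        for line in lines:
            if in_mol:
                molblock.append(line)
                if line.startswith("M  END"):
                    in_mol = False
            elif line.startswith(">") and "<" in line:
                start = line.index("<")
                tag = line[start + 1 : line.index(">", start)]
                fields[tag] = []
            elif not line.strip():
                tag = None  # a blank line ends the current data item
            elif tag is not None:
                fields[tag].append(line)
        data = {key: "\n".join(value) for key, value in fields.items()}
        records.append(("\n".join(molblock), data))
    return records

# Made-up sample record; tag names are hypothetical, not a real database export.
SAMPLE_SDF = """\
example-1
  sketch

  0  0  0  0  0  0  0  0  0  0999 V2000
M  END
> <ENTRY_ID>
X0001

> <NAME>
Example compound

$$$$
"""

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE compound (source TEXT, entry_id TEXT, name TEXT, molblock TEXT)"
)
for molblock, data in parse_sdf(SAMPLE_SDF):
    conn.execute(
        "INSERT INTO compound VALUES (?, ?, ?, ?)",
        ("EXAMPLE_SOURCE", data.get("ENTRY_ID"), data.get("NAME"), molblock),
    )
conn.commit()
rows = conn.execute("SELECT source, entry_id, name FROM compound").fetchall()
print(rows)
```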
Dictionary Comparison
Entries are transformed into binary correspondences: all possible pairs between the compounds from one entry.
Dictionary 1, Entry 1: Compound 1, Compound 2, Compound 3
Dictionary 2, Entry 1: Compound 1, Compound 2, Compound 4
Binary correspondences of Dictionary 1, Entry 1 (status in Dictionary 2):
– Compound 1, Compound 2: present
– Compound 1, Compound 3: absent
– Compound 2, Compound 3: absent
Binary correspondences of Dictionary 2, Entry 1 (status in Dictionary 1):
– Compound 1, Compound 2: present
– Compound 1, Compound 4: absent
– Compound 2, Compound 4: absent
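The pairwise comparison described on this slide can be reproduced in a few lines. The sketch below assumes each dictionary is a list of entries and each entry a list of compound labels; the function name is illustrative.

```python
from itertools import combinations

def binary_correspondences(entries):
    """Return all unordered compound pairs that co-occur in some entry."""
    pairs = set()
    for compounds in entries:
        # Sorting makes (a, b) and (b, a) the same pair.
        for a, b in combinations(sorted(set(compounds)), 2):
            pairs.add((a, b))
    return pairs

dictionary_1 = [["Compound 1", "Compound 2", "Compound 3"]]
dictionary_2 = [["Compound 1", "Compound 2", "Compound 4"]]

pairs_1 = binary_correspondences(dictionary_1)
pairs_2 = binary_correspondences(dictionary_2)

shared = pairs_1 & pairs_2   # present in both dictionaries
only_1 = pairs_1 - pairs_2   # absent from Dictionary 2
only_2 = pairs_2 - pairs_1   # absent from Dictionary 1

print("shared:", sorted(shared))
print("only in dictionary 1:", sorted(only_1))
print("only in dictionary 2:", sorted(only_2))
```

Comparing pairs rather than whole entries gives partial credit when two dictionaries agree on part of a synonym group, which is what makes an overlap between sources measurable.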
Overlap of Binary Correspondences: DrugBank & KEGG
The Open Pharmacological Concepts Triple Store
Prototype:
• Develop a set of robust standards…
• Implement the standards in a semantic integration hub (“Open Pharmacological Space”)…
• Deliver services to support on-going drug discovery programs in pharma and public domain…
www.openphacts.org
Conclusions
• Chemical information extraction is an ongoing effort
• Task is challenging
• In need of critical assessments and gold standards (http://trec.nist.gov/)
– Structure reconstruction
– Database mapping
– Retrieval tasks
• In need of strategies
– Deal with reconstruction errors
– Extended file formats & search algorithms
– Result visualizations