Capturing Chemistry in XML/CML
ACS, March 2004

J. A. Townsend, S. E. Adams, J. M. Goodman, P. Murray-Rust, C. A. Waudby
Unilever Centre for Molecular Informatics, University of Cambridge

The Agony Of Publication - Loss

[Diagram labels: The World, Web Pages, The Lab, Sad Journals, The Scientist]

The Vision-1

Human-readable: mp 65-66 C
Machine-readable: <scalar dictRef="ccml:mp" units="units:c" minValue="65" maxValue="66" />
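A minimal sketch of how the machine-readable form on this slide could be generated from a melting range. It reads the slide's attribute names as `dictRef`, `units`, `minValue` and `maxValue`; the function name `melting_range_to_scalar` is hypothetical, and no CML namespace is declared, which a real CML document would need.

```python
import xml.etree.ElementTree as ET


def melting_range_to_scalar(min_c: float, max_c: float) -> str:
    """Serialise a melting range in degrees Celsius as a <scalar> element.

    Attribute names follow the slide; the missing namespace declaration
    is a simplification made for this sketch.
    """
    scalar = ET.Element(
        "scalar",
        {
            "dictRef": "ccml:mp",            # dictionary entry for melting point
            "units": "units:c",              # units dictionary entry for Celsius
            "minValue": format(min_c, "g"),  # "g" drops a trailing ".0"
            "maxValue": format(max_c, "g"),
        },
    )
    return ET.tostring(scalar, encoding="unicode")


if __name__ == "__main__":
    print(melting_range_to_scalar(65, 66))
```

Parsing the result back with `ET.fromstring` recovers the same four attributes, which is the property that makes the value reusable by other programs.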

The Vision-2

• Chemists can carry on doing what they want

But also:
✓ Reuse chemistry
✓ Archive data
✓ Ensure validity of data
✓ Create new sources of data / molecules

Our Approach

• Let chemists use familiar programs and document templates
• Focus on journal articles, theses and CompChem
• Create data for knowledge-based discovery
• Let computers do the work
• Evolution…

Machine Parsing of Chemistry

• Structured (CompChem)
• Semi-structured (Articles)
• Unstructured (Discussion)

MACHINE PARSING → Structured documents and data in XML

How?

Article (semi-structured): Abstract, Discussion, Experimental
Techniques: add structure; parse with regular expressions; legacy-to-CML converters

Regular Expressions

Melting point: two possible syntaxes
    m.p. > 23.5 °C
    mp 23.5 – 25 °C

    [Mm]\.?p\p{Punct}?\s+>?\s?\d*\.?\d?\s?-\s?\d*?\.?\d?\s°?\s?C

• [Mm]: capital or lowercase 'm'
• \.?: maybe '.'
• p: lowercase 'p'
• \p{Punct}?: any punctuation
• \s?: maybe whitespace
• \d*: 0 or more digits
• °?: maybe degrees sign
• C: capital 'C'
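The two melting point syntaxes above can be matched with a single pattern. The following is a sketch written for Python's `re` module, not the expression shown on the slide: `re` has no `\p{Punct}` class, so optional full stops are spelled out, and the named groups (`bound`, `low`, `high`) and the function name `parse_melting_point` are illustrative choices.

```python
import re

# Adapted pattern; an assumption-laden sketch, not the authors' expression.
MP_PATTERN = re.compile(
    r"""
    \b[Mm]\.?\s?p\.?                  # mp, m.p., M.p., m. p.
    \s*
    (?P<bound>[<>])?                  # optional bound, as in "m.p. > 23.5"
    \s*
    (?P<low>\d+(?:\.\d+)?)            # first (or only) temperature
    (?:\s*[-\u2013]\s*                # hyphen or en dash introduces a range
       (?P<high>\d+(?:\.\d+)?))?
    \s*°?\s*C\b                       # optional degree sign, capital C
    """,
    re.VERBOSE,
)


def parse_melting_point(text: str):
    """Return the bound marker and temperatures of the first melting point."""
    match = MP_PATTERN.search(text)
    if match is None:
        return None
    high = match.group("high")
    return {
        "bound": match.group("bound"),
        "min": float(match.group("low")),
        "max": float(high) if high is not None else None,
    }


if __name__ == "__main__":
    for sample in ("m.p. > 23.5 °C", "mp 23.5 \u2013 25 °C"):
        print(sample, "->", parse_melting_point(sample))
```

The leading `\b` keeps the pattern from firing inside words such as "temp", one of the false positives a production expression would have to guard against.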

CML - XML For Chemistry

• Based on W3C XML Schemas
• 300+ components
• Customisable
• Extensible through dictionaries
• Openly available software

J. Chem. Inf. Comp. Sci., 2003, 43, 757

The CML Family

Controlled XML namespaces:
• CMLCore – compounds and properties
• CMLReact – reactions
• CMLSpect – spectra*
• CMLComp – compChem
• CMLCryst – crystallography and condensed matter

Interoperates with HTML, MathML, SVG, *AniML+, *ThermoML$, etc.
+ spectra: ANSI/JCAMP
$ thermochemistry: NIST

J. Chem. Inf. Comp. Sci., 2003, 43, 757

Case Studies

• Parsing output from 750,000 MOPAC jobs
• High-throughput parsing of journals

CompChem Logs

Labelled fields: Coordinates, Calculation Type, Molecular Formula, Point Group, Total Energy, Dipole

Loss From CompChem

Labelled fields: Coordinates, Calculation Type, Molecular Formula, Dipole, Total Energy, Ionisation Potential

Parsing Data

CompChem Output → Parsers → CML

File sections: Input/jobControl, General, Coordinates, Energy Levels, Vibrations
Parsers: Energy Level, Vibration
CML targets: CMLCore, CMLCore, CMLComp, CMLSpect
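The first step of such a pipeline, turning labelled log lines into named values, might look like the sketch below. The log excerpt is invented for illustration and does not reproduce real MOPAC output; the `LABEL: value` layout, the key naming and the function name `parse_log` are all assumptions.

```python
import re

# Hypothetical log excerpt; real program output is laid out differently.
SAMPLE_LOG = """\
CALCULATION TYPE: PM3
MOLECULAR FORMULA: C2 H6 O
POINT GROUP: Cs
TOTAL ENERGY: -100.5 EV
DIPOLE: 1.5 DEBYE
"""

LINE = re.compile(r"^(?P<label>[A-Z ]+):\s*(?P<value>.+)$")
NUMERIC = re.compile(r"^(?P<number>-?\d+(?:\.\d+)?)\s+(?P<units>\S+)$")


def parse_log(text: str) -> dict:
    """Collect 'LABEL: value' lines into a dictionary.

    Values of the form '<number> <units>' become (float, units) tuples;
    everything else is kept as a string.
    """
    record = {}
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match is None:
            continue  # skip lines that are not labelled fields
        key = match.group("label").strip().lower().replace(" ", "_")
        value = match.group("value").strip()
        numeric = NUMERIC.match(value)
        if numeric:
            record[key] = (float(numeric.group("number")), numeric.group("units"))
        else:
            record[key] = value
    return record


if __name__ == "__main__":
    for key, value in parse_log(SAMPLE_LOG).items():
        print(key, "=", value)
```

Keeping the units next to each number at this stage is what later allows a value to be written out with a units reference instead of as bare text.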

Display Process 1

[Diagram labels: CompChem Log, Xindice, CML, XSLT]

Display Process 2

[Diagram labels: Display; CompChem Output; CML; File; Input/jobControl; CMLCore; Coordinates; CMLCore; 3D structure, electronic properties; XSLT; Normal modes; CMLComp; Energy Levels; Vibrations; CMLSpect; 2D structure, thermodynamic properties]

Parsing Data

Dictionary entry: The pointgroup of a molecule... The Schoenflies convention is normally used, but Hermann-Mauguin is also allowed.

Units entry: D [debye]; parentSI: c.m; multiplier: 3.335641E-30; CGS units for electric dipole

Dictionaries

Linked to CML schema; accesses CCML namespace

<scalar dictRef="ccml:mp" units="units:c" minValue="65" maxValue="66" />

Units dictionary:
id="celsius" name="Celsius" parentSI="k" multiplierToSI="1" constantToSI="273.15" abbreviation="C" unitType="temp"

id="meltrange" term="Melting range" definition="Minimum and maximum values of melting range in degrees Celsius"
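A units dictionary entry like the Celsius one carries enough information to convert a value to its parent SI unit. The sketch below assumes the conversion is `value * multiplierToSI + constantToSI`, which the slides do not state explicitly; the `Unit` class and the `UNITS` table are hypothetical, populated with the Celsius entry from this slide and the debye entry from the earlier dictionary slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """One units dictionary entry, with field names modelled on the slide."""

    id: str
    parent_si: str
    multiplier_to_si: float
    constant_to_si: float = 0.0

    def to_si(self, value: float) -> float:
        # Assumed conversion rule: scale first, then apply the offset.
        return value * self.multiplier_to_si + self.constant_to_si


UNITS = {
    "celsius": Unit("celsius", "k", 1.0, 273.15),
    "debye": Unit("debye", "c.m", 3.335641e-30),
}

if __name__ == "__main__":
    # The melting range from the <scalar> example, expressed in kelvin.
    low, high = (UNITS["celsius"].to_si(t) for t in (65.0, 66.0))
    print(f"melting range: {low:.2f} to {high:.2f} K")
    print(f"1 D = {UNITS['debye'].to_si(1.0):.6e} C.m")
```

Because the conversion lives in the dictionary rather than in each document, a value stored as `units="units:c"` stays in the author's units and is converted only when it is compared or searched.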

OSCAR

Open Source Chemistry Analysis Routines
Sponsored by the Royal Society of Chemistry (Cambridge)
Mounted on http://www.rsc.org/

Article Structure

Article: Front Matter, Abstract, Introduction, Discussion, Results, Experimental, References

Article Structure

Experimental: Set up, Compound Name, Synthesis, Analysis

Information Checked / Extracted

• Chemical name
• Elemental analysis
• Yield
• Optical rotation
• Boiling / melting point
• Refractive index
• Carbon NMR
• Rf value
• Hydrogen NMR
• Ultraviolet spectrometry
• Infrared spectrometry
• Nature (colour, state, modifiers, description, etc.)
• Mass spectrometry

OSCAR Parsing Data

Labelled data: H NMR, Nature, HRMS

OSCAR Data Found

Results from one paper

OSCAR Error Checking

Categories: Serious Error, Warning Type 1, Warning Type 2

~30 errors / warnings searched for

This article has:
• 4 errors
• 2 warnings (type 1)
• 30 warnings (type 2)

Elemental analysis incorrect – calculations are for a different molecular formula
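An elemental analysis check of the kind reported here can be sketched by recomputing the percentage composition from the stated molecular formula and comparing it with the reported figures. This is an illustration and not OSCAR's own routine: the function names are hypothetical, the formula parser handles only flat formulae such as `C6H12O6`, only four elements are tabulated, and the 0.4 percentage point tolerance is an assumed default.

```python
import re

# Standard atomic masses for a few common elements (g/mol).
ATOMIC_MASS = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999}


def parse_formula(formula: str) -> dict:
    """Count atoms in a flat formula such as 'C6H12O6' (no brackets)."""
    counts = {}
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + int(number or 1)
    return counts


def percent_composition(formula: str) -> dict:
    """Calculated mass percentage of each element in the formula."""
    counts = parse_formula(formula)
    total = sum(ATOMIC_MASS[el] * n for el, n in counts.items())
    return {el: 100.0 * ATOMIC_MASS[el] * n / total for el, n in counts.items()}


def check_elemental_analysis(formula: str, reported: dict, tolerance: float = 0.4) -> dict:
    """Flag, per element, whether the reported value fits the formula."""
    calculated = percent_composition(formula)
    return {el: abs(calculated[el] - value) <= tolerance for el, value in reported.items()}


if __name__ == "__main__":
    # Illustrative reported values, not taken from any article.
    print(check_elemental_analysis("C6H12O6", {"C": 40.0, "H": 6.7}))
    print(check_elemental_analysis("C6H12O6", {"C": 52.1, "H": 13.1}))
```

When every element fails by a wide margin, as in the second call, the likely cause is the one named on the slide: the calculated values belong to a different molecular formula.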

OSCAR Data Presentation

OSCAR Speed

High throughput, high precision

• A typical paper contains ca. 20 compounds
• JOC (Feb 2004) contains ~600 compounds: OSCAR could extract and tabulate in under 5 minutes
• OBC (Feb 2004) contains ~300 compounds: OSCAR could extract and tabulate in under 3 minutes

OSCAR Accuracy

• 92% of data correctly identified
• 3% incorrect author entry
• 5% missed
• False positives: 3%

437 items, ~10,000 data fields in test set, working with current regular expressions

XML-CML Databases

[Diagram labels: CompChem, CML, Journals, Theses, Xindice]

• XMLDb can support > 250,000 molecules
• Millisecond retrieval on INChI, properties
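Retrieval keyed on an identifier can be illustrated with an in-memory index. This sketch stands in for a real XML database and does not use Xindice; the record layout (an `identifier` child with `convention` and `value` attributes) is a simplified assumption, and the identifier strings are present-day standard InChI values, used only as sample keys.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified CML-like records; the layout is an assumption.
RECORDS = """
<list>
  <molecule id="m1" title="methane">
    <identifier convention="iupac:inchi" value="InChI=1S/CH4/h1H4"/>
  </molecule>
  <molecule id="m2" title="water">
    <identifier convention="iupac:inchi" value="InChI=1S/H2O/h1H2"/>
  </molecule>
</list>
"""


def build_inchi_index(xml_text: str) -> dict:
    """Map each identifier string to its <molecule> element."""
    root = ET.fromstring(xml_text)
    index = {}
    for molecule in root.iter("molecule"):
        identifier = molecule.find("identifier[@convention='iupac:inchi']")
        if identifier is not None:
            index[identifier.get("value")] = molecule
    return index


if __name__ == "__main__":
    index = build_inchi_index(RECORDS)
    hit = index.get("InChI=1S/H2O/h1H2")
    print(hit.get("id"), hit.get("title"))
```

A canonical identifier makes this possible: the same molecule drawn or named in different ways yields the same key, so a lookup is a single dictionary access instead of a structure comparison.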

Capturing Molecules

Encourage chemists to:
• Autogenerate IUPAC INChI universal identifier
• Embed MDL Mol or ChemDraw files in MS Word
• Autoconvert to CML connection table

Next phase:
• Parse chemical names into CML using modern NLP+
• Learning-machine rather than rule-based

+ Natural Language Processing

NLP & Parsing Names

KEY: Locant, Multiplier, Characteristic Group, Monovalent parent hydride, Heterocyclic parent hydride

Thank You

Unilever, RSC, Jonathan Goodman, Sam Adams, Fraser Norton, Chris Waudby, Yong Zhang