Capturing Chemistry in XML/CML
ACS, March 2004
J. A. Townsend, S. E. Adams, J. M. Goodman, P. Murray-Rust, C. A. Waudby
Unilever Centre for Molecular Informatics, University of Cambridge
The Agony Of Publication - Loss [diagram labels: The World, Web Pages, The Lab, Sad, Journals, The Scientist]
The Vision-1
Human-readable: mp 65-66 C
Machine-readable: <scalar dictRef="ccml:mp" units="units:c" minValue="65" maxValue="66"/>
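To illustrate why the machine-readable form is useful, here is a minimal Python sketch that reads the scalar element from the slide with the standard library and recovers the melting range as numbers. The attribute names are copied from the slide; the function name `read_range` is hypothetical, and the `ccml:` and `units:` prefixes are treated as plain strings rather than resolved against any dictionary.

```python
import xml.etree.ElementTree as ET

# Element as shown on the slide (attribute names copied from it).
SCALAR = '<scalar dictRef="ccml:mp" units="units:c" minValue="65" maxValue="66"/>'


def read_range(xml_text):
    """Return (property, units, low, high) from a single scalar element.

    Hypothetical helper for illustration; it is not part of any CML toolkit.
    """
    elem = ET.fromstring(xml_text)
    return (
        elem.get("dictRef"),
        elem.get("units"),
        float(elem.get("minValue")),
        float(elem.get("maxValue")),
    )


prop, units, low, high = read_range(SCALAR)
print(prop, units, low, high)
```

The same range written as "mp 65-66 C" would first need text parsing before a program could compare or search it.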
The Vision-2
• Chemists can carry on doing what they want
But also:
✓ Reuse chemistry
✓ Archive data
✓ Ensure validity of data
✓ Create new sources of data / molecules
Our Approach
• Let chemists use familiar programs … and document templates
• Focus on Journal Articles, Theses, CompChem
• Create data for knowledge-based discovery
• Let computers do the work
• Evolution…
Machine Parsing of Chemistry [diagram labels: Structured (CompChem); Semi-Structured (Articles); Unstructured (Discussion); MACHINE PARSING; ?; Structured documents and data in XML]
How? [diagram labels: Article (semistructured): Abstract, Discussion, Experimental; Add Structure; Parse with Regular Expressions; Legacy to CML converters]
Regular Expressions
Melting point: two possible syntaxes
m.p. > 23.5 °C
mp 23.5 – 25 °C
[Mm]\.?p\p{Punct}?\s+>?\s?\d*\.?\d?\s?-\s?\d*?\.?\d?\s°?\s?C
Annotations: [Mm] capital or lowercase 'm'; \.? maybe '.'; p lowercase 'p'; \p{Punct}? any punctuation; \s? maybe whitespace; \d* 0 or more digits; °? maybe degrees sign; C capital 'C'
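The idea on this slide can be sketched in Python as follows. This is an adaptation written for illustration, not the authors' pattern: Python's `re` module has no `\p{Punct}` class, so the punctuation is matched as an optional full stop, and the pattern, group names, and `parse_mp` function are all assumptions. It accepts both syntaxes shown above, with either a hyphen or an en dash in the range.

```python
import re

# Illustrative pattern (not the one from the slide): handles
# 'm.p. > 23.5 °C' and 'mp 23.5 – 25 °C'.
MP_PATTERN = re.compile(
    r"""
    \b[Mm]\.?\s?p\.?                 # mp, Mp, m.p. or m. p.
    \s*
    (?P<gt>>)?                       # optional greater-than sign
    \s*
    (?P<low>\d+(?:\.\d+)?)           # first temperature
    (?:\s*[-\u2013]\s*               # optional range: hyphen or en dash
       (?P<high>\d+(?:\.\d+)?))?
    \s*°?\s*C                        # optional degree sign, then capital C
    """,
    re.VERBOSE,
)


def parse_mp(text):
    """Return the first melting point found in text, or None."""
    match = MP_PATTERN.search(text)
    if match is None:
        return None
    high = match.group("high")
    return {
        "low": float(match.group("low")),
        "high": float(high) if high is not None else None,
        "greater_than": match.group("gt") is not None,
    }


for sample in ("m.p. > 23.5 °C", "mp 23.5 \u2013 25 °C", "white solid, mp 65-66 C"):
    print(sample, "->", parse_mp(sample))
```

Named groups keep the extracted values easy to map onto attributes such as minValue and maxValue.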
CML - XML For Chemistry
• Based on W3C XML Schemas
• 300+ components
• Customisable
• Extensible through dictionaries
• Openly available software
J. Chem. Inf. Comp. Sci., 2003, 43, 757
The CML Family
Controlled XML namespaces:
• CMLCore – compounds and properties
• CMLReact – reactions
• CMLSpect – spectra*
• CMLComp – CompChem
• CMLCryst – crystallography and condensed matter
Interoperates with HTML, MathML, SVG, *AniML+, *ThermoML$, etc.
+ spectra: ANSI/JCAMP
$ thermochemistry: NIST
J. Chem. Inf. Comp. Sci., 2003, 43, 757
Case Studies
• Parsing output from 750,000 MOPAC jobs
• High-throughput parsing of journals
CompChem Logs [figure labels: Coordinates, Calculation Type, Molecular Formula, Point Group, Total Energy, Dipole]
Loss From CompChem [figure labels: Coordinates, Calculation Type, Molecular Formula, Dipole, Total Energy, Ionisation Potential]
Parsing Data [diagram: CompChem Output → Parsers → CML File]
• Input/jobControl → General → CMLCore
• Coordinates → CMLCore
• Energy Levels → Energy Level → CMLComp
• Vibrations → Vibration → CMLSpect
Display Process 1 [diagram labels: CompChem Log, Xindice, CML, XSLT]
Display Process 2 [diagram labels: Display; CompChem Output; CML File; Input/jobControl; CMLCore; Coordinates; CMLCore; 3D structure, electronic properties; XSLT; Normal modes; CMLComp; Energy Levels; Vibrations; CMLSpect; 2D structure, thermodynamic properties]
Parsing Data
Dictionary Entry: The pointgroup of a molecule ... The Schoenflies convention is normally used, but Hermann-Mauguin is also allowed.
D [debye]: ParentSI: c.m; Multiplier: 3.335641E-30; CGS units for electric dipole
Dictionaries
Linked to CML schema
Accesses CCML namespace: <scalar dictRef="ccml:mp" units="units:c" minValue="65" maxValue="66"/>
Units dictionary: id="celsius" name="Celsius" parentSI="k" multiplierToSI="1" constantToSI="273.15" abbreviation="C" unitType="temp"
id="meltrange" term="Melting range" definition="Minimum and maximum values of melting range in degrees Celsius"
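A units dictionary of this kind lets software convert values to SI without hard-coding each unit. The sketch below assumes the conversion is SI value = value × multiplierToSI + constantToSI, which is inferred from the attribute names and is not stated on the slides. The `UNITS` table and `to_si` function are hypothetical; the debye entry from the earlier slide lists only a multiplier, so a zero offset is assumed for it.

```python
# Entries transcribed from the slides; keys follow the units dictionary attributes.
UNITS = {
    "celsius": {"parentSI": "k", "multiplierToSI": 1.0, "constantToSI": 273.15},
    # Assumption: the debye entry has no additive constant.
    "debye": {"parentSI": "c.m", "multiplierToSI": 3.335641e-30, "constantToSI": 0.0},
}


def to_si(value, unit_id):
    """Convert value to its parent SI unit (assumed linear conversion)."""
    unit = UNITS[unit_id]
    si_value = value * unit["multiplierToSI"] + unit["constantToSI"]
    return si_value, unit["parentSI"]


# Melting range from the scalar element, converted to kelvin.
print(to_si(65.0, "celsius"), to_si(66.0, "celsius"))
# A dipole moment of 1.5 D converted to coulomb metres.
print(to_si(1.5, "debye"))
```

Because the conversion rule lives in data rather than code, adding a new unit only requires a new dictionary entry.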
OSCAR
Open Source Chemistry Analysis Routines
Sponsored by the Royal Society of Chemistry (Cambridge)
Mounted on http://www.rsc.org/
Article Structure [diagram: Article → Front Matter, Abstract, Introduction, Discussion, Results, Experimental, References]
Article Structure [diagram: Article → Front Matter, Abstract, Introduction, Discussion, Results, Experimental, References; Experimental → Set up, Compound Name, Synthesis, Analysis]
Information Checked / Extracted
• Chemical name
• Elemental Analysis
• Yield
• Optical Rotation
• Boiling / Melting point
• Refractive Index
• Carbon NMR
• Rf value
• Hydrogen NMR
• Ultra Violet spectrometry
• Infra Red spectrometry
• Nature (colour, state, modifiers, description, etc.)
• Mass spectrometry
OSCAR Parsing Data [figure labels: H NMR, Nature, HRMS]
OSCAR Parsing Data
OSCAR Data Found: Results from one paper
OSCAR Error Checking [figure labels: Serious Error, Warning Type 1, Warning Type 2]
OSCAR Error Checking
~30 errors / warnings searched for
This article has:
• 4 errors
• 2 warnings (type 1)
• 30 warnings (type 2)
Elemental analysis, incorrect: calculations are for a different molecular formula
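One way such an elemental-analysis check could work is sketched below. This is not OSCAR's implementation: the function names, the small atomic-mass table, and the ±0.4 percentage-point tolerance (a common journal convention, assumed here) are all illustrative. The code recomputes the theoretical percentages from the stated molecular formula and flags reported values that fall outside the tolerance.

```python
import re

# Standard atomic masses for a few common elements (illustrative subset).
ATOMIC_MASS = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999}


def parse_formula(formula):
    """Count atoms in a simple formula such as 'C6H12O6' (no brackets or charges)."""
    counts = {}
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + int(number or 1)
    return counts


def calculated_percent(formula):
    """Theoretical mass percentage of each element in the formula."""
    counts = parse_formula(formula)
    total = sum(ATOMIC_MASS[el] * n for el, n in counts.items())
    return {el: 100.0 * ATOMIC_MASS[el] * n / total for el, n in counts.items()}


def check_analysis(formula, reported, tolerance=0.4):
    """Map each reported element to True if it agrees with the formula."""
    calc = calculated_percent(formula)
    return {el: abs(calc[el] - value) <= tolerance for el, value in reported.items()}


reported = {"C": 40.0, "H": 6.7}
print(check_analysis("C6H12O6", reported))  # values consistent with this formula
print(check_analysis("C6H6", reported))     # values belong to a different formula
```

A mismatch on every element, as in the second call, is the pattern expected when the calculated values were derived from a different molecular formula.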
OSCAR Data Presentation
OSCAR Speed
High throughput, high precision
• A typical paper contains ca. 20 compounds
• JOC (Feb 2004) contains ~600 compounds: OSCAR could extract and tabulate in under 5 minutes
• OBC (Feb 2004) contains ~300 compounds: OSCAR could extract and tabulate in under 3 minutes
OSCAR Accuracy
• 92% of data correctly identified
• 3% incorrect author entry
• 5% missed
• False-positives: 3%
437 items, ~10,000 data fields in test set, working with current Regular Expressions
XML-CML Databases [diagram labels: CompChem, Journals, Theses, CML, Xindice]
• XMLDb can support > 250,000 molecules
• Millisecond retrieval on INChI, properties
Capturing Molecules
Encourage chemists to:
• Autogenerate IUPAC INChI universal identifier
• Embed MDL Mol or ChemDraw files in MS Word
• Autoconvert to CML connection table
Next phase:
• Parse chemical names into CML using modern NLP+
• Learning-machine rather than rule-based
+ Natural Language Processing
NLP & Parsing Names [figure key: Locant; Multiplier; Characteristic Group; Monovalent parent hydride; Heterocyclic parent hydride]
Thank You
Unilever, RSC, Jonathan Goodman, Sam Adams, Fraser Norton, Chris Waudby, Yong Zhang