Making Data Interoperable: PubChem Demo/Use Case
Feb 27, 2019
Evan Bolton, Ph.D.
U.S. National Center for Biotechnology Information (NCBI)
MPS NSF FAIR Hackathon Workshop
PubChem resource
https://pubchem.ncbi.nlm.nih.gov/
• Archive of chemical substances and their biological activities
• Integrated with additional authoritative information about chemicals
• Caters to data scientists through programmatic access to information and machine-readable formats
Expanding available chemical information
More than 30 M chemicals with annotation, and growing in breadth and depth
PubChem Data Sources: https://pubchem.ncbi.nlm.nih.gov/sources/
PubChem Data Sources – what data comes from whom? (provenance): https://pubchem.ncbi.nlm.nih.gov/source/ECHA
PubChem Data Sources – data downloadable
Classification Trees are a form of hierarchical annotation
Classification Browser enables finding annotations of a particular type. Example: give me all “Biologics” in PubChem
PubChem Compound TOC (table of contents): https://pubchem.ncbi.nlm.nih.gov/classification/#hid=72 And many more…
Many annotation information sources: https://pubchem.ncbi.nlm.nih.gov/classification/#hid=72
Many types of information: https://pubchem.ncbi.nlm.nih.gov/classification/#hid=72
Many chemicals, many properties (overall, +500 types of information)
Auto-generated annotation from ChEBI is comparable to human-generated annotation
PubChem data complexity
• Many links between large record collections
– ~245 M Substances <-> ~95 M Compounds
– ~235 M Bioactivities <-> ~5 M Substances
– ~235 M Bioactivities <-> ~3 M Compounds
– ~235 M Bioactivities <-> ~1 M BioAssays
– ~10 M PMIDs <-> ~100 K Compounds
– ~3 M Patents <-> ~30 M Substances
– ~3 M Patents <-> ~20 M Compounds
– …
Overview of PubChemRDF (+137 B triples)
https://pubchemdocs.ncbi.nlm.nih.gov/rdf
Prefix/Namespace | Total number of triples | Total number of subjects
compound | 2,419,866,089 | 96,529,946
2D neighboring links | 94,030,684,717 |
3D neighboring links | 32,373,792,809 |
substance | 1,704,061,017 | 385,048,008
descr | 5,492,982,100 | 2,553,616,642
inchikey | 289,023,744 | 96,242,458
syno | 530,530,209 | 191,760,184
bioassay | 97,309 | 23,150
measuregroup | 241,983,687 | 1,080,039
endpoint | 518,272,329 | 238,095,963
protein | 3,978,203 | 18,093
conserveddomain | 55,528 | 3,346
biosystem | 6,468,345 | 646,258
gene | 3,486,043 | 57,806
reference | 267,136,198 | 12,742,526
source | 2,367 | 561
concept | 30,065 | 6,027
Total number of triples | 137,882,450,759 |
Bridging text/chemicals (PubChem to/from PubMed)
• Matching chemical names to MeSH (Medical Subject Headings) to PubMed
• Contributed content direct from chemistry publishers
– Springer Nature, Thieme assert chemicals found in their publications
– Nature Chemistry, Nature Chemical Biology directly contribute author-based chemical structures
PubMed is used by millions of users a day
PubChem is used by millions of users a month
Summarizing and linking out compound information in PubChem using PubMed records
There are 38 K PubMed records mentioning cyclophosphamide.
• What can we learn from them?
• How?
A PubMed record related to the anti-cancer drug cyclophosphamide:
– Cyclophosphamide is mentioned several times
– Cancers are mentioned several times
– Therapies are mentioned several times
– Cancer-related cells are mentioned several times
– Mouse models are mentioned many times
Steps shown:
• Annotate PubChem records
• Match to PubChem synonyms
• Match terms to other databases
• Find significant frequent entities
• Aggregate/abstract/summarize
Thanks Leonid Zaslavsky!
PubChem Co-occurrence displays
• Using PubMed corpus
• Mine text title/abstract
• Find all chemical mentions
– Requires knowing what chemical names are possible
• Compute histogram and provide top-20
• Evidence clearly stated
– name, PMIDs (downloadable)
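The histogram and top-20 step above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not PubChem's actual pipeline: the function name `top_cooccurrences`, the shape of `records` (PMID mapped to a set of already-matched entity names), and the toy PMIDs are all hypothetical.

```python
from collections import defaultdict


def top_cooccurrences(records, query, top_n=20):
    """Rank entities that share a record with `query`, keeping PMIDs as evidence.

    `records` maps a PMID to the set of entity names found in that record's
    title/abstract (the name-matching step is assumed to have run already).
    """
    evidence = defaultdict(set)
    for pmid, entities in records.items():
        if query not in entities:
            continue
        for name in entities:
            if name != query:
                evidence[name].add(pmid)
    # Sort by number of supporting records (descending), then by name for stability.
    ranked = sorted(evidence.items(), key=lambda item: (-len(item[1]), item[0]))
    return [(name, len(pmids), sorted(pmids)) for name, pmids in ranked[:top_n]]


# Toy corpus: the PMIDs and entity sets are made up for illustration.
toy_records = {
    "PMID-1": {"cyclophosphamide", "breast neoplasms", "doxorubicin"},
    "PMID-2": {"cyclophosphamide", "breast neoplasms"},
    "PMID-3": {"cyclophosphamide", "lymphoma"},
    "PMID-4": {"sucrose", "dental caries"},
}

for name, count, pmids in top_cooccurrences(toy_records, "cyclophosphamide"):
    print(name, count, pmids)
```

Keeping the supporting PMIDs next to each count is what lets a display state its evidence and offer it for download.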
Chemical to Disease or Gene/Protein (provides rapid summary of entities along with evidence). We call these knowledge panels; more than 100 K chemicals have such displays.
PubChem helps make chemical content findable
• Chemical structures
– Structure search
• Chemical names
– Sucrose, sugar
• InChIKey
– Computable identifier
– Google-able chemical structure
Example (sucrose): CZMRCDWAGMRECN-UGDNZRGBSA-N
‘Standards’ heavily leveraged in chemistry
• Nomenclature
– Chemical naming
• Used in software for structure-to-name and name-to-structure conversion
– Biological line notations (with IUBMB) for saccharides, proteins (i.e., large molecules)
– Terminology
• Color books
Example (sucrose): beta-D-arabino-hex-2-ulofuranosyl alpha-D-gluco-hexopyranoside; Glc(a1-2b)Fruf
Standards examples
• InChI – International Chemical Identifier
• Standard
– Initially created by NIST
– Under auspices of IUPAC (2004)
– Open source, non-proprietary
– Layered design
• Algorithm
– Normalizes chemical representation
– Includes layered ‘hashed’ form called an InChIKey (fixed length)
Example (sucrose):
InChI=1S/C12H22O11/c13-1-4-6(16)8(18)9(19)11(21-4)23-12(3-15)10(20)7(17)5(2-14)22-12/h4-11,13-20H,1-3H2/t4-,5-,6-,7-,8+,9-,10+,11-,12+/m1/s1
InChIKey: CZMRCDWAGMRECN-UGDNZRGBSA-N
Layers: Version, Type, Chemical formula, Connectivity, Charge & Proton, Stereochemical, Other (e.g., Isotopic)
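The layered design can be made concrete by splitting the sucrose InChI on its "/" separators. The sketch below is a simplification, not the official InChI software: the helper name `split_inchi` is hypothetical, it assumes a formula layer is present, and it assumes each layer prefix occurs at most once (which holds for this example but not for every InChI).

```python
SUCROSE_INCHI = (
    "InChI=1S/C12H22O11/c13-1-4-6(16)8(18)9(19)11(21-4)23-12(3-15)"
    "10(20)7(17)5(2-14)22-12/h4-11,13-20H,1-3H2/"
    "t4-,5-,6-,7-,8+,9-,10+,11-,12+/m1/s1"
)


def split_inchi(inchi):
    """Split an InChI into (version/type, formula, {layer prefix: layer body})."""
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("not an InChI string")
    parts = inchi[len(prefix):].split("/")
    version = parts[0]                            # e.g. "1S": version 1, standard
    formula = parts[1] if len(parts) > 1 else ""  # chemical formula layer
    # Remaining layers start with a one-letter prefix: c, h, t, m, s, ...
    layers = {segment[0]: segment[1:] for segment in parts[2:] if segment}
    return version, formula, layers


version, formula, layers = split_inchi(SUCROSE_INCHI)
print(version, formula)
for key, body in layers.items():
    print(key, body)
```

Here "c" is the connectivity layer, "h" the hydrogen layer, and "t", "m", "s" carry the stereochemical information.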
Standards still being developed: Reactions
RInChI multi-component system notation
Esterification of acetic acid with ethanol to ethyl acetate and water, catalyzed by sulfuric acid:
RInChI=0.03.1S/C2H4O2/c1-2(3)4/h1H3,(H,3,4)!C2H6O/c1-2-3/h3H,2H2,1H3<>C4H8O2/c1-3-6-4(2)5/h3H2,1-2H3!H2O/h1H2<>H2O4S/c1-5(2,3)4/h(H2,1,2,3,4)/d+
• "<>" separates reactants, products, and agents (= catalysts, solvents, etc.)
• "!" separates components within these groups
• alphabetical order of components within groups
• /d+ layer describes the direction of the reaction
• ("RInChI=0.03.1S" version identifier)
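The separators listed above are enough to take the example string apart. The following sketch handles only the simple shape shown on this slide; the helper name `split_rinchi` is hypothetical, and layers not shown here are not handled.

```python
RINCHI = (
    "RInChI=0.03.1S/C2H4O2/c1-2(3)4/h1H3,(H,3,4)!C2H6O/c1-2-3/h3H,2H2,1H3"
    "<>C4H8O2/c1-3-6-4(2)5/h3H2,1-2H3!H2O/h1H2"
    "<>H2O4S/c1-5(2,3)4/h(H2,1,2,3,4)/d+"
)


def split_rinchi(rinchi):
    """Return (version, groups, direction) for a simple RInChI string.

    `groups` is a list of component lists: "<>" separates the groups and
    "!" separates the components inside each group.
    """
    prefix = "RInChI="
    if not rinchi.startswith(prefix):
        raise ValueError("not a RInChI string")
    version, _, body = rinchi[len(prefix):].partition("/")
    direction = None
    for mark in ("+", "-", "="):
        suffix = "/d" + mark
        if body.endswith(suffix):      # trailing direction layer, e.g. "/d+"
            direction = mark
            body = body[: -len(suffix)]
            break
    groups = [[c for c in group.split("!") if c] for group in body.split("<>")]
    return version, groups, direction


version, groups, direction = split_rinchi(RINCHI)
print(version, direction)
for group in groups:
    print(group)
```

For this example the three groups come out as acetic acid and ethanol, then ethyl acetate and water, then sulfuric acid.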
Standards still being developed: Mixtures
MInChI example w/ concentration range notation
37% wt. Formaldehyde in Water with 10-15% Methanol:
MInChI=0.0S/CH2O/c1-2/h1H2&CH4O/c1-2/h2H,1H3&H2O/h1H2/n{1&2&3}/g{37wf-2&10-15vf-2&}
• alphabetical order of components
• "&" separates components
• "{}" denotes mixture groups
• "/n" layer indexes components (e.g., order)
• "/g" layer notates concentration (symbols detailed separately)
• ("MInChI=0.0S" version identifier)
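As a rough illustration of how the "&", "/n" and "/g" pieces line up, the sketch below splits the example string as transcribed on this slide. It assumes a single flat mixture group with components indexed in order; nested "{}" groups are not handled, the concentration entries are kept as opaque strings, and the helper name `split_minchi` is hypothetical.

```python
MINCHI = (
    "MInChI=0.0S/CH2O/c1-2/h1H2&CH4O/c1-2/h2H,1H3&H2O/h1H2"
    "/n{1&2&3}/g{37wf-2&10-15vf-2&}"
)


def split_minchi(minchi):
    """Return (version, components, index_layer, concentrations)."""
    prefix = "MInChI="
    if not minchi.startswith(prefix):
        raise ValueError("not a MInChI string")
    version, _, body = minchi[len(prefix):].partition("/")
    structures, _, tail = body.partition("/n")       # components come before /n
    index_layer, _, conc_layer = tail.partition("/g")
    components = structures.split("&")               # "&" separates components
    concentrations = conc_layer.strip("{}").split("&")
    return version, components, index_layer, concentrations


version, components, index_layer, concentrations = split_minchi(MINCHI)
for component, amount in zip(components, concentrations):
    print(component, "->", amount or "(not specified)")
```

In this example the third concentration slot is empty, which the sketch reports as not specified for water.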
Example use case for InChIKey: Quickly map all your chemical records to a resource
• PubChem FTP site provides a complete tab-delimited file containing all +95 M chemicals
– PubChem Compound Identifier (CID), InChIKey
– ftp://ftp.ncbi.nlm.nih.gov/pubchem/Compound/Extras/CID-InChI-Key.gz
• Compute InChI/Key for your records
• Compare InChI/Key strings (or DB join)
InChI/Key == Your chemical collection
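The compare-or-join step can be sketched as a single streaming pass over the tab-delimited file. This is a minimal sketch, assuming the CID is the first column and the InChIKey the last column of each row; check the downloaded file before relying on that. The function name `map_keys_to_cids` is hypothetical, and the two sample rows stand in for the real file.

```python
import io


def map_keys_to_cids(lines, wanted_keys):
    """Return {InChIKey: [CID, ...]} for the keys found in the mapping file."""
    wanted = set(wanted_keys)
    hits = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue                       # skip blank or malformed rows
        cid, key = fields[0], fields[-1]   # assumed column layout
        if key in wanted:
            hits.setdefault(key, []).append(int(cid))
    return hits


# Two sample rows in the assumed CID <tab> InChI <tab> InChIKey layout.
sample_file = io.StringIO(
    "702\tInChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3\tLFQSCWFLJHTTHZ-UHFFFAOYSA-N\n"
    "5988\tInChI=1S/C12H22O11/...\tCZMRCDWAGMRECN-UGDNZRGBSA-N\n"
)

my_keys = ["CZMRCDWAGMRECN-UGDNZRGBSA-N", "AAAAAAAAAAAAAA-UHFFFAOYSA-N"]
matches = map_keys_to_cids(sample_file, my_keys)
print(matches)
print("unmatched:", [k for k in my_keys if k not in matches])
```

For the real file, the same function can be fed from `gzip.open(path, "rt")`, so the +95 M rows are read one at a time rather than loaded into memory.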
The world is changing rapidly
• Digital assistants
– Siri, Cortana, Alexa, …
• We want the computer to help us… therefore, the computer must understand
Image credit (cropped): http://i2.cdn.turner.com/money/dam/assets/150728115916-smart-assistants-figures-780x439.jpg
How do you teach a computer chemistry?
Image credits: https://whatsthebigdata.files.wordpress.com/2016/10/ai_data-science-diagram.jpg?w=640 and https://media-exp2.licdn.com/mpr/AAEAAQAAAAWIAAAAJGM1MGM3NGYwLWU5ZWYtNDkzYS04ZmExLTJjZWQ1ZWQxYzNiYw.png
For computers to understand chemistry they need our help, but our workflows are for humans
Read papers / Do science / Search papers / Publish papers
Computers help humans at every step. The better they understand what we do and how we do it, the better they can help us do just about everything.
Image credits: http://www.how-to-draw-funny-cartoons.com/cartoon-scientist.html and http://computertutorinc.net/computer-maintenance-safety-tips/
Scientific information (PDF, text, tables, presentations, images, HTML, schemes, data, figures): human understanding
Scientific information (PDF, text, tables, presentations, images, HTML, schemes, data, figures): computers just don’t get it
Image credit: https://static.spiceworks.com/attachments/cms/0000/2161/sad-computer.png
What is “mipa”?
• Chemical – MIPA: monoisopropyl amine
• Gene – MIPA: MltA-interacting protein
• Measurement – MIPA: Multivariate inference of pathway activity
• Organization – MIPA: Massachusetts Independent Pharmacists Association
• Company – MIPA: MIPA SE or MIPA AG
• Action – MIPA: Missile Procurement, Army
• Award – MIPA: Musikmesse International Press Award
• …
Chemical (structure) information is rather troublesome...
A chemical structure may be represented in many different ways. Tautomers and resonance forms of the same chemical structure are prolific. Solved: use InChI/Key.
A chemical structure may be represented in many different ways. Salt-form drawing variations are common. We know how to handle this: normalize structural forms.
What do you mean by “sodium acetate”? Hmm… need to understand context (aqueous or solid?). Chemical meaning of a substance may change with context.
Structure/Name: many-to-many relationships
• Carbon: Element? Coal? Diamond? Methane?
• Formaldehyde: Gas or liquid or polymer? Liquid: flammable or inflammable?
• Gleevec: Salt? Hydrate? Free base?
Diagram labels: Structure, Concept
How to represent? This is a really tough problem; we are working on it.
Metals and Oxygen
• How to interpret a Metal-Oxygen (M–O) bond?
– Metal oxide?
– Metal hydrate?
• Missing hydrogens, use of covalent bonds, no metal bonds, missing 3-D structure
– High-spin, low-spin?
Examples shown: iron hydroxide, iron oxide
Even elements can be troublesome
• Chemical diagrams often use abbreviations that can be mistaken for something else (Mt = Meitnerium)
Even elements can be troublesome
• 99mTc is a metastable form of 99Tc used to image cancer
Image credit: http://th.physik.uni-frankfurt.de/~scherer/Blogging/Tc99/decays_scheme.png
Image credit: http://www.auntminnie.com/user/images/content_images/sup_mol/2013_06_11_17_05_49_101_snmmi_prostate_mip1404_450.jpg
Chemical toolkits don’t behave the same for the same structure
With whom has this structure been? (ChemDraw, Open Babel, Biovia, OpenEye, ChemAxon)
• Different structure packages support structural information content to different extents
• Different software packages may ‘normalize’ your chemical structure in different ways
Where has this structure been? (PDB, CDX, SDF, SMILES, InChI)
• Export between file format flavors can result in reinterpretation of structure (2-D coordinates, atom types, bond types, etc.)
• Chemical file format interconversion can be lossy
Free flow of information (Download, Upload, Interpret, Normalize)
• Algorithms interpret information
• Can introduce ambiguity, errors
• Later use can mislead
• Now cycle this many times
Chemical structure information can (irreversibly) change when exchanging between file formats and software packages
• Computed IUPAC InChI (and IUPAC systematic name) may differ after data exchange
• Scientists are often unaware of what can occur during data exchange
– Implicit (presentation) vs. explicit (machine interpretable) information
– Loss of information (e.g., coordinates, relative stereo) when using different formats
• Software can help correct or warn the scientist of issues
– Ambiguous stereo centers or missing explicit parity information
– Tautomeric/resonance systems containing stereo centers/bonds
• Lack of agreed processing rules between software packages and publicly accessible databases
– Same input can produce different output (still the same chemical structure?)
– Proliferation of different structure variations of the same chemical with different InChI
• Free flow of chemical information makes establishment of best practices, adherence to standards, and scientist education of utmost importance
The chemical information divide
Human understanding: depictions, schemes, tables, text, human intent
Computer understanding: explicit, complete, annotated, interpreted
Image credit: https://hcldr.files.wordpress.com/2016/12/1-rr-0116-industry-divided.jpg
Evidence-based approach
• Does an entity or link between entities have any ‘proof’?
– Was it mentioned in a paper or patent?
– Is there a data set about it?
– Is there a terminology that references it?
– Did biocurators curate it?
Entities shown: Chemical (compound, substance, …), Publication, Target (gene, protein, …), Disease (phenotype, syndrome, …)
Getting back to the original information: Acetone, Boiling Point, Document (DOI, …)
Find any evidence that an entity exists
Evidence for an entity: PMID, DOI, Ontology / Terminology Identifier, Patent Identifier
• Entity is anything we care about: chemical, target, disease, … (100s of millions)
• Gather evidence/proof entity exists
– Use available text corpus
– Examine all ontologies
– Scope dictated by entity types and use cases
Find evidence that a link between entities exists
Evidence for an entity-entity link: DOI, PMID, Ontology / Terminology Identifier, Patent Identifier
• Entity links are any reported links (100s of billions)
• Gather evidence/proof entity link exists
– Biocuration collections
– Mentioned
– Determine context
• What about the two entities?
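One way to picture the evidence-based approach is a link record that carries its own proof. The sketch below is purely illustrative: the class names `Evidence` and `EntityLink`, their fields, and the placeholder identifier are assumptions, not a PubChem data model.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Evidence:
    kind: str        # e.g. "PMID", "DOI", "Patent", "Ontology"
    identifier: str  # the identifier within that kind


@dataclass
class EntityLink:
    source: str                                 # e.g. a chemical name or ID
    target: str                                 # e.g. a disease, gene, or protein
    evidence: set = field(default_factory=set)  # set of Evidence items

    def add(self, kind, identifier):
        self.evidence.add(Evidence(kind, identifier))

    def has_proof(self):
        """A link counts only if at least one piece of evidence backs it."""
        return bool(self.evidence)


link = EntityLink("cyclophosphamide", "breast neoplasms")
print(link.has_proof())           # no evidence attached yet
link.add("PMID", "PMID-EXAMPLE")  # placeholder identifier, not a real record
print(link.has_proof())
```

Storing evidence as a set of (kind, identifier) pairs means the same paper or patent is counted once, however many times it is encountered.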
Some large document (meta) data sets (can we harness these effectively?)
• Biomedical
– PubMed (NLM) – 30 M
– and EPMC (EBI)
• Agricultural
– Agricola (NAL) – 10 M
• Scientific data
– DataCite – 20 M
• General Science
– CrossRef – 100 M
– SciGraph (Springer Nature) – 20 M
• Patent publications
– USPTO – 10 M
– EPO, JPO, CPO, WIPO …
Chasing windmills?
Annotating and FAIR-ifying scientific content can be difficult to navigate:
• Identifiers, licensing/IP, standards, terminologies, normalization, best practices, machine accessibility, scientist education…
• What you can do today may be different from tomorrow
• Everything is a work in progress
Image credit: https://networkingnerd.files.wordpress.com/2012/04/donquixote-windmill.jpg
PubChem Crew …
Evan Bolton, Jie Chen, Qingliang “Leon” Li, Tiejun Cheng, Ben Shoemaker, Asta Gindulyte, Paul Thiessen, Jane He, Bo Yu, Siqian He, Leonid Zaslavsky, Sunghwan Kim, Jian “Jeff” Zhang
Special thanks to the NCBI Help Desk, especially Rana Morris, and past PubChem group members
Special thanks
• Software collaborators
– NextMove Software (Roger Sayle, Daniel Lowe, Noel O’Boyle, John May)
– Xemistry GmbH (Wolf D. Ihlenfeldt)
– OpenEye Scientific Software
• Chemical Health and Safety collaborators
– Especially: Leah McEwen (Cornell U.), Ralph Stuart (Keene State College)
• PubChemRDF Collaborators
• BioHackathon (2014-2018)
• All PubChem Contributors and Collaborators
This research was supported by the Intramural Research Program of the NIH, National Library of Medicine.
Have any questions?