Computational Challenges in Metabolomics (Part 1)
David Wishart, University of Alberta
Dagstuhl Seminar on Computational Mass Spectrometry
Schloss Dagstuhl, Germany, Aug. 23-28, 2015
The Pyramid of Life
[Figure: pyramid with levels, in the order extracted: Metabolomics (Metabolome), Proteomics (Proteome), Genomics (Genome); side labels: Environmental Influence, Physiological Influence]
Why Small Molecules Count
• 100% of all agricultural products (herbicides, pesticides, fertilizers) are small molecules
• >99% of all compounds that give food or drinks their aroma, color and taste are small molecules
• 91% of all known drugs are small molecules
• >85% of all common clinical assays test for small molecules
• 60% of all drugs are derived from pre-existing metabolites
• 10-15% of identified genetic disorders involve diseases of small molecule metabolism
Proteomics vs. Metabolomics
Proteomics:
• Very MS or MS/MS oriented
• Good separation is critical
• Generates lots of raw data
• Peptide and protein ID
• Isotopic labeling (ICAT) helps
• Possible to derive 3D structure
• Permits protein imaging
• Very dependent on databases
• Spectral processing and deconvolution is challenging
• Quantitation is challenging
• Data analysis requires MV stats
• Data integration is challenging
• Better software is needed
Metabolomics:
• Very MS or MS/MS oriented
• Good separation is critical
• Generates lots of raw data
• Chemical ID
• Isotopic labeling (SIL) helps
• Possible to derive 3D structure
• Permits metabolite imaging
• Very dependent on databases
• Spectral processing and deconvolution is challenging
• Quantitation is challenging
• Data analysis requires MV stats
• Data integration is challenging
• Better software is needed
Proteomics Workflow
[Figure: workflow diagram; labels: Biofluid/Extracts, HPLC or PAGE, Tryptic Digest, MALDI plate, MS analysis, Mass Fingerprint, Protein ID]
Protein ID by PMF-MS
Metabolomics Workflow
[Figure: workflow diagram; labels: Biological or Tissue Samples, Extraction, Biofluids or Extracts, LC-MS or GC-MS, LC/GC-MS Spectra, Compound ID]
Compound ID by GC/LC-MS
[Figure: LC/GC-MS total ion chromatogram]
Proteomics vs. Metabolomics
Proteomics:
• Polymers of 20 amino acids (chemically similar)
• 185 million sequences (from DNA sequencing)
• Sequence defines MS & MS/MS spectra
• Trypsin gives definable cleavages
• MS alone can ID proteins (PMF)
• MS/MS fragmentation at 1 fixed energy
• MS/MS fragmentation is easily predictable and very distinct
• 30 common PTMs
• PTMs are somewhat predictable
Metabolomics:
• 1000s of distinct chemical classes (chemically diverse)
• No information from DNA sequencing
• Structure defines MS & MS/MS spectra (adducts, fragments)
• No trypsin for small molecules (CID only)
• MS alone cannot ID metabolites
• Different energies for different molecules
• MS/MS & EI-MS fragments not easily predictable, often similar
• >400 PTMs via metabolism
• PTMs are hard to predict
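The point that MS alone cannot ID metabolites can be illustrated with a minimal sketch: isomers share a molecular formula, so an accurate-mass lookup returns several candidates. The four-compound candidate list, the function names and the 5 ppm tolerance are assumptions chosen for illustration, and neutral masses are used (adducts are ignored).

```python
# Monoisotopic masses (Da) of the most abundant isotope of each element.
MONOISOTOPIC = {"C": 12.0, "H": 1.00782503, "N": 14.0030740, "O": 15.9949146}

# Illustrative candidate list (an assumption, not a real database): element counts.
CANDIDATES = {
    "leucine": {"C": 6, "H": 13, "N": 1, "O": 2},
    "isoleucine": {"C": 6, "H": 13, "N": 1, "O": 2},
    "glucose": {"C": 6, "H": 12, "O": 6},
    "fructose": {"C": 6, "H": 12, "O": 6},
}


def monoisotopic_mass(formula):
    """Neutral monoisotopic mass of a formula given as {element: count}."""
    return sum(MONOISOTOPIC[element] * count for element, count in formula.items())


def match_by_mass(observed_mass, tolerance_ppm=5.0):
    """Return every candidate whose mass lies within tolerance_ppm of observed_mass."""
    hits = []
    for name, formula in CANDIDATES.items():
        mass = monoisotopic_mass(formula)
        if abs(mass - observed_mass) / mass * 1e6 <= tolerance_ppm:
            hits.append(name)
    return sorted(hits)


# Both hexoses match the same observed mass; the mass alone cannot separate them.
print(match_by_mass(180.0634))
```

Separating such candidates needs extra evidence such as MS/MS fragments or retention time.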
Challenges for Metabolomics
• Most MS-based metabolomics studies ID <100 cmpds (<1% of the known metabolome)
• Metabolite ID requires accurate, referential MS/MS or EI-MS spectra and/or RT information
• Limited experimental MS/MS, EI-MS & RT data
• The chemical space of most metabolomes is not fully known (perhaps >5 million compounds total)
• <1% of the chemicals in PubChem are relevant to metabolomics
• Metabolomics needs specialized compound and spectral (MS/MS, EI-MS, NMR) databases
• Metabolomics needs computational tools to predict biologically viable metabolites and their spectra
LC-MS Spectral DBs
• MoNA – 236,604 spectra, 69,946 cmpds** (12,000)
• METLIN – 68,124 spectra, 13,048 cmpds
• mzCloud – 422,349 spectra, 2975 cmpds
• NIST 14 MS/MS – 234,284 spectra, 9344 cmpds
• MassBank – 28,185 spectra, 11,500 cmpds
• Wiley LC-MSn – >10,000 spectra, 4500 poisons
• ReSpect – 9107 spectra, 3595 cmpds
• GNPS – 9000 spectra, 4200 natural products
Total #compounds with exp. MS/MS spectra ~20,000
Less than 60% are biologically relevant
How to Get Missing Spectra?
• Obtain or synthesize all biologically relevant molecules (metabolites, HPVs, drugs, pollutants, foods, etc.), prepare or synthesize their metabolites and collect their NMR, LC-MS and GC-MS spectra
  COST - 5,000,000 cmpds x $1000/cmpd = $5 billion
• OR
• Do this entire exercise computationally
  COST - 5,000,000 cmpds x $0.10/cmpd = $500,000
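The cost comparison can be checked with a few lines of arithmetic. This sketch assumes 5,000,000 compounds, the count implied by the slide's stated totals ($5 billion at $1000 per compound, $500,000 at $0.10 per compound). Costs are held in integer cents to avoid floating-point rounding, and the variable names are illustrative.

```python
# Assumed compound count, implied by the slide's stated totals.
N_COMPOUNDS = 5_000_000

# Per-compound costs in cents (integers avoid floating-point rounding).
EXPERIMENTAL_CENTS_PER_COMPOUND = 100_000   # $1000 per compound
COMPUTATIONAL_CENTS_PER_COMPOUND = 10       # $0.10 per compound

experimental_total_usd = N_COMPOUNDS * EXPERIMENTAL_CENTS_PER_COMPOUND // 100
computational_total_usd = N_COMPOUNDS * COMPUTATIONAL_CENTS_PER_COMPOUND // 100
cost_ratio = experimental_total_usd // computational_total_usd

print(f"Experimental:  ${experimental_total_usd:,}")
print(f"Computational: ${computational_total_usd:,}")
print(f"Ratio: {cost_ratio:,}x")
```

Under these assumptions the computational route is four orders of magnitude cheaper, before accounting for any loss of accuracy in predicted spectra.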
Computational Metabolomics
• Known biomolecules (50,000)
• Predicted biotransformations (50,000 --> 5,000)
• Predicted MS/MS, NMR, GC-MS spectra of knowns + biotransformed
• Match observed spectra to predicted spectra to ID compounds
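The step "match observed spectra to predicted spectra" could be sketched as a similarity ranking. The example below uses cosine similarity on binned peak lists, which is one common choice and not necessarily the scoring the speaker has in mind. The peak lists, candidate names, function names and bin width are all hypothetical.

```python
import math


def bin_spectrum(peaks, bin_width=0.01):
    """Sum intensities of (m/z, intensity) peaks into fixed-width m/z bins."""
    binned = {}
    for mz, intensity in peaks:
        key = round(mz / bin_width)
        binned[key] = binned.get(key, 0.0) + intensity
    return binned


def cosine_similarity(spec_a, spec_b, bin_width=0.01):
    """Cosine similarity between two peak lists; 0.0 if either is empty."""
    a = bin_spectrum(spec_a, bin_width)
    b = bin_spectrum(spec_b, bin_width)
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(observed, predicted_library):
    """Sort candidate names by similarity to the observed spectrum, best first."""
    scores = [(name, cosine_similarity(observed, spectrum))
              for name, spectrum in predicted_library.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Hypothetical data: one observed spectrum and two predicted spectra.
observed = [(50.0, 20.0), (75.1, 100.0), (120.2, 45.0)]
predicted_library = {
    "candidate_A": [(50.0, 25.0), (75.1, 100.0), (120.2, 40.0)],
    "candidate_B": [(60.3, 100.0), (120.2, 30.0)],
}

ranking = rank_candidates(observed, predicted_library)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Fixed-width binning is the simplest way to align peaks. A real matcher would likely use a ppm-based tolerance and account for the precursor mass.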
The Human Metabolome Database (HMDB)
http://www.hmdb.ca
• A web-accessible resource containing detailed information on 41,993 “quantified”, “detected” and “expected” metabolites
• Data mined from the literature and other eDBs
• 100’s of drug metabolites
• 1000’s of xenobiotics
• >10,000 reference spectra
• Supports sequence, spectral, structure and text searches as well as compound browsing
• Full data downloads
The Drug Database (DrugBank v. 4.3)
http://www.drugbank.ca
• 1602 small molecule drugs
• >5000 experimental drugs
• Data mined from the literature and other eDBs
• >1000 drugs with metabolizing enzyme data
• >1200 drug metabolites
• >600 MS+NMR spectra
• >4200 unique drug targets
• 208 data fields/drug
• Supports sequence, spectral, structure and text searches as well as compound browsing
• Full data downloads
The Toxic Exposome Database (T3DB)
http://www.t3db.ca
• Comprehensive data on toxic compounds (drugs, pesticides, herbicides, endocrine disruptors, solvents, carcinogens, etc.)
• Data mined from the literature and other eDBs
• >3600 toxic compounds
• >1900 reference spectra
• ~2100 toxic targets
• Supports sequence, spectral, structure and text searches as well as compound browsing
• Full data downloads
Secondary Metabolism
[Figure: chemical structures labeled Diazepam, Nordazepam, Temazepam, Oxazepam and N-(2-Benzoyl-4-chlorophenyl)-2-acetamidoacetamide]
BioTransformer
BioTransformer - Flowchart
[Figure: flowchart; box labels as extracted: Query Molecule; Phase I; Other Reactions; Enzyme metabolite? (Machine Learning), YES/NO; SOM Predictor (Machine Learning), YES/NO; SOMs; Reaction-specific structural constraints; Metabolite Generator; Metabolites; No metabolites]
All structures are generated as SMILES, SDF or MOL files
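One possible reading of the flowchart's decision path, written as a control-flow sketch. The ordering of the steps is an interpretation of the extracted box labels, and "Enzyme metabolite?" is read here as "is the query a substrate of the enzyme". None of the function names below come from BioTransformer itself, and the three predictors are passed in as placeholder callables rather than real models.

```python
from typing import Callable, List


def predict_metabolites(
    query_smiles: str,
    is_enzyme_substrate: Callable[[str], bool],
    predict_soms: Callable[[str], List[int]],
    generate_metabolites: Callable[[str, List[int]], List[str]],
) -> List[str]:
    """Follow the assumed flowchart path for one query molecule.

    1. A classifier decides whether the enzyme acts on the molecule.
    2. If so, a predictor proposes sites of metabolism (SOMs) as atom indices.
    3. If any SOMs are found, a generator applies reaction rules at those sites.
    Either "NO" branch ends with no metabolites.
    """
    if not is_enzyme_substrate(query_smiles):
        return []
    soms = predict_soms(query_smiles)
    if not soms:
        return []
    return generate_metabolites(query_smiles, soms)


# Placeholder stubs (hypothetical; they stand in for trained models and rules).
def stub_substrate_check(smiles: str) -> bool:
    return len(smiles) > 0


def stub_som_predictor(smiles: str) -> List[int]:
    return [0]


def stub_generator(smiles: str, soms: List[int]) -> List[str]:
    # Returns labels, not real structures, to keep the sketch chemistry-free.
    return [f"{smiles}|modified_at_atom={index}" for index in soms]


result = predict_metabolites("CCO", stub_substrate_check, stub_som_predictor, stub_generator)
print(result)
```

Passing the predictors in as callables keeps the control flow separate from the models, so each box in the diagram could be swapped independently.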
BioTransformer – SOM Prediction
• Preference learning based on 100 atomic (e.g. atom type) and 10 molecular features (e.g. mass)
• SOM predictor was trained for 9 CYP450s
• Average prediction accuracy of 84.54%
• Structures generated based on 92 Phase I reactions
BioTransformer Results
• 6,230 Phase I metabolites
• 9,510 Phase II metabolites
• 6,110 Microbial metabolites
• 12,340 Other metabolites
5,000 compounds, 34,000 metabolites, ~220,000
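A quick arithmetic check of the result counts, assuming the four category counts are meant to add up to the roughly 34,000 metabolites quoted and that 5,000 is the number of input compounds. The variable names are illustrative.

```python
# Predicted metabolite counts per category, as listed on the slide.
predicted_counts = {
    "Phase I": 6230,
    "Phase II": 9510,
    "Microbial": 6110,
    "Other": 12340,
}

# Assumed number of input compounds.
n_input_compounds = 5000

total_metabolites = sum(predicted_counts.values())
metabolites_per_compound = total_metabolites / n_input_compounds

print(f"Total predicted metabolites: {total_metabolites:,}")
print(f"Average per input compound: {metabolites_per_compound:.1f}")
```

Under these assumptions the categories sum to 34,190, consistent with the rounded figure of 34,000.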
Computational Challenges in Metabolomics (Part 2)
Sebastian Böcker, Friedrich Schiller University
Dagstuhl Seminar on Computational Mass Spectrometry
Schloss Dagstuhl, Germany, Aug. 23-28, 2015