SureChEMBL webinar • George Papadatos, Mark Davies • surechembl-help@ebi.ac.uk
Outline • SureChEMBL • Coverage and content • Capabilities • Future plans • Interface demo
ChEMBL: Data for drug discovery 1. Scientific facts 2. Organization, integration, curation and standardization of pharmacology data 3. Insight, tools and resources for translational drug discovery [Figure: example scientific facts: assay/target (thrombin protein sequence), compound, and bioactivity data (Ki = 4.5 nM, APTT = 11 min)]
Why is searching chemical patents useful? • Infringement search to avoid areas of valid patent protection (freedom to operate) • Search for industrial profiles and research directions (competitive intelligence) • State-of-the-art/novelty/prior art search • Search for citations and key references • Most of the knowledge in chemical patents will never appear anywhere else • Average time lag between patent and journal: 3 years • Compounds, scaffolds, reactions • Biological targets, diseases, indications
SureChEMBL
SureChem becomes SureChEMBL • In December 2013, EMBL-EBI acquired SureChem, a leading chemistry patent mining product, from Digital Science, Macmillan Group • SureChem was not aligned with Digital Science's core future academic business • Existing SureChem user base: free (SureChemOpen) and paying (SureChemPro + API) • EMBL-EBI supported existing licensees during the transition • EMBL-EBI provides an ongoing, free and open resource to the entire community • Rebranded as SureChEMBL
SureChEMBL data pipeline [Figure: patent offices (WO; EP applications and granted; US applications and granted; JP abstracts) supply patents processed by IFI Claims; the SureChEMBL system applies OCR, entity recognition, name-to-structure conversion (five methods) and image-to-structure conversion (one method); results are stored in the chemistry database (SureChem IP) and reach users through the application server and API, with patent PDFs provided as a service. Example extracted name: 1-[4-ethoxy-3-(6,7-dihydro-1-methyl-7-oxo-3-propyl-1H-pyrazolo[4,3-d]pyrimidin-5-yl)phenylsulfonyl]-4-methylpiperazine]
SureChEMBL chemistry data coverage • Structures from text: 1976 onwards • Title, abstract, claims, description • IUPAC, trivial, drug names, etc. • SureChemical Entity Recognition proprietary algorithm • ACD/Labs, ChemAxon, OpenEye, OPSIN, PerkinElmer name-to-structure conversion • Structures from images: 2007 onwards • CLiDE image-to-structure conversion • USPTO has offered ‘Complex Work Units’ (CWUs) since 2001 • CWU file types include MOL and CDX • CWUs processed as part of the pipeline: 2007 onwards
SureChEMBL data content (11/03/2015) • 16,261,347 unique compounds • 13,274,991 chemically annotated patents • ~80,000 novel compounds extracted from ~50,000 new patents monthly • 2–7 days for a published patent to be chemically annotated and searchable in SureChEMBL • SureChEMBL provides search access to all patents (not just chemically annotated ones) • ~120 M patents
EMBL-EBI chemistry resources (RDF and REST API interfaces) • Atlas: ligand-induced transcript response (750) • PDBe: ligand structures from structurally defined protein complexes (15 K) • ChEBI: nomenclature of primary and secondary metabolites; chemical ontology (24 K) • ChEMBL: bioactivity data from literature and depositions (1.5 M) • SureChEMBL: chemical structures from patent literature (~16 M) • 3rd party data: ZINC, PubChem, ThomsonPharma DOTF, IUPHAR, DrugBank, KEGG, NIH NCC, eMolecules, FDA SRS, PharmGKB, Selleck, … (~60 M) • UniChem: InChI-based chemical resolver (full + relaxed ‘lenses’), >80 M; REST API interface: https://www.ebi.ac.uk/unichem/
SureChEMBL data access I • UniChem (“Universal Compound Resolver”) • Weekly updates • Web service lookup • Connectivity search • https://www.ebi.ac.uk/unichem/ • FTP download • Quarterly updates • All SureChEMBL compounds in SDF and CSV format • Raw data • ftp://ftp.ebi.ac.uk/pub/databases/chembl/SureChEMBL/
SureChEMBL data access II • PubChem • SureChEMBL data source • Quarterly updates • Data feed client • Creates a local replica of SureChEMBL • Updates daily • http://vartree.blogspot.co.uk/2015/01/how-to-create-your-own-replica-of.html
Can we have everything? • Cost • Quality • Time
Common sources of errors • Small, poor-quality images • OCR errors in names (OCR is done by IFI). There is an OCR correction step, but it cannot fix all errors, e.g. ‘2,6-Difluoro-Λ/-{1-r(4-iodo-2-methylphenyl)methvn-1H-pyrazol-3vDbenzamide’ • Reliability is better for US patents due to the inclusion of MOL files
Bioactivity data extraction? • Compounds • Target/Assay • Bioactivity
Markush structure extraction? • -alkyl • -aryl • -heterocyclyl • -cycloalkyl • …
Future plans • Full compound-patent map • Flat file FTP download • Coming in March • Regular updates • Also available in UniChem • OpenPHACTS ENSO project • Biological entity extraction and annotation • Proteins, genes and diseases • Ontology mapping and semantic integration
SureChEMBL Interface
Homepage • Search by keyword • Search by patent number • Chemical search type filter (substructure, similarity, identical) • Search by chemical structure (sketch compound) • Help • Filter by authority (US, EP, WO and JP) • Filter by date • Filter by MW • Search by SMILES, MOL, SMARTS, name • Filter by document section (title, claims, abstract, description and images)
Keyword-based search • Uses Boolean operators and Lucene query fields • Example searches… • roche OR novartis • sterili?e • kinase* • pfizer C07D "kinase inhibitor" • pn:WO2011058149A1 • pa:(bayer OR genentech OR merck) AND desc:(chemotherap* AND ("phosphoinositide kinase"~0.8 OR Pi3K))
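The last fielded example can also be assembled programmatically. The following is a minimal Python sketch; the helper names (`phrase`, `any_of`, `all_of`, `fielded`, `build_example_query`) are illustrative assumptions and not part of SureChEMBL. It only builds the query string for pasting into the keyword search box; it does not submit anything.

```python
def phrase(text: str, suffix: str = "") -> str:
    """Quote a multi-word phrase; an optional suffix such as ~0.8 is appended as-is."""
    return f'"{text}"{suffix}'


def any_of(*parts: str) -> str:
    """Join clauses with OR inside parentheses."""
    return "(" + " OR ".join(parts) + ")"


def all_of(*parts: str) -> str:
    """Join clauses with AND inside parentheses."""
    return "(" + " AND ".join(parts) + ")"


def fielded(name: str, clause: str) -> str:
    """Prefix a clause with a Lucene field name, e.g. pa:(...)."""
    return f"{name}:{clause}"


def build_example_query() -> str:
    """Rebuild the applicant/description example from the slide."""
    applicants = fielded("pa", any_of("bayer", "genentech", "merck"))
    description = fielded(
        "desc",
        all_of(
            "chemotherap*",
            any_of(phrase("phosphoinositide kinase", "~0.8"), "Pi3K"),
        ),
    )
    return f"{applicants} AND {description}"


if __name__ == "__main__":
    print(build_example_query())
```

Keeping the grouping helpers separate from the field prefix makes it harder to lose a parenthesis when queries are nested several levels deep.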
Lucene fields
Field | Description | Indexed data | Sample query
scpn | SureChEMBL Patent Number (SCPN) | EP-0555555-B1 | scpn:EP-0555555-B1
pn | publication number | EP0555555B1 | pn:ep0555555b1
pd | publication date | 20120101 | pd:20120101
an | application number | EP06009700A | an:EP06009700A
ad | application date | 20061213 | ad:20061213
pri | priority(ies) | DE19958719A 19991206 | pri:"DE19958719A 19991206"
pridate | all priority dates | 20000913 | pridate:20000913
pdyear | publication year | 2013 | pdyear:2013
ds | designated states | DE GB | ds:(DE OR GB OR FR), ds:FR
pctpn | PCT publication number | WO2006098969A2 | pctpn:WO2006098969A2
pctpd | PCT publication date | 20060921 | pctpd:20060921
pctan | PCT application number | US2006008177W | pctan:US2006008177W
pctad | PCT application date | 20060308 | pctad:20060308
relan | related application number | Division of application No. 12/159,232 | relan:US15923208
relad | related application date | Jun 26, 2008 | relad:20080626
ic | IPCR | C C08 C08K0005 | ic:C
cpc | CPC | C C07D047104 | cpc:C07D
ecla | ECLA | C07D 487/10 | ecla:C07D 487/10
uc | US class | 29 | uc:029
inv | inventor(s) | schmidt hans-werner | inv:("schmidt hans" AND thelakkat)
apl | applicant | Sony International (Europe) GmbH | apl:sony
asg | assignee(s) | SIEMENS AKTIENGESELLSCHAFT | asg:siemens
pa | applicant(s) or assignee(s) (apl or asg) | see apl and asg above | pa:sony
cor | correspondent | Dr Roger Brooks | cor:"Dr Roger Brooks"
agt | agents | Pohlman, Sandra M | agt:"Pohlman, Sandra M"
pcit | patent citations | EP0748154B1 | pcit:EP0748154B1
ncit | non-patent citations | TANG C W: "Two-layer organic photovoltaic cell" | ncit:(tang AND "Two-layer organic photovoltaic cell")
ttl | title in English, French and German | Sonnenenergiesystem | ttl:("solar energy" OR "énergie solaire" OR Sonnen*)
ab | abstract in English, French and German | |
desc | description in English, French and German | |
clm | claims in English, French and German | |
text | abstract or description or claims in English, French or German | |
pnlang | publication language | EN FR DE PT NO RU NL SV FI TR IS and more | pnlang:(NO OR FI OR SV)
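With this many Lucene field names, a mistyped prefix is an easy mistake. The sketch below is one hedged way to check a query against the field names listed above before running it. The function name `unknown_fields` is hypothetical, the check assumes field names are written in lower case exactly as on the slide, and it only understands straight double quotes.

```python
import re

# Field names as listed on the slide; case handling is an assumption.
KNOWN_FIELDS = frozenset({
    "scpn", "pn", "pd", "an", "ad", "pri", "pridate", "pdyear", "ds",
    "pctpn", "pctpd", "pctan", "pctad", "relan", "relad",
    "ic", "cpc", "ecla", "uc", "inv", "apl", "asg", "pa",
    "cor", "agt", "pcit", "ncit",
    "ttl", "ab", "desc", "clm", "text", "pnlang",
})

_QUOTED = re.compile(r'"[^"]*"')                  # quoted phrases
_FIELD_PREFIX = re.compile(r"(?<!\w)([A-Za-z]+):")  # word followed by a colon


def unknown_fields(query: str) -> list[str]:
    """Return field prefixes used in the query that are not in KNOWN_FIELDS.

    Quoted phrases are blanked out first so that a colon inside a phrase
    (for example in a citation title) is not mistaken for a field prefix.
    """
    unquoted = _QUOTED.sub(" ", query)
    used = {match.group(1) for match in _FIELD_PREFIX.finditer(unquoted)}
    return sorted(used - KNOWN_FIELDS)


if __name__ == "__main__":
    print(unknown_fields('assignee:sony AND ttl:"solar energy"'))
```

An empty list means every prefix in the query is one of the documented fields; it says nothing about whether the values themselves are valid.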
Fielded keyword search • Keyword search • Filter by document section • Logical operators
SureChEMBL Patent Numbers (SCPN) • Standardised format used to search the system • Format: CC-PATNO-KK, e.g. WO-2011161255-A2 • Batch conversion available via a link on the interface homepage
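The CC-PATNO-KK layout can be illustrated with a small Python sketch. The function `to_scpn` is a hypothetical helper, not the batch converter offered on the homepage, and it assumes a simple input shape: a two-letter country code, a run of digits, and a kind code made of one letter plus an optional digit. Numbers that need padding or other office-specific normalisation are out of scope here.

```python
import re

# Country code, digits, kind code; a single optional hyphen or space between parts.
_PATENT_NUMBER = re.compile(
    r"^\s*([A-Za-z]{2})[-\s]?(\d+)[-\s]?([A-Za-z]\d?)\s*$"
)


def to_scpn(raw: str) -> str:
    """Format a simple publication number as CC-PATNO-KK.

    Raises ValueError when the input does not match the assumed shape.
    """
    match = _PATENT_NUMBER.match(raw)
    if match is None:
        raise ValueError(f"unrecognised patent number: {raw!r}")
    country, number, kind = match.groups()
    return f"{country.upper()}-{number}-{kind.upper()}"


if __name__ == "__main__":
    print(to_scpn("WO2011161255A2"))
```

For anything beyond simple cases, the batch conversion on the interface homepage remains the reference.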
Keyword searches return documents
Patent family members
Export patent chemistry • Property range filters • Count filters • Go to ‘My Exports’ to download CSV or XML
Patent view - Front page
Patent view - Claims
Chemical entities in patent • Click on blue highlighted text to see the chemical info box
Patent view - Tools • Access to source document PDF • Export chemistry for document or family
Chemistry-based searching • Structure sketch (2 sketchers) • Types of search • Filter by MW range • Filter by document section
Chemistry searches return structures • Tautomers are registered as different structures, unlike in ChEMBL; this will likely change in future
Review chemistry hits
Compound report page • UniChem integration: on-the-fly integration with ~81 M structures from 28 data sources
Review patent documents for chemistry
SureChEMBL knowledge base
SureChEMBL support • surechembl-help@ebi.ac.uk
ChEMBL blog • http://chembl.blogspot.co.uk/
Summary • SureChEMBL • Coverage and content • Capabilities • Future plans • Interface demo
Acknowledgements • ChEMBL team: John Overington, Jon Chambers, George Papadatos, Mark Davies, Nathan Dedman, Anna Gaulton • Digital Science: Nicko Goncharoff, James Siddle, Richard Koks • Open PHACTS consortium: http://www.openphacts.org/partners/consortium • Funding: Innovative Medicines Initiative Joint Undertaking, grant agreement no. 115191 (Open PHACTS); Wellcome Trust Strategic Award for Chemogenomics, WT086151/Z/08/Z; European Molecular Biology Laboratory; European Commission FP7 Capacities Specific Programme, grant agreement no. 284209 (BioMedBridges)
Future webinars: • 25th March - UniProt: Exploring protein sequence and functional information • 8th April - Introduction to ENA • 22nd April - Ensembl Tools • 6th May - Reactome: Exploring biological pathways • All webinars at 4:00 pm GMT • For details see: http://www.ebi.ac.uk/training/online/emblebi-training-webinar-series-2015
myChEMBL Example
What is myChEMBL? • A Virtual Machine, preloaded with… • A complete version of the ChEMBL database • Chemical structure searching • GUI & web services for accessing the database • A suite of chemoinformatics and data analysis tools • Tutorials on a range of topics • Using ChEMBL data • Chemoinformatics, machine learning, etc. • Completely free and open
myChEMBL: Applications • Centralised Resource • VM shareable across the local network • Access to standardised tools, services and data • Application Development • Sandboxed VM, all source code available • Learning • Lowers ‘activation barrier’ with pre-installed tools and examples • Teaching, Training & Dissemination • IPython notebooks and KNIME • 2nd Prize at ACS Teach-Discover-Treat competition
myChEMBL LaunchPad
SureChEMBL and myChEMBL • More: http://chembl.blogspot.co.uk/2014/10/mychembl-19-released.html • Download: ftp://ftp.ebi.ac.uk/pub/databases/chembl/VM/myChEMBL/current/
SureChEMBL and myChEMBL • http://nbviewer.ipython.org/github/rdkit/UGM_2014/blob/master/Notebooks/Vardenafil.ipynb