Use of Semantic Technologies at Eli Lilly and Company

  • Slides: 35
Use of Semantic Technologies at Eli Lilly and Company
J Phil Brooks, Information Consultant, SE Data Team, Discover IT, Eli Lilly and Company
Copyright © 2008 Eli Lilly and Company

Agenda
  • Project Overviews
  • Discovery Metadata
  • Integrative Informatics
  • POC 4
  • POC 1
  • Metadata Repository
  • External Collaborations
  • Conclusions
  • Acknowledgements
3/7/2021 Copyright© 2008 Eli Lilly and Company Semantic Technologies (from an old DBA’s perspective)

Project Overview: Discovery Metadata

Discovery Metadata: Goals
Integrate Master Data throughout the pharmaceutical discovery process to enable information sharing and integration for the scientific community
  • Model key relationships between Master Data classes
  • Provide the ability to integrate disparate data sets more quickly than the normal warehouse paradigm typically allows
  • Create a reusable and sustainable semantic implementation
  • Allow for user-driven, manual curation of key data relationships
  • Develop core competencies in Semantic Web technologies within Eli Lilly
  • Position the Semantic Web within Eli Lilly
      • Strengths
      • Weaknesses
      • When to use?

Discovery Metadata: Ontology
[Figure: ontology diagram. Source labels: SAP, Legacy, REFDB, GSM, NCBI, Manual Curation]

Discovery Metadata: Architecture
[Figure: layered architecture diagram. Recoverable labels by layer:]
  • APPS: Application 1, Application 2, Application 3, Other Tools, Spreadsheets
  • SOA: SOA Layer/Enterprise Service Bus (Web Services, Visualizers, Data Access Components), Authentication
  • Access: SQL, SPARQL
  • DATA: Top Level Ontology, Source Model 1, Source Model 2, Source Model 3, Source Model 4, Local Assertions, Provenance
  • Sources: RDBMS, Other Sources, ETL

Discovery Metadata: Implementation
  • Oracle Semantic Technologies 11g
  • TopBraid Composer, Maestro Edition v2.6.2
  • Multiple Oracle models segregated by source
      • Top-Level Ontology
      • Enterprise data sources (3)
      • External data sources (NCBI)
      • Custom/Local assertions (2)
  • ~4.4M triples
      • Loaded triples: 2.1M
      • Inferred triples: 2.3M
  • Custom-developed browser
  • Metadata-driven web service providing cross-application access to master data
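To make the loaded/inferred distinction concrete, here is a minimal sketch in plain Python. It is not Oracle's rule engine: the model names, the `ex:` identifiers, and the two rules (subclass transitivity and type propagation) are assumptions chosen for illustration only. Triples are kept in one set per source, mirroring the "models segregated by source" layout, and the inferred triples are whatever the rules add beyond the loaded ones.

```python
from itertools import chain

# Hypothetical per-source models; the real models are not shown on the slides.
MODELS = {
    "top_level_ontology": {
        ("ex:Gene", "rdfs:subClassOf", "ex:MasterData"),
        ("ex:Target", "rdfs:subClassOf", "ex:Gene"),
    },
    "enterprise_source_1": {
        ("ex:target42", "rdf:type", "ex:Target"),
    },
    "local_assertions": {
        ("ex:target42", "ex:curatedBy", "ex:someScientist"),
    },
}


def infer(triples):
    """Return triples entailed by two RDFS-style rules that are not in `triples`.

    Rules applied until nothing new appears:
      1. subClassOf is transitive
      2. an instance of a class is also an instance of its superclasses
    """
    loaded = set(triples)
    known = set(loaded)
    while True:
        new = set()
        subclass_pairs = [(s, o) for s, p, o in known if p == "rdfs:subClassOf"]
        for child, parent in subclass_pairs:
            for other_child, other_parent in subclass_pairs:
                if parent == other_child:
                    new.add((child, "rdfs:subClassOf", other_parent))
            for s, p, o in known:
                if p == "rdf:type" and o == child:
                    new.add((s, "rdf:type", parent))
        new -= known
        if not new:
            return known - loaded
        known |= new


loaded = set(chain.from_iterable(MODELS.values()))
inferred = infer(loaded)
print(len(loaded), "loaded,", len(inferred), "inferred")
```

Keeping each source in its own set means one source can be reloaded or dropped without touching the others, at the cost of recomputing the inferred set afterwards.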

Discovery Metadata: Future Work
  • Implement provenance at the instance level
  • Integrate additional data sources (MeSH, Gene Ontology, KEGG, internal data sources)
  • Operationalize load processes
  • Finalize visualization standards
  • Performance reviews (scalability)

Project Overview: Integrative Informatics

Integrative Informatics: Overview
  • The focus of Integrative Informatics is to facilitate data integration between the discovery and medical components within Eli Lilly
  • Their methodology is to execute Proof of Concept (POC) projects to identify, construct, and test various solutions to the integration problem
  • Efforts:
      • POC 1: CATIE project
      • POC 4: Endocrine PI Competitive Intelligence
      • Generic Browser efforts

Integrative Informatics: POC 1 - CATIE Semantic Integration
What is the CATIE study?
  • Clinical Antipsychotic Trials of Intervention Effectiveness
  • Was the most comprehensive independent trial ever completed to examine existing antipsychotic therapies for schizophrenia
  • Provides detailed information comparing the effectiveness and side effects of five medications currently used to treat schizophrenia
      • Olanzapine
      • Quetiapine
      • Risperidone
      • Ziprasidone
      • Perphenazine
  • Greatly enhances the knowledge available to guide treatment choices for people with schizophrenia
Source: Chandra Ranga Gudivada

Integrative Informatics: POC 1 Goals
  • Determine whether semantic integration and analysis of the CATIE data set in the context of metabolic and signal transduction pathways with receptor affinities can provide answers to specific scientific questions:
      • Which pathways are associated with response to the 5 different schizophrenia drugs?
      • How do these pathways compare between treatment arms?
      • Which receptors are associated with response to the 5 schizophrenia drugs?
      • How are the pathways, receptors and the drug response genes from the CATIE data set related?
Source: Chandra Ranga Gudivada

Integrative Informatics: POC 1 Data Aggregation
  • CATIE Drugs:
      • Olanzapine
      • Perphenazine
      • Quetiapine
      • Risperidone
      • Ziprasidone
  • Datasets:
      • Entrez Gene
      • PubChem (for CATIE Drugs)
      • Assay (Receptor Affinity Data for CATIE)
      • KEGG
      • Reactome
      • BioCyc
      • Transpath
Source: Chandra Ranga Gudivada

Integrative Informatics: POC 1 Architecture
[Figure: processing pipeline. Recoverable stages:]
  • Data in multiple formats (flat file, tab-delimited, XML)
  • Top-Level Ontology
  • RDF conversion using the Jena programming API
  • Oracle 11g RDF store; AllegroGraph native RDF triple store
  • Perform SPARQL querying
Source: Chandra Ranga Gudivada
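The RDF conversion step can be sketched as follows. The slides name the Jena API (Java) for this step; the block below is a standard-library Python stand-in, not the project's code. The `BASE` namespace, the column names, and the sample rows are all assumptions; only the drug names and the study name come from the slides. Each non-key column of a tab-delimited row becomes one N-Triples statement with a plain literal object.

```python
import csv
import io

BASE = "http://example.org/poc1/"  # hypothetical namespace, not from the slides


def rows_to_ntriples(tsv_text, subject_col, base=BASE):
    """Turn each non-empty cell of a tab-delimited table into one N-Triples line.

    The value in `subject_col` names the subject; every other column name is
    used as the predicate, and the cell value becomes a plain string literal.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    lines = []
    for row in reader:
        subject = f"<{base}{row[subject_col]}>"
        for column, value in row.items():
            if column == subject_col or not value:
                continue
            # Escape backslashes and double quotes for the N-Triples literal.
            literal = value.replace("\\", "\\\\").replace('"', '\\"')
            lines.append(f'{subject} <{base}{column}> "{literal}" .')
    return lines


# Illustrative input only; the real source files are not shown on the slides.
sample = "drug_id\tname\tstudy\nd1\tOlanzapine\tCATIE\nd2\tQuetiapine\tCATIE\n"
ntriples = rows_to_ntriples(sample, subject_col="drug_id")
for line in ntriples:
    print(line)
```

A line-oriented format such as N-Triples is convenient here because the output of many small converters can simply be concatenated before bulk loading into a triple store.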

Integrative Informatics: POC 1 Conclusions
  • Efficient semantic integration can be accomplished by using RDF
  • Powerful complex data modeling can be achieved by using graph principles inherent in RDF
  • Easy translation of scientific questions to graph queries can be accomplished using SPARQL and SEM_MATCH
  • Customized outputs can easily be generated by making slight changes in the SPARQL query pattern
Source: Chandra Ranga Gudivada
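The "question to graph query" translation can be illustrated without a triple store. The sketch below is a toy basic-graph-pattern matcher in plain Python, not SPARQL or SEM_MATCH themselves; the predicates (`bindsTo`, `encodes`, `memberOf`) and the placeholder identifiers (`drugA`, `geneG1`, and so on) are invented and carry no biological meaning. It shows how "which pathways are associated with this drug?" becomes a chain of triple patterns joined on shared variables.

```python
def _unify(term, value, binding):
    """Match one pattern term against one value, binding variables ("?x") as needed."""
    if term.startswith("?"):
        return binding.setdefault(term, value) == value
    return term == value


def match(patterns, triples, binding=None):
    """Yield every variable binding that satisfies all triple patterns."""
    binding = binding or {}
    if not patterns:
        yield dict(binding)
        return
    first, rest = patterns[0], patterns[1:]
    for triple in triples:
        candidate = dict(binding)
        if all(_unify(term, value, candidate) for term, value in zip(first, triple)):
            yield from match(rest, triples, candidate)


# Placeholder data: identifiers are invented, not taken from the CATIE data set.
TRIPLES = [
    ("drugA", "bindsTo", "receptorR1"),
    ("geneG1", "encodes", "receptorR1"),
    ("geneG1", "memberOf", "pathwayP1"),
    ("geneG2", "encodes", "receptorR2"),
    ("geneG2", "memberOf", "pathwayP2"),
]

# "Which pathways are associated with drugA?" as a chain of triple patterns.
QUESTION = [
    ("drugA", "bindsTo", "?receptor"),
    ("?gene", "encodes", "?receptor"),
    ("?gene", "memberOf", "?pathway"),
]

pathways = sorted({b["?pathway"] for b in match(QUESTION, TRIPLES)})
print(pathways)
```

Changing which variables are collected (for example `?receptor` instead of `?pathway`) changes the output without touching the data, which is the kind of "slight change in the query pattern" the last bullet refers to.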

Integrative Informatics: POC 4 - Endocrine PI
Competitive Intelligence (CI) is the purposeful, ethical and coordinated monitoring of the competitors in any industry within a specific marketplace to:
  • Strategically gain foreknowledge of recent developments in your competitors' plans
  • Make calculated, informed business decisions and formulate operational strategy
The purpose of the Endocrine Public Information (PI) project is to provide a mechanism for actively surveying public information for competitive intelligence in the Endocrine area
Source: Chandra Ranga Gudivada

Integrative Informatics: POC 4 Goals
  • Does such a competitive intelligence effort significantly benefit from a semantic component?
  • Does the Endocrine PI project significantly benefit from semantic integration?
  • Are there pre-existing ontologies for Company and method of action (MOA) domains?
  • Do natural language processing (NLP) or text mining methods work for this kind of data?
  • Does “buried” knowledge exist within the datasets that can be discovered using inference and reasoning?
Source: Chandra Ranga Gudivada

Integrative Informatics: POC 4 Integration Challenges
Syntactic Variations
  Company Parent-Child Relations
      • Merck & Co Inc
      • Merck & Co Ltd
  MOA
      • Alpha-glucosidase inhibitor
      • Glucosidase inhibitor alpha
      • IGF binding protein-3 stimulator
      • IGF binding protein stimulator-3
Semantic Variations
  Company
      • Amgen Boulder Inc
      • Applied Molecular Genetics Inc
      • Synergen Inc
      • Amgen
  MOA
      • Serotonin 2A receptor antagonists
      • 5-HT2 receptor antagonist
      • 5-HT2a antagonist
      • Peroxisome proliferator-activated receptor delta antagonist
      • PPAR delta antagonist
      • Melanin concentrating hormone receptor 1 antagonists
      • MCH receptor-1 antagonist
Source: Chandra Ranga Gudivada

Integrative Informatics: POC 4 NLP and Semantic Integration
Terms from Thomson Pharma (raw endocrine data), in order of increasing complexity:
  • Bayer Corp
  • Dopamine receptor agonist
  • SGLT inhibitor
  • Eli Lilly / Eli Lilly & Co Ltd
  • STAT transcription factor stimulant / STAT stimulator
  • Alpha-glucosidase inhibitors / Glucosidase inhibitor-alpha
  • Peroxisome proliferator-activated receptor delta antagonist / PPAR delta antagonist
  • 5-Hydroxytryptamine 2C agonist / 5-HT1c agonist
  • Opioid kappa receptor antagonists / Kappa opioid antagonist
  • Serotonin 1B receptor agonists / 5-HT1d beta agonist
NLP Methods Used:
  • Semantic Normalization
  • Fuzzy Distance
  • Ignoring Stop Words
  • Regular Expressions
  • Tokenization
  • Rule-based Mapping
Source: Chandra Ranga Gudivada
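A rough sketch of how several of the listed methods combine, using only the Python standard library. This is an illustration under assumptions, not the project's pipeline: the stop-word list, the crude plural stripping, and the use of `difflib.SequenceMatcher` as the fuzzy distance are all choices made here. Tokenization plus stop-word removal plus token sorting is enough to collapse the purely syntactic variants; acronym and synonym pairs (for example PPAR versus its expansion) would still need the rule-based mapping step, which is not shown.

```python
import re
from difflib import SequenceMatcher

# Assumed stop words for company and MOA terms; the project's list is not given.
STOP_WORDS = {"inc", "ltd", "co", "corp", "and"}


def normalize(term):
    """Lowercase, tokenize on non-alphanumerics, strip simple plurals and stop words.

    Tokens are sorted so that word order does not matter.
    """
    tokens = re.findall(r"[a-z0-9]+", term.lower())
    stemmed = [
        t[:-1] if len(t) > 3 and t.endswith("s") and not t.endswith("ss") else t
        for t in tokens
    ]
    return tuple(sorted(t for t in stemmed if t not in STOP_WORDS))


def similarity(a, b):
    """Fuzzy score in [0, 1] between the normalized forms of two terms."""
    return SequenceMatcher(None, " ".join(normalize(a)), " ".join(normalize(b))).ratio()


# Syntactic variants from the slides collapse to the same key.
print(normalize("Alpha-glucosidase inhibitors") == normalize("Glucosidase inhibitor-alpha"))
print(normalize("Eli Lilly") == normalize("Eli Lilly & Co Ltd"))

# Near matches get a high but imperfect score and can be queued for curation.
print(round(similarity("Opioid kappa receptor antagonists", "Kappa opioid antagonist"), 2))
```

Sorting tokens trades precision for recall: it merges reordered variants, but it would also merge terms whose meaning depends on word order, so a threshold on the fuzzy score plus manual review of borderline pairs is a reasonable safeguard.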

Integrative Informatics: POC 4 Knowledge Representation
[Figure: RDF graph diagram. Recoverable labels:]
  • Classes: MOA, Company, Disease
  • MOA nodes: Melanin-concentrating hormone receptor antagonists, Melanin concentrating hormone receptor 1 antagonists, MCH 1 antagonists, MCH receptor-1 antagonist, GPR-24 antagonist
  • Company nodes: Amgen, Amgen Boulder Inc, Applied Molecular Genetics Inc, Synergen Inc, Abgenix, Avidia Inc
  • Other nodes: Obesity, Phase 2, Blank Node
  • Edge labels: rdf:type, hasSubClass, preferredLabel, alternativeLabel, hasMOA, hasDrug, hasCompany, hasSubsidiaryCompany, hasTherapeuticArea, hasStatus
Source: Chandra Ranga Gudivada

Integrative Informatics: POC 4 Inferencing
Given Company Name: Applied Molecular Genetics Inc
Query: Get MOAs that this company is working on
Case 1: Without Semantic Integration and Inference
  • 0 Results
Case 2: With Semantic Integration and Inference
  • 18 Results, including:
      • Amgen: Leptin stimulator
      • Amgen Inc: Agouti related protein inhibitor
      • Amgen: Neuropeptide Y antagonist
      • Amgen Inc: Melanocortin MC4 antagonist
Source: Chandra Ranga Gudivada
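The two cases can be mimicked in a few lines of Python. This is a sketch of the idea only: the `SAME_COMPANY` links are assumed for illustration (the slides do not spell out which label or subsidiary relations connect the names), and only the four company/MOA pairs visible on the slide are included, not the full 18 results. A plain lookup on the literal name finds nothing; following the equivalence links first recovers the records filed under the other names.

```python
# Company/MOA pairs as shown on the slide (4 of the 18 reported results).
RECORDS = [
    ("Amgen", "Leptin stimulator"),
    ("Amgen Inc", "Agouti related protein inhibitor"),
    ("Amgen", "Neuropeptide Y antagonist"),
    ("Amgen Inc", "Melanocortin MC4 antagonist"),
]

# Assumed equivalence links between company names (illustrative only).
SAME_COMPANY = [
    ("Applied Molecular Genetics Inc", "Amgen"),
    ("Amgen Inc", "Amgen"),
]


def equivalents(name, links):
    """Return every name reachable from `name` via symmetric, transitive links."""
    seen, stack = {name}, [name]
    while stack:
        current = stack.pop()
        for a, b in links:
            for src, dst in ((a, b), (b, a)):
                if src == current and dst not in seen:
                    seen.add(dst)
                    stack.append(dst)
    return seen


def moas_for(name, records, links=()):
    """MOAs recorded under `name` or any name linked to it."""
    names = equivalents(name, links)
    return sorted(moa for company, moa in records if company in names)


company = "Applied Molecular Genetics Inc"
without_inference = moas_for(company, RECORDS)
with_inference = moas_for(company, RECORDS, SAME_COMPANY)
print(len(without_inference), "results without links")
print(len(with_inference), "results with links")
```

In a triple store the same effect would come from the reasoner materializing the equivalences, so that the query itself stays a simple lookup.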

Integrative Informatics: POC 4 Conclusions
  • Semantic Integration (instance mapping using NLP) coupled with the RDF data model was successful in answering questions in Competitive Intelligence
  • Ontologies provide a powerful framework of dictionaries and taxonomical relations that support reasoning and inference over the data for knowledge discovery
  • Manual curation is a tedious, error-prone and labor-intensive task
  • A semi-automated, intelligent computer-based solution that utilizes Ontologies, Semantic Integration and NLP could drastically reduce the manual curation process while maintaining high-quality information
Source: Chandra Ranga Gudivada

Project Overview: Metadata Repository

Metadata Repository: Goals
Aggregate experiment metadata from a diverse set of LSCDD relational databases into an Oracle Semantic Technologies repository for LSCDD scientific investigation
  • Provide a unified vocabulary for LSCDD scientific investigation
  • Avoid a complex architecture and extended development effort
  • Realize benefits in the near term
  • Preprocess metadata to improve efficiency
  • Characterize the type of questions that the ontology should answer
  • Identify stable semantic technologies, do not employ parsers
  • Allow semantic and relational databases to work together
  • Provide browser, visualization, and query access into the repository
Source: Maurice Manning

Metadata Repository: Ontology
[Figure: ontology diagram]
Source: Maurice Manning

Metadata Repository: High-level Architecture
  • Iterative queries on metadata define items of interest
  • Metadata and raw data are then aggregated to provide additional context for analysis
[Figure: architecture diagram. Recoverable labels: Query, Visualization, Experimental Metadata Repository, Annotation Services, Agilent Expression, aCGH, Affy, Illumina Expression, RNAi Database Screening, TMA, Mutation, SNP, Analysis Results]
Source: Maurice Manning

Metadata Repository: Implementation
  • Protégé Ontology Editor
  • Oracle Semantic Technologies 11g
  • D2R Map (Database to RDF Mapping)
  • C# development in Visual Studio 2005
  • Current data sources include:
      • Expression Data: Affymetrix, Illumina, Agilent
      • aCGH Data
      • RNAi Screening Data
      • Reagent Data
      • Gene Ontology (GO)
      • Medical Subject Headings (MeSH)
  • Currently ~30 million triples
Source: Maurice Manning
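The database-to-RDF mapping idea can be sketched with an in-memory SQLite table. This is a Python analogue for illustration, not D2R Map's own mapping syntax; the `experiment` table, its columns, the `ex:` vocabulary, and the sample rows are all hypothetical. A small declarative mapping says which table maps to which class and which column maps to which predicate, and a generic function walks the rows to emit triples.

```python
import sqlite3

# Hypothetical declarative mapping: one table, one class, column -> predicate.
MAPPING = {
    "table": "experiment",
    "key": "id",
    "class": "ex:Experiment",
    "columns": {"platform": "ex:platform", "sample_id": "ex:usesSample"},
}


def table_to_triples(conn, mapping):
    """Emit (subject, predicate, object) triples for every row of the mapped table."""
    columns = [mapping["key"], *mapping["columns"]]
    column_list = ", ".join(columns)
    sql = f"SELECT {column_list} FROM {mapping['table']} ORDER BY {mapping['key']}"
    triples = []
    for row in conn.execute(sql):
        subject = f"ex:{mapping['table']}/{row[0]}"
        triples.append((subject, "rdf:type", mapping["class"]))
        for column, value in zip(columns[1:], row[1:]):
            if value is not None:  # NULL cells produce no triple
                triples.append((subject, mapping["columns"][column], str(value)))
    return triples


# Illustrative relational source; the real LSCDD schemas are not shown.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE experiment (id INTEGER PRIMARY KEY, platform TEXT, sample_id TEXT)")
conn.executemany(
    "INSERT INTO experiment VALUES (?, ?, ?)",
    [(1, "Affymetrix", "S1"), (2, "Illumina", None)],
)
triples = table_to_triples(conn, MAPPING)

# "Find all the experiments that were done using my sample" as a triple filter.
mine = [s for s, p, o in triples if p == "ex:usesSample" and o == "S1"]
print(mine)
```

Keeping the mapping as data rather than code means a new relational source can be added by writing another mapping entry, which fits the stated goal of avoiding an extended development effort.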

Metadata Repository: Conclusion
With the implementation of the Metadata Repository, it is now possible for users to ask questions such as:
  • Get all the interactions for methylases that are involved in colon cancer. For all these genes, get the expression and aCGH values for all LSCDD colon cancer samples
  • Find cell lines in which RNAi data has been generated using Dharmacon reagents
  • Retrieve the antibodies that have been used to assess the AKT1 pathway activity in MCF7
  • Find all the experiments that were done using my sample
  • Find all samples which are grade III colorectal cancer. For these samples, retrieve the expression, mutation and aCGH data
Source: Maurice Manning

Metadata Repository: Future Work
  • Ability to ask more complicated scientific queries.
  • Query results will be integrated with raw data in relational data sources to provide the user with a single platform for detailed analysis.
  • With user input, the ontology will evolve to include additional entities and attributes as well as links to other public ontologies.
Source: Maurice Manning

External Collaborations

The Open Innovation Center
A non-profit organization led by Dr. Susie Stephens and focused on enabling pre-competitive collaborations across the pharmaceutical industry with the following goals:
  • To increase health and well-being by enabling pharmaceutical companies to make better decisions during drug discovery and development
  • To provide an independent non-profit center for knowledge gathering, representation and mining
  • To create an ecosystem of organizations that adopt the same data standards and terminology, thereby simplifying collaboration
  • To reduce risk and minimize cost
  • To bring together leading technologists to enable rapid sharing of knowledge and skills
  • For the benefit of organizations around the world in biopharmaceuticals, healthcare, payers, information technology, and academia
Source: Susie Stephens

External Collaborations
Participation in W3C’s HCLS group
RDF Access to Relational Databases - Chris Bizer, Eric Prud'hommeaux
  • Scalability testing of relational to RDF mapping approaches
End User Semantic Web Authoring - David Karger
  • Enhancing the scalability and robustness of the Exhibit and Potluck tools (i.e., integrating the tools together, supporting more file types, etc.)
Scientist-Driven Semantic Integration of Knowledge in Alzheimer's Disease - Tim Clark, June Kinoshita
  • Project to develop an integrated knowledge infrastructure for the neuro-medical research community, pairing rich digital semantic context with the ever-growing digital scientific content on the web
Provenance Collection and Management - Carole Goble, Beth Plale
  • Project to develop a metadata taxonomy for global data at Lilly which enables the rapid integration of data and mining/analysis algorithms into dataflows which support clinical and discovery decisions

Conclusions

Conclusions
  • Data integration needs (and issues) abound at Lilly!
  • Eli Lilly and Company is seeing tangible benefits in multiple projects from semantic integration as a means of helping to solve this problem
  • The trend has been to build “semantic warehouses” due to federation challenges
  • Thus far, data volumes are low to moderate
  • Areas for alignment need to be identified and addressed as necessary (both internally and externally)
  • Still searching for the “best” methods for accessing semantic data holistically within the enterprise
  • Provenance is a challenge but is required
  • Tools are improving, but more are needed (especially in the area of visualization)
  • Working to operationalize semantic processes

Acknowledgements
Rosalyn Adams-Smith, Amit Aggarwal, Rakhi Bhat, Phil Brooks, Steven Cao, Hans Constandt, William D Craun, Mahesh Kumar Guzuva Desikan, Ernst Dow, AnnCatherine Downing, Mark Farmen, Kevin Gao, Young Gong, David Greenen, Ranga Chandra Gudivada, Jacob Koehler, Srinivasulu Kota, Michael Lajiness, Maurice Manning, Michael Martin, Mamatha Naik, Laura Nisenbaum, Pavel Pilar, James E Scherschel, Sean Spillane, Susie Stephens, Jeffrey Sutherland, Dirk Tomandl, Jason Wang, Bill Yan, Harold Yin, Yijing Zhou