Cancer Biomedical Informatics Grid (caBIG): An Approach towards Data Access and Integration
Avinash Shanbhag, Director, Core Infrastructure Engineering, National Cancer Institute Center for Bioinformatics
National Cancer Institute 2015 Goal
Relieve suffering and death due to cancer by the year 2015
Origins of caBIG
- Need: Enable investigators and research teams nationwide to combine and leverage their findings and expertise in order to meet the NCI 2015 Goal.
- Strategy: Create a scalable, actively managed organization that will connect members of the NCI-supported cancer enterprise by building a biomedical informatics network in which data can be seamlessly shared.
caBIG Challenges
- Handle diversity of data types
- Precise "meaning" of data
- Provide local hosting of data
- Local access control
- Provide tools to "publish" and "access" data easily
- High-performance computing will be needed in the future
Interoperability
The ability of a system to access and use the parts or equipment of another system
- Syntactic interoperability
- Semantic interoperability
How to Achieve Interoperability for Data Systems?
- Well-documented public API access to data
- Based on an object-oriented abstraction of the underlying data
  - No particular technology or tool specified
- Abstraction layer must be derived using widely accepted "standards"
  - Model Driven Architecture
- The Information Model is the "metadata" of the data and needs to be persisted and accessible via API
- Need to be able to "unambiguously" and programmatically determine the meaning of data
OMG Model Driven Architecture (MDA) Approach
- Analyze the problem space and develop the artifacts for each scenario
  - Use Cases
- Use Unified Modeling Language (UML) to standardize model representations and artifacts. Design the system by developing artifacts based on the use cases
  - Class Diagram: Information Model
  - Sequence Diagram: Temporal Behavior
- Use meta-model tools to generate the code
Limitations of MDA
- Limited expressivity for semantics
- No facility for runtime semantic metadata management
caCORE
Syntactic and Semantic Integration
MDA plus a whole lot more!
caCORE
[Diagram: Bioinformatics Objects, Common Data Elements, Enterprise Vocabulary, Security]
Use Cases
- Description
- Actors
- Basic Course
- Alternative Course
Bioinformatics Objects
Common Data Elements
- What do all those data classes and attributes actually mean, anyway?
- Data descriptors or "semantic metadata" required
- Computable, commonly structured, reusable units of metadata are "Common Data Elements" or CDEs
- NCI uses the ISO/IEC 11179 standard for metadata structure and registration
- Semantics all drawn from Enterprise Vocabulary Service resources
Enterprise Vocabulary
[Diagram: Description Logic, Concept Code, Relationships, Preferred Name, Definition, Synonyms]
Semantic metadata example: Agent
  <Agent>
    <name>Taxol</name>
    <nSCNumber>007</nSCNumber>
  </Agent>
Why do you need metadata?
Class/Attribute | Example Data | CIA Metadata | NCI Metadata
Agent | | A sworn intelligence agent; a spy | Chemical compound administered to a human being to treat a disease or condition, or prevent the onset of a disease or condition
Agent nSCNumber | 007 | Identifier given to an intelligence agent by the National Security Council | Identifier given to chemical compound by the US Food and Drug Administration Nomenclature Standards Committee
Agent name | Taxol | CIA code name given to intelligence agents | Common name of chemical compound used as an agent
Computable Interoperability
[Diagram: "My model" class Agent (name, nSCNumber) and "Your model" class Drug (id, NDCCode, CTEPName, approvalDate, FDAIndID, approver, IUPACName, fdaCode), annotated with concept codes C1708 and C1708:C41243]
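One way to picture what the slide calls computable interoperability is to match attributes across two models by the concept codes that annotate them rather than by their names. The sketch below is only an illustration under assumptions: the class `ConceptMatchSketch`, its methods, and the `PLACEHOLDER-` codes are invented here, and pairing `C1708:C41243` with `Agent.nSCNumber` and `Drug.id` is one reading of the diagram, not something the slide states outright. It is not a caCORE or caDSR API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: each model is a map from attribute name to the
// concept code that annotates it. Not a caCORE or caDSR API.
class ConceptMatchSketch {

    // "My model": the code on nSCNumber is an assumed reading of the slide.
    static Map<String, String> myModel() {
        Map<String, String> model = new LinkedHashMap<>();
        model.put("Agent.name", "PLACEHOLDER-NAME");    // made-up code
        model.put("Agent.nSCNumber", "C1708:C41243");   // assumed pairing
        return model;
    }

    // "Your model": different attribute names, same underlying concept.
    static Map<String, String> yourModel() {
        Map<String, String> model = new LinkedHashMap<>();
        model.put("Drug.id", "C1708:C41243");           // assumed pairing
        model.put("Drug.NDCCode", "PLACEHOLDER-NDC");   // made-up code
        return model;
    }

    // Pairs every attribute of `mine` with an attribute of `yours`
    // that carries the same concept code.
    static Map<String, String> matchByConcept(Map<String, String> mine,
                                              Map<String, String> yours) {
        Map<String, String> matches = new LinkedHashMap<>();
        for (Map.Entry<String, String> m : mine.entrySet()) {
            for (Map.Entry<String, String> y : yours.entrySet()) {
                if (m.getValue().equals(y.getValue())) {
                    matches.put(m.getKey(), y.getKey());
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(matchByConcept(myModel(), yourModel()));
    }
}
```

With the sample data above, only `Agent.nSCNumber` and `Drug.id` are paired, even though their names have nothing in common.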
Cancer Data Standards Repository
- ISO/IEC 11179 Registry for Common Data Elements: units of semantic metadata
- Client for Enterprise Vocabulary: metadata constructed from controlled terminology and annotated with concept codes
- Precise specification of Classes, Attributes, Data Types, Permissible Values: strong typing of data objects
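The "strong typing" point can be illustrated with a tiny data element that checks both the data type and the permissible values of a candidate value. Everything in this sketch is hypothetical: `DataElementSketch`, the `Agent.doseUnit` attribute, and its value list are made up for illustration and do not reflect the actual caDSR data model or any ISO/IEC 11179 API.

```java
import java.util.Set;

// Hypothetical sketch of a data element with a declared data type and an
// optional set of permissible values. Not the caDSR data model.
class DataElementSketch {
    final String attribute;
    final Class<?> dataType;
    final Set<String> permissibleValues; // empty set means "any value of the type"

    DataElementSketch(String attribute, Class<?> dataType, Set<String> permissibleValues) {
        this.attribute = attribute;
        this.dataType = dataType;
        this.permissibleValues = permissibleValues;
    }

    // A value is accepted only if it has the declared type and, when a
    // value list exists, appears in that list.
    boolean accepts(Object value) {
        if (!dataType.isInstance(value)) {
            return false;
        }
        return permissibleValues.isEmpty() || permissibleValues.contains(String.valueOf(value));
    }

    // Made-up example element; the attribute and values are illustrative only.
    static DataElementSketch doseUnit() {
        return new DataElementSketch("Agent.doseUnit", String.class, Set.of("mg", "mL"));
    }

    public static void main(String[] args) {
        DataElementSketch element = doseUnit();
        System.out.println(element.accepts("mg"));   // in the value list
        System.out.println(element.accepts("cups")); // right type, not permissible
        System.out.println(element.accepts(5));      // wrong type
    }
}
```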
caCORE Tools
- UML Loader: automatically register UML models as metadata components
- CDE Curation: fine-tune metadata and constrain permissible values with data standards
- Form Builder: create standards-based data collection forms
- CDE Browser: search and export metadata components
- Common Security Module: provides role-based security
caCORE Software Development Kit
- UML Modeling Tool (any with XMI export)
- Semantic Connector (concept binding utility)
- UML Loader (model registration in caDSR)
- Codegen (middleware code generator)
- Security Adaptor (Common Security Module)
The caCORE SDK generates a syntactically and semantically interoperable data service system.
caGrid
caCORE meets grid technology!
Use cases not satisfied by caCORE alone
- Advertisement
  - Service provider composes service metadata describing the service and publishes it to the grid.
- Discovery
  - Researcher (or application developer) specifies search criteria describing a service of interest.
  - The researcher submits the discovery request to a discovery service, which identifies a list of services matching the criteria and returns the list.
- Invocation
  - Researcher (or application developer) instantiates the grid service and accesses its resources.
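The three use cases above can be sketched as a toy in-memory registry: a provider advertises a service with its metadata, a client discovers services by a search criterion, and the client then invokes the one it found. This is a simplified illustration only; `GridRegistrySketch`, `ServiceEntry`, and the sample services are invented names, and a real grid would use remote index and discovery services rather than a local list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical in-memory stand-in for advertisement, discovery, invocation.
// Not the caGrid or Globus API.
class GridRegistrySketch {

    // Service metadata plus a callable operation standing in for the service.
    static class ServiceEntry {
        final String name;
        final String exposedConcept;
        final Function<String, String> operation;

        ServiceEntry(String name, String exposedConcept, Function<String, String> operation) {
            this.name = name;
            this.exposedConcept = exposedConcept;
            this.operation = operation;
        }
    }

    private final List<ServiceEntry> entries = new ArrayList<>();

    // Advertisement: the provider publishes its service metadata.
    void advertise(ServiceEntry entry) {
        entries.add(entry);
    }

    // Discovery: return every service whose metadata matches the criterion.
    List<ServiceEntry> discover(String concept) {
        List<ServiceEntry> hits = new ArrayList<>();
        for (ServiceEntry entry : entries) {
            if (entry.exposedConcept.equals(concept)) {
                hits.add(entry);
            }
        }
        return hits;
    }

    // Made-up sample registry; the service names are illustrative only.
    static GridRegistrySketch demoRegistry() {
        GridRegistrySketch registry = new GridRegistrySketch();
        registry.advertise(new ServiceEntry("AgentLookup", "C1708",
                name -> "agent record for " + name));
        registry.advertise(new ServiceEntry("OtherService", "PLACEHOLDER", input -> input));
        return registry;
    }

    public static void main(String[] args) {
        List<ServiceEntry> found = demoRegistry().discover("C1708");
        // Invocation: call the discovered service.
        System.out.println(found.get(0).name + ": " + found.get(0).operation.apply("Taxol"));
    }
}
```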
[Diagram: NCI, Cancer Centers, other caBIG service providers, and other toolkits, each labeled Silver or Gold]
caGrid Components
- Leverage existing technologies:
  - caDSR, EVS, Mobius GME: common data elements, controlled vocabularies, schema management
  - Globus Toolkit (currently version 4.0.1)
    - Core grid services infrastructure
    - Service deployment, service registry, invocation, base security infrastructure
- Additional Core Infrastructure
  - Higher-level security services (Dorian)
  - Grid service access to metadata components (caDSR, GME, etc.)
  - Workflow, Identifier services
- Service Provider Tooling (Introduce)
  - Graphical service development and configuration environment
  - Abstractions from service infrastructure for Data and Analytical services
  - Deployment wizards
- Client Tooling
  - High-level APIs for interacting with core components and services
  - Graphical tools
caGrid 0.5 Architecture (may be updated for 1.0)
[Architecture diagram; recoverable labels: Quality of Service, Functions, Service Description, Service Registry, Index, Grid Communication Protocol, Transport, Globus Toolkit (GT3), GME, caDSR, EVS, GUMS, GSI, CAMS, Resource Management, OGSA-DAI, UI, Process, Security, Analytical, Business, ID Resolution, Semantic service]
Data Object Semantics, Metadata, and Schemas
- Object-oriented APIs, well-defined data types
- Classes defined in UML and converted into ISO/IEC 11179, registered in the caDSR
- Definitions drawn from Enterprise Vocabulary Services (EVS), relationships semantically described
- XML serialization of objects adheres to XML schemas registered in the Global Model Exchange (GME)
Introduce Toolkit
- A framework that enables fast and easy creation of caGrid-compatible services, whether they are data, analytical, custom, or core services
- Provides easy-to-use graphical service authoring tools
- Hides all "grid-ness" from developers so that they can concentrate on the domain-specific implementation
- Uses best-practice layered grid service architectures
- Handles all service architecture requirements of caGrid
  - Strong service interface data typing
  - Metadata and service registration
  - Grid security integration
Data Service Access on caGrid
- Specialization of caGrid grid services to expose data through a common query interface
- Present an object view of data sources
- Exposed objects are registered in caDSR and their XML representation in GME
- Queries made with caBIG Query Language (CQL) Query objects
- Results returned as objects (or identifiers) nested in a CQL Query Result Set
Data Service Interface
  public CQLQueryResultsType processQuery(CQLQueryType query)
- Data provider's only responsibility is to implement CQL over their local data resource
  - A default implementation will be provided for systems created with the caCORE SDK
- caGrid provides the grid service implementation to invoke the provider's CQL implementation
- Service provides all features necessary for compliance, such as advertisement of data service metadata and security integration
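To make the provider's responsibility concrete, here is a minimal sketch of a data provider implementing the `processQuery` signature from the slide over a local, in-memory data source. The type names `CQLQueryType` and `CQLQueryResultsType` come from the slide, but their fields here are invented stand-ins (the source does not show the CQL schema), and `DataServiceSketch`, `InMemoryDataService`, and the sample records are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Toy stand-in for a CQL query: "objects of targetClass where attribute == value".
// The fields are assumptions; the real CQL structure is not shown in the slides.
class CQLQueryType {
    final String targetClass;
    final String attribute;
    final String value;

    CQLQueryType(String targetClass, String attribute, String value) {
        this.targetClass = targetClass;
        this.attribute = attribute;
        this.value = value;
    }
}

// Toy stand-in for a result set: matching objects as attribute maps.
class CQLQueryResultsType {
    final List<Map<String, String>> objects;

    CQLQueryResultsType(List<Map<String, String>> objects) {
        this.objects = objects;
    }
}

// The single operation a data provider implements (signature from the slide).
interface DataServiceSketch {
    CQLQueryResultsType processQuery(CQLQueryType query);
}

// Hypothetical provider that answers queries from an in-memory map.
class InMemoryDataService implements DataServiceSketch {
    private final Map<String, List<Map<String, String>>> objectsByClass;

    InMemoryDataService(Map<String, List<Map<String, String>>> objectsByClass) {
        this.objectsByClass = objectsByClass;
    }

    @Override
    public CQLQueryResultsType processQuery(CQLQueryType query) {
        List<Map<String, String>> hits = new ArrayList<>();
        for (Map<String, String> object : objectsByClass.getOrDefault(query.targetClass, List.of())) {
            if (query.value.equals(object.get(query.attribute))) {
                hits.add(object);
            }
        }
        return new CQLQueryResultsType(hits);
    }

    // Made-up sample data; the second record is purely illustrative.
    static InMemoryDataService demo() {
        return new InMemoryDataService(Map.of("Agent", List.of(
                Map.of("name", "Taxol", "nSCNumber", "007"),
                Map.of("name", "ExampleAgent", "nSCNumber", "000"))));
    }

    public static void main(String[] args) {
        CQLQueryResultsType results = demo().processQuery(new CQLQueryType("Agent", "name", "Taxol"));
        System.out.println(results.objects.size() + " match(es)");
    }
}
```

The point of the design is that clients only ever see `processQuery`; how the provider maps the query onto its local data resource stays behind the interface.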
Data Service Query Scenario
1. Client builds a CQL Query
2. CQL Query is serialized and submitted to the Grid Data Service
3. Grid Data Service deserializes the CQL Query object and processes it
4. Data source is queried by the Grid Data Service
5. Grid Data Service builds a CQL Result Set
6. Result Set is serialized and returned to the client
7. Client deserializes the Result Set
8. Result Set is iterated with client tools to retrieve objects
Federated and Aggregated Queries
- Componentized library being developed to facilitate limited federated and aggregated queries
- An extension language is used to describe distributed queries
- Library creates and executes a Query Plan for the distributed query, using multiple CQL queries to targeted data services
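As a rough picture of the query-plan idea, the sketch below treats a plan as a list of steps, each naming a target data service and the query to send it, and aggregates the per-service results into one list. This is a deliberately simplified assumption about how such a library might work, not the library described on the slide: `FederatedQuerySketch`, `PlanStep`, and the two fake sites are invented, and each data service is reduced to a plain function from a query string to result strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch of executing a distributed query plan by fanning
// out to several data services and aggregating their results.
class FederatedQuerySketch {

    // One step of the plan: which service to call and with which query.
    static class PlanStep {
        final Function<String, List<String>> service;
        final String query;

        PlanStep(Function<String, List<String>> service, String query) {
            this.service = service;
            this.query = query;
        }
    }

    // Runs every step in order and concatenates the results.
    static List<String> execute(List<PlanStep> plan) {
        List<String> merged = new ArrayList<>();
        for (PlanStep step : plan) {
            merged.addAll(step.service.apply(step.query));
        }
        return merged;
    }

    // Made-up plan over two fake sites that echo the query they receive.
    static List<PlanStep> demoPlan() {
        Function<String, List<String>> siteA = q -> List.of("siteA:" + q);
        Function<String, List<String>> siteB = q -> List.of("siteB:" + q, "siteB:" + q + "#2");
        List<PlanStep> plan = new ArrayList<>();
        plan.add(new PlanStep(siteA, "Agent"));
        plan.add(new PlanStep(siteB, "Agent"));
        return plan;
    }

    public static void main(String[] args) {
        System.out.println(execute(demoPlan()));
    }
}
```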
Data Service Client Tooling
- APIs provided to discover available data services on the grid based on client-defined criteria (such as exposed data models and concepts)
- Object-oriented API for building queries, querying a given data service, and processing the results
- Client tools available to iterate query result sets
  - Object iterator deserializes XML into registered objects
  - XML iterator simply returns XML documents
Acknowledgements (caGrid Team)
- Ohio State University Department of BioMedical Informatics
  - Dave Ervin, Shannon Hastings, Tahsin Kurc, Stephen Langella, Scott Oster, Joel Saltz
- Argonne National Lab / University of Chicago
  - William Allcock, Jarek Gawor, Ravi Madduri, Frank Siebenlist, Michael Wilde
- Duke University
  - A. Jamie Cuticchia, Patrick McConnell
- Georgetown University
  - Colin Freas, Paul A. Kennedy, Chad La Joie
- SAIC (http://www.saic.com)
  - Manav Kher
- ScenPro/SemanticBits
  - Vinay Kumar, David Wellborn, Valerie Bragg
- Booz | Allen | Hamilton (http://www.bah.com)
  - Arumani Manisundaram, Michael Keller, Reechik Chatterjee
Acknowledgements
- NCI
  - Andrew von Eschenbach, Anna Barker, Wendy Patterson
  - OC, DCTD, DCB, DCP, DCEG, DCCPS, CCR
- Industry Partners
  - SAIC, BAH, Oracle, ScenPro, Ekagra, Apelon, Terrapin Systems, Panther Informatics
- NCICB
  - Ken Buetow, Peter Covitz, George Komatsoulis, Denise Warzel, Frank Hartel, Sherri De Coronado, Dianne Reeves, Gilberto Fragoso, Jill Hadfield, Leslie Derr