The cancer Biomedical Informatics Grid: Connecting the Cancer Research Community Scott Oster Department of Biomedical Informatics Ohio State University Challenges of Large Applications in Distributed Environments (CLADE) 2007 Monterey Bay, California June 25, 2007
Agenda § caBIG Overview § caGrid § Challenges of caBIG
Cancer Background § This year there will be approximately 1,400,000 Americans diagnosed with cancer § More than 500,000 Americans are expected to die from cancer this year § In 2005, the NIH estimated costs for cancer at $209.9 billion, with direct medical costs of $74 billion
First, a visionary non-technical challenge…
National Cancer Institute 2015 Goal “Relieve suffering and death due to cancer by the year 2015”
Origins of caBIG § Goal: Enable investigators and research teams nationwide to combine and leverage their findings and expertise in order to meet the NCI 2015 Goal § Strategy: Create a scalable, actively managed organization that will connect members of the NCI-supported cancer enterprise by building a biomedical informatics network
caBIG Community § More than 50 Cancer Centers § 30 Organizations (of 61 total) § Government, Industry, Standards § Over 800 people
caBIG Domain Workspaces § The data and tool producers:
§ Clinical Trial Management Systems: provides software tools for consistent, open, and comprehensive clinical trials management, including enrollment of patients, tracking of protocols, recording of outcomes information, administration of trials, and submission of data to regulatory authorities
§ Integrative Cancer Research: builds software tools and systems to enable integration of clinical information (such as data collected from biospecimen donors) with molecular information (such as data from high-throughput genomic and proteomic technologies)
§ In Vivo Imaging: provides technology for the sharing and analysis of in vivo (in the body) imaging data, such as MRI and PET scans, in both basic and clinical research settings
§ Tissue Banks and Pathology Tools: develops software tools for the collection, processing, and dissemination of biospecimens, including the annotation of those biospecimens with donor clinical and protocol data, as well as for the operational and administrative aspects of biorepositories
caBIG Strategic Workspaces § The policy makers:
§ Data Sharing and Intellectual Capital: develops policies for the sharing of data, software, and inventions within the caBIG™-funded cancer community. This workspace addresses, for example, how to implement patient protection policies; the ethical, legal, and contractual obligations associated with the sharing of clinical data and biospecimens; and how the public and private sector should interact when using caBIG™ tools in collaboration
§ Documentation and Training: provides technical training for software developers in the use of caBIG™ resources, including online tutorials, workshops, and education programs
§ Strategic Planning: assists in identifying strategic priorities for the development and evolution of caBIG™
caBIG Cross-Cutting Workspaces § The infrastructure and standards developers:
§ Architecture: develops communication standards and systems necessary for all other caBIG™ workspaces to inter-connect as a grid via the Internet, including solutions for access control, security, and patient data protection
§ Vocabularies and Common Data Elements: creates data standards, including the development, promotion, and support of vocabularies, ontologies, and common data elements to ensure that the entire caBIG™ community is speaking the same “language.” Such common data standards are a key component to ensure that large-scale NCI projects generate interoperable information
What is caBIG? § Common, widely distributed infrastructure that permits the cancer research community to focus on innovation § Shared, harmonized set of terminology, data elements, and data models that facilitate information exchange § Collection of interoperable applications developed to common standards § Cancer research data available for mining and integration
Driving needs § A multitude of “legacy” information systems, most of which cannot be readily shared between institutions § An absence of tools to connect different databases § An absence of common data formats § A huge and growing volume of data that must be collected, analyzed, and made accessible § Few common vocabularies, making it difficult, if not impossible, to interlink diverse research and clinical results § Difficulty in identifying and accessing available resources § An absence of information infrastructure to share data within an institution, or among different institutions
So there are technical challenges as well…
What is caGrid?
§ Development project of the Architecture Workspace
§ The Grid infrastructure for caBIG (the “G” in caBIG)
§ Driven from use cases and needs of the cancer research community
§ Service Oriented Architecture based on federation
§ Model Driven, Object-Oriented, Semantically-Annotated Data Virtualization
What is caGrid? cont…
§ Builds on existing Grid technologies
§ Provides additional enterprise Grid components:
§ Grid Service Graphical Development Toolkit
§ Metadata Infrastructure
§ Advertisement and Discovery
§ Semantic Services
§ Data Service Infrastructure
§ Analytical Service Infrastructure
§ Identifiers
§ Workflow
§ Security Infrastructure
§ Client tooling
Agenda § caBIG Overview § caGrid § Challenges of caBIG
Issue: Disparate systems § No common infrastructure for applications, databases, etc. § Variety of programming languages § Variety of platforms and operating systems § Inability to interoperate with other systems throughout the virtual organization
Approach: Disparate systems § Create and leverage a standards-based Grid (caGrid) § WSRF web services using SOAP/HTTP(S) § Creation of compatibility guidelines and review process § Define a uniform query interface and language for data-providing systems § Provide common infrastructure services for most federation scenarios § Focus on tools for virtualizing existing systems and APIs behind these grid interfaces § Open Issue: some systems require more manual work than others § Open Issue: tradeoff between specificity and universal applicability
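The virtualization idea above can be sketched as a facade: a legacy, institution-specific API wrapped behind a uniform query interface. This is a minimal illustration only; caGrid's real query interface is CQL carried over WSRF/SOAP services, and every class, method, and record here is invented for the example.

```python
# Hypothetical sketch: a legacy system exposed through a uniform
# query interface. All names and data are illustrative.

class LegacyGeneDb:
    """An institution-specific system with its own API."""
    def __init__(self):
        self._genes = [{"symbol": "TP53", "chromosome": "17"},
                       {"symbol": "BRCA1", "chromosome": "17"}]

    def find_by_symbol(self, symbol):
        return [g for g in self._genes if g["symbol"] == symbol]

class GridDataService:
    """Virtualizes the legacy API behind a common query() operation."""
    def __init__(self, backend):
        self.backend = backend

    def query(self, target, **attrs):
        # Translate the uniform query into the backend-specific call.
        if target == "Gene" and "symbol" in attrs:
            return self.backend.find_by_symbol(attrs["symbol"])
        raise NotImplementedError("query not supported by this backend")

service = GridDataService(LegacyGeneDb())
result = service.query("Gene", symbol="TP53")
```

The "manual work" open issue above shows up in exactly this translation step: each backend needs its own mapping from the uniform query to its native API.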
Introduce
§ Graphical Development Environment for Grid Services
§ Provides a simple means to create a service skeleton that a developer can then implement, build, and deploy
§ Provides a set of tools which enable the developer to add/remove/modify/import methods of the service
§ Automatic code generation (WSDL, service and client APIs, JNDI, WSDDs, security descriptors, metadata, etc.)
Issue: Lack of common Data Formats § Tools use widely varying and/or proprietary data formats § Lack of formal definition § Not all suitable for communication with remote systems § Lack of uniform way to discover and understand the formats
Approach: Lack of common Data Formats § Adopt XML as data exchange format § Leverage XML Schemas for definition § Global Model Exchange service for publishing, managing, and discovering XML Schemas § Leverage UML for logical definition of data models § Cancer Data Standards Repository (caDSR) captures logical model with annotations; facilitates reuse and formal definition § Formal binding of logical model (UML) and exchange model (XML) § Community review of the use of standards for new systems § Open Issue: Data translation still necessary when an existing system can’t be easily changed (though some caBIG tools exist to address this; e.g. caAdapter) § Open Issue: tradeoff between reuse and creating the new “perfect model”
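The binding between a logical-model object and its XML exchange form can be sketched in a few lines. This mirrors the Agent example used later in the deck; no real registered GME schema is consulted here, so take the element names as illustrative.

```python
import xml.etree.ElementTree as ET

# Sketch: serializing a logical-model object into the XML exchange
# form its registered schema would prescribe. Element names follow
# the deck's Agent example; the schema itself is assumed, not real.

def agent_to_xml(name, nsc_number):
    agent = ET.Element("Agent")
    ET.SubElement(agent, "name").text = name
    ET.SubElement(agent, "nSCNumber").text = nsc_number
    return ET.tostring(agent, encoding="unicode")

xml_doc = agent_to_xml("Taxol", "007")
# xml_doc: <Agent><name>Taxol</name><nSCNumber>007</nSCNumber></Agent>
```

In caGrid the equivalent serialization is generated from the UML model and validated against the XML schema registered in the GME, so every service exchanging Agent objects agrees on this wire form.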
Issue: Data Interoperability § Common data formats allow for syntactic data interoperability but are not sufficient for ensuring common semantics § May work with wholesale adoption of common domain-specific models, but breaks down cross-model § Need to understand the meaning of the value domains and terminology of a data format or system § Assumptions of meaning can be dangerous, even deadly, in the medical domain
Interoperability § The ability of multiple systems to exchange information and to be able to use the information that has been exchanged § Two aspects: syntactic interoperability and semantic interoperability
Semantics Example

<Agent>
  <name>Taxol</name>
  <nSCNumber>007</nSCNumber>
</Agent>

Class/Attribute | Example Data | CIA Definition | NCI Definition
Agent | | A sworn intelligence agent; a spy | Chemical compound administered to a human being to treat a disease or condition, or prevent the onset of a disease or condition
Agent.name | Taxol | CIA code name given to an intelligence agent | Common name of chemical compound used as an agent
Agent.nSCNumber | 007 | Identifier given to an intelligence agent by the National Security Council | Identifier given to chemical compound by the US Food and Drug Administration Nomenclature Standards Committee
Approach: Data Interoperability § Community maintained and curated shared ontology § Enterprise Vocabulary Services (EVS) maintains and provides access to the data semantics and controlled vocabulary of all models § Definitions, synonyms, relationships, etc. § All models in caDSR annotated with terminology and concepts from EVS § Focus on identifying “Common Data Elements” as semantically equivalent attributes § Based on ISO 11179 Information Technology – Metadata Registries (MDR) parts 1-6 § Community review of the use of standards and harmonization for new systems § Open Issue: Is it possible to scale to federated terminologies? § Open Issue: High initial cost to entry; high overhead to maintaining quality
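The Common Data Element idea can be sketched as a registry mapping model attributes to concept codes: two attributes are interoperable when they resolve to the same concept, regardless of their local names. The concept codes and model names below are invented for illustration; real annotations come from EVS-controlled terminology registered in caDSR.

```python
# Sketch of semantic equivalence via shared concept annotations.
# Concept codes are made up; they echo the deck's Agent/007 example.

CDE_REGISTRY = {
    ("ClinicalModel", "Agent.nSCNumber"): "C00123",  # hypothetical code
    ("GenomicsModel", "Drug.nscId"):      "C00123",  # different name, same concept
    ("SpyModel",      "Agent.nSCNumber"): "X00007",  # same name, different concept
}

def semantically_equivalent(a, b):
    """Attributes interoperate only if they map to the same
    registered concept, regardless of their local names."""
    return CDE_REGISTRY[a] == CDE_REGISTRY[b]
```

Note that the attribute spelled identically in ClinicalModel and SpyModel is *not* equivalent, while the differently named attributes in ClinicalModel and GenomicsModel are: the concept annotation, not the syntax, carries the meaning.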
caGrid Data Description Infrastructure • Client and service APIs are object oriented, and operate over well-defined and curated data types • Objects are defined in UML and converted into ISO/IEC 11179 Administered Components, which are in turn registered in the Cancer Data Standards Repository (caDSR) • Object definitions draw from controlled terminology and vocabulary registered in the Enterprise Vocabulary Services (EVS), and their relationships are thus semantically described • XML serialization of objects adheres to XML schemas registered in the Global Model Exchange (GME)
Issue: Finding Resources § Creating infrastructure for programmatic interoperability is excessive without a way to dynamically find and use previously unknown resources § Resources need to be self-descriptive enough such that their use and value can be determined
Approach: Finding Resources § Rich set of standardized metadata publicly provided by each service § Operations and data types described in terms of structure and semantics extracted from caDSR and EVS § Services register existence with Index Service, and metadata is aggregated § Tools for querying the Index Service and analyzing metadata are provided § Open Issue: Lines between data and metadata are blurry at best § Some key distinctions in caBIG are that metadata is publicly accessible, and describes "types" not instances
Advertisement and Discovery Process § All services register their service location and metadata information to an Index Service § The Index Service subscribes to the standardized metadata and aggregates their contents § Clients can discover services using a discovery API which facilitates inspection of data types § Leveraging semantic information (from which service metadata is drawn), services can be discovered by the semantics of their data types § “Find me all the services from Cancer Center X” § “Which Analytical services take Genes as input?” § “Which Data services expose data relating to lung cancer?”
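The example discovery queries above can be sketched as filters over the aggregated index metadata. The records and field names are simplified stand-ins for the real standardized service metadata, which the actual discovery API inspects through caDSR/EVS semantics rather than plain strings.

```python
# Sketch: discovery queries over aggregated index metadata.
# Service records and fields are illustrative stand-ins.

SERVICES = [
    {"name": "GeneAnnotator", "center": "Cancer Center X",
     "kind": "analytical", "inputs": ["Gene"]},
    {"name": "LungImageStore", "center": "Cancer Center Y",
     "kind": "data", "exposes": ["LungCarcinomaImage"]},
]

def services_from(center):
    """'Find me all the services from Cancer Center X'"""
    return [s["name"] for s in SERVICES if s["center"] == center]

def analytical_services_taking(data_type):
    """'Which Analytical services take Genes as input?'"""
    return [s["name"] for s in SERVICES
            if s["kind"] == "analytical" and data_type in s.get("inputs", [])]
```

Because the real metadata links each input and output type to EVS concepts, the second query can match on meaning (e.g. subtypes of Gene) rather than exact type names, which is the point of semantic discovery.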
Issue: Data Size
§ Numerous sources of large data sets
§ Imaging
§ Tumor Microenvironment: high resolution scanning = 25 TB/cm² of tissue
§ Image repositories: multiple modalities, thousands of cases, millions of images, terabytes of data
§ Mouse Models: terabytes of data
§ Proteomics
§ Modest example: 30 samples, 10 fractions, 10 runs, 1.5 MB per spectrum = 4.5 GB
§ Many others
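The "modest" proteomics example above works out as a quick arithmetic check (assuming one spectrum per sample/fraction/run combination, which is how the slide's figures multiply out):

```python
# Quick check of the slide's proteomics sizing example.
samples, fractions, runs = 30, 10, 10
mb_per_spectrum = 1.5

total_mb = samples * fractions * runs * mb_per_spectrum  # 3000 spectra * 1.5 MB
total_gb = total_mb / 1000                               # = 4.5 GB
```

Even this small study produces thousands of spectra; the imaging numbers on the same slide are orders of magnitude larger, which is what drives the transfer-avoidance strategies on the next slide.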
Approach: Data Size § Often a tradeoff between optimized performance and interoperability § e.g. out-of-band binary transfer vs XML/SOAP/HTTP § Currently Leveraging: § Transfer: WS-Enumeration, GridFTP (with integrated security and metadata) § Avoid Transfer: Identifiers, federated query, workflow, colocation § Looking at: § Moving services to data (Imaging) § Binary data format descriptions for binary metadata (e.g. DFDL) § A new area to address; much more to do…
Issue: User Accounting § Most legacy systems built with local users and permissions § Can’t require users to maintain hundreds of accounts, but still need to allow local policy § Central account management and identity vetting not tractable § But there are too many organizations with differing infrastructures to try to establish point-to-point relationships
Approach: User Accounting § Provide Single Sign On to grid via X.509 proxy certificates § Grid Authentication and Authorization with Reliably Distributed Services (GAARDS) § Federated Identity Management (Dorian) § Rely on participating institutions to vouch for identity of their members § Standardize on identity assertion language and attributes § Integrate existing institutional identity management systems, as Registration Authorities, into aggregate Certificate Authorities § Distribute revocations via Grid Trust Service (GTS); discussed later
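The single sign-on flow walked through on the following slides can be sketched as two steps: the local provider vouches for the user with a signed SAML assertion, and Dorian exchanges that assertion for a short-lived grid credential. Everything here is a hypothetical stand-in; the real Dorian is a Java WSRF service issuing actual X.509 proxy certificates, not dicts.

```python
# Hypothetical sketch of the GAARDS/Dorian single sign-on flow.
# All names, fields, and the password check are illustrative.

def authenticate_locally(username, password):
    """Local identity provider vouches for the user and returns a
    signed SAML assertion (represented here as a plain dict)."""
    assert password == "secret"  # stand-in for the institution's real check
    return {"subject": username, "issuer": "osu.edu", "signed": True}

def dorian_issue_proxy(saml_assertion):
    """Dorian validates the assertion against its registered trusted
    identity providers and issues a short-lived grid credential."""
    if not saml_assertion["signed"]:
        raise ValueError("unsigned assertion rejected")
    return {"dn": f"/O=caBIG/OU=Dorian/CN={saml_assertion['subject']}",
            "lifetime_hours": 12}

cred = dorian_issue_proxy(authenticate_locally("soster", "secret"))
```

The key design point survives even in this sketch: the user's long-term secret never leaves the home institution; only the assertion and the derived short-lived credential travel across the grid.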
GAARDS in Action § User authenticates to the local credential provider using their everyday user credentials and receives a SAML assertion
GAARDS in Action § Application obtains grid credentials from Dorian using the SAML assertion provided by the local provider
GAARDS in Action § Application uses grid credentials to invoke secure grid services
Issue: Data Privacy § Lots of interesting data involves human subjects in some form § Numerous barriers to data and resource sharing in caBIG § Federal, state, and local law; regulations; institutional policies § Institutional Review Boards (IRBs) involved for any protected health information (PHI); even for de-identified data § Grid is a new technology; IRBs must give very detailed protocol approvals § Most regulations are about more than just “who”; “how” and “for what” matter § Grid is multi-institutional, which means IRBs must reach agreements (read: separately employed lawyers working together) § Legal and policy requirements related to privacy and security drivers include: § HIPAA Privacy and Security Rules § The Common Rule for Human Subjects Research § FDA Regulations on Human Subjects § 21 CFR Part 11 § State and institutional requirements
Approach: Data Privacy § Though some aspects of the solution require technology (auditing, provenance, encryption/digital signing), the problem cannot be solved by technology alone § Data Sharing and Intellectual Capital Workspace (DSIC) § Identification of issues; development of guidelines; template agreements; education and training § Some caBIG (and external) tools exist for automated de-identification § Can leverage authorization solutions (GridGrouper for group-based; CSM for local policy; Globus PDPs for complex rules) § Open Issue: What technologies and policies (if any) can be universally adopted? § Open Issue: To date, the emphasis of security infrastructure development in caBIG has been on services, not data § Lots of work to do…
Issue: Intellectual Capital § Social problem § “Publish or perish” § Justified hesitance to share pre-publication data § Justified reluctance to advance the cause of competitors (industrial and academic) § Can I rely on the data/results of some (potentially) unknown entity? § If cancer is cured, and caBIG resources play a role, there will be much interest in knowing who contributed what (and who funded them) § Proper attribution is not just ethical, it’s often required
Approach: Intellectual Capital § Technological § Provenance may or may not be enough (annotation vs enforcement) § Socio-Cultural § Whole workspace in caBIG dedicated to it (DSIC) § NCI in a good position to “encourage” it § Large percentage of institutions’ cancer research funding comes from NCI § Hope is motivation will be value-based once initially primed § Starting to see movement from “wait and see” to active engagement; industry involvement § Lots of work to do…
Issue: Complicated Trust Arrangements § When hundreds of organizations are sharing data and providing access to each other’s systems, defining a trust model is complicated, even for public data § For non-public data/systems, the simplest/safest policy is “deny all” § For many data sets and services, the owning organization may be virtual § Central authority is socially and technologically intractable § Rapid propagation of information on compromised systems/individuals is critical
Approach: Complicated Trust Arrangements § Grid Authentication and Authorization with Reliably Distributed Services (GAARDS) § Federated Trust Models (GTS) § Establish and manage trust relationships between institutions through adherence to mutually agreed upon policy § Promote global policy distribution, but allow arbitrary local overrides § Provide enterprise tools and services for management and automated distribution of information
Grid Trust Service (GTS) Federation § A GTS can inherit Trusted Authorities and Trust Levels from other Grid Trust Services § Allows one to build a scalable Trust Fabric § Allows institutions to stand up their own GTS, inheriting all the trusted authorities in the wider grid, yet being able to add their own authorities that might not yet be trusted by the wider grid § A GTS can also be used to join the trust fabrics of two or more grids
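The inheritance rule above reduces to set union over the federation graph: a GTS trusts its own authorities plus everything its parents trust. The class and names below are invented for illustration; the real GTS is a WSRF service managing certificate authorities and trust levels, not Python sets.

```python
# Illustrative sketch of GTS trust-fabric inheritance.
# Class, attribute, and CA names are invented.

class GridTrustService:
    def __init__(self, own_authorities, parents=()):
        self.own = set(own_authorities)  # authorities this GTS vouches for
        self.parents = list(parents)     # GTSs it inherits from

    def trusted_authorities(self):
        """Own authorities plus everything inherited from parents."""
        trusted = set(self.own)
        for parent in self.parents:
            trusted |= parent.trusted_authorities()
        return trusted

# A local institutional GTS inherits the wider grid's authorities
# while adding one not yet trusted grid-wide.
root = GridTrustService({"NCI CA"})
local = GridTrustService({"OSU Campus CA"}, parents=[root])
```

Note the asymmetry the slide describes: the local GTS sees both authorities, but the wider grid's root GTS is unaffected by the local addition.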
GAARDS in Action § Application uses grid credentials to invoke secure grid services
GAARDS in Action § Grid service authenticates the user by asking the GTS whether or not the signer of the credential should be trusted (“Should I trust the credential signer?”)
Issue: Computationally Expensive § Many studies on molecular data require expensive calculations on large data sets § Statistical analysis, hypothesis testing, searches § Researchers lack necessary computing resources
Approach: Computationally Expensive § Variety of well-known solutions exist in the Grid and cluster space (a main driving force of their existence) § Challenge is in seamlessly integrating with the abstraction layer in use § i.e. operations on semantically annotated objects, not scheduled jobs on flat files § Leverage virtualization; domain-specific service interface over general computational resources § TeraGrid, supercomputer centers § Open Issue: Balancing abstraction vs control (e.g. scheduling priorities, cost models, optimizations, etc.) § Open Issue: Appropriate level of control for service as resource broker § Open Issue: Complexity moved from client to service developer (working on tools to facilitate)
caGrid/TeraGrid Overview
Issue: Evolving Infrastructure § Standards in the Web/Grid service domain are turbulent at best § Competing interests of “big business” and multiple standards bodies § Major revisions of toolkits generally not backwards compatible § Interface stability vs new features § Don’t want multiple grids § Upgrade or perish? Staying behind means lack of support § Application-layer abstractions help developers, but don’t address “wire incompatibility”
Approach: Evolving Infrastructure § Most traditional solutions are in conflict with strongly-typed requirements or complicate service development (unless extensibility is built into the spec) § e.g. lax processing; must-ignore/must-understand with schema overloading; multiple (protocol) service interfaces § Abstract specifications from developers with tooling § Focus on rigid “data format” specifications; allow more freedom on composition into messages § Open Issue: Doesn’t address wire incompatibility § Open Issue: No good solution § Do we need to just get it “good enough” and stabilize?
Summary § The bad news: § Large-scale, distributed knowledge sharing is hard • Disparate Systems • Lack of Common Data Formats • Data Interoperability • Finding Resources • Data Size • User Accounting • Data Privacy • Intellectual Capital • Complicated Trust Arrangements • Computationally Intensive • Evolving Infrastructure § The good news: § The potential rewards are large § The good news (for computer scientists): § There are lots of unsolved problems (and interest in getting them solved)
BACKUP SLIDES
Standardized Service Metadata § Common Service Metadata § Provided by all services § Details service’s capabilities, operations, contact information, hosting research center § Service operation’s inputs and outputs defined in terms of structure and semantics extracted from caDSR and EVS § Service Security Metadata § Provided by all services § Details the service’s requirements on the communication channel for each operation § Can be used by a client to programmatically negotiate an acceptable means of communication § Data Service Metadata § Provided by all data services § Describes the Domain Model being exposed, in terms of a UML model linked to semantics § Provides information needed to formulate the Object-Oriented Query § As with common metadata, data types defined in terms of structure and semantics extracted from caDSR and EVS
caBIG Data Hierarchy
Level I: Collection • Access control • Patient privacy • Data integrity • Provenance metadata for attribution • Authentication of authorship • Information security
Level II: Closed Distribution • External access controls • Dynamic permissions for limited access • Materials transfer issues • Mechanisms for data escrow
Level III: Public Distribution/Access • Data released from escrow • Data transmission security • Dynamic permissions for general access • Provision for IP ownership as opposed to access
Level IV: Post-Publication Attribution • Provenance metadata for publication • Community standards for attribution of authorship • Dynamic permissions for general release • Data escrow for publication
Issues affecting the data are cumulative, i.e. data functioning in Level II will also raise the issues raised for Level I data; Level III data will require attention to the issues raised by both Level I and II, et cetera.
Level I data issues Level I data is all data collected by the caBIG system, including patient data, analyses, records, and research, regardless of whether that data is released to other researchers, the public, or parties other than the one that originally provides the data to the system. Issues raised include: § Access Controls. Management, operational, and technical controls are necessary to create a methodology for restricting access to data in caBIG consistent with the authorization of the individual or entity. § Patient Privacy. Data must be collected and stored in a manner that protects the privacy interests of the data subjects, consistent with the HIPAA Privacy Rule, the Common Rule of human subjects research (reflected in the Code of Federal Regulations), and other state, local, ethical, and institutional requirements. § Data integrity. Mechanisms must be available to ascertain that data has been entered accurately and will not be inappropriately modified in the transfer from its point of origin, while maintained in caBIG, or subsequently. § Provenance metadata for attribution. Individual contributors’ interests must be protected by assuring that the system allows data submitted to be associated with information concerning its authorship, collection, or creation, and that a mechanism exists for data originators to amend incorrect provenance information. § Authentication of authorship. Mechanisms and processes must be available to verify that provenance data correctly identifies the source of contributed data and information. Such protections may include digital signatures (as described in 21 CFR 11) and other methods. § Information Security. Data must be collected and stored in a manner that protects the privacy interests of the data subjects, consistent with the HIPAA Security Rule, the Federal Information Security Management Act of 2002, and other Federal, state, local, ethical, and institutional requirements.
Level II data issues Level II data is data that is collected and then shared by some limited subset of potential data users, but not all caBIG users or the general public. These individuals could include, for example, the party that contributed the data only, individuals that have reached private agreements with those that have contributed the data, or individuals granted “role-based” access to certain categories. Issues raised at this level include: § External Access Controls. Level I data requires controls for access to caBIG; Level II data requires management, operational, and technical controls to limit access to the caBIG users authorized to view data originated elsewhere. § Dynamic permissions for limited access. Access controls will need to be flexible enough to change what data individuals have access to as roles, agreements, and activities of individual system users change over time. § Integration with materials transfer processes. Information sharing practices facilitated by caBIG must be aligned with practices for individuals or groups that share, transfer, or provide access to tissues, cultures, cell lines, research animals, or other material shipped from one location to another. § Mechanisms for data escrow. Common research practices require data to be available for verification of research findings but not available for access, alteration, or further analysis until the validity of research findings is verified. caBIG will need to include a mechanism to allow data stored on the system to be partitioned off consistent with these requirements.
Level III data issues Level III data is data made available to general audiences, including all caBIG users, all interested researchers, or the general public. Level III data issues include: § Data released from escrow. Once data has been cleared for general access (either due to the conclusion of the prepublication issues described under Level II data above, or pursuant to an arrangement with the data’s originator), it must be made available in a manner consistent with caBIG policy and in a way that does not compromise the data’s integrity or required attribution. § Data transmission security. Management, operational, and technical controls should assure that data integrity is not compromised in transit, and that poor security practices on the part of caBIG system users do not create platforms for security breaches of the caBIG system itself. § Dynamic permissions for general access. As with Level II data, access must be granted appropriately to users. As well as access levels for users, data must also be assigned security categories such that data can be re-categorized from having a specified, limited availability to becoming more generally available. § Provision for intellectual property ownership as opposed to access. Researchers may be willing to share data for limited purposes or a limited data set, and may wish to retain rights to be acknowledged for collecting data, generating analyses, or previous publications. Mechanisms must be in place to allow individuals the ongoing ability to benefit from their research or retain exclusive rights to it if contracts or other conditional agreements so require.
Level IV data issues Level IV data is data that will be used for analyses, research, or other writing that will be attributed to one or more authors or individuals as the author, creator, sponsor, or other related party. Level IV data requires special protections such that the proper attribution is received for the particular accomplishments or expertise associated with that data. Level IV data issues include: § Provenance metadata for publication. Provenance metadata provisions for Level I data should govern the rights, restrictions, and considerations relevant to authorship and attribution for data to be published. § Community standards for attribution of authorship. An appropriate, written protocol that accounts for existing law, policy, and custom should exist to reflect how data and information is generated and in what capacity each participant contributed. § Dynamic permissions for general release. Data permissions for attributed data may require escrow; delivery to third parties for verification and analyses; and added provenance metadata for modified or concurrently developed material. § Data escrow for publication. Many journals require that data used in publications remain in escrow prior to (or following) publication to allow other researchers to validate findings. caBIG processes would need to be compatible with this requirement.