Data Integration in the Life Sciences Kenneth Griffiths and Richard Resnick

Tutorial Agenda
1:30 – Introduction
1:45 – Tutorial Survey
2:00 – Approaches to Integration
3:05 – Bio Break
4:00 – Approaches to Integration (cont.)
4:15 – Question and Answer Break
4:30 – Metadata Session
5:00 – Domain-specific example (GxP)
5:30 – Wrap-up

Life Science Data
Recent focus on genetic data:
"genomics: the study of genes and their function. Recent advances in genomics are bringing about a revolution in our understanding of the molecular mechanisms of disease, including the complex interplay of genetic and environmental factors. Genomics is also stimulating the discovery of breakthrough healthcare products by revealing thousands of new biological targets for the development of drugs, and by giving scientists innovative ways to design new drugs, vaccines and DNA diagnostics. Genomics-based therapeutics include "traditional" small chemical drugs, protein drugs, and potentially gene therapy."
The Pharmaceutical Research and Manufacturers of America - http://www.phrma.org/genomics/lexicon/g.html
• Study of genes and their function
• Understanding molecular mechanisms of disease
• Development of drugs, vaccines, and diagnostics

The Study of Genes...
• Chromosomal location
• Sequence
• Variation
• Splicing
• Protein Sequence
• Protein Structure

… and Their Function
• Homology
• Motifs
• Publications
• Expression
• HTS
• In Vivo/Vitro Functional Characterization

Understanding Mechanisms of Disease Metabolic and regulatory pathway induction

Development of Drugs, Vaccines, Diagnostics
Differing types of drugs, vaccines, and diagnostics:
• Small molecules
• Protein therapeutics
• Gene therapy
• In vitro, in vivo diagnostics
Development requires:
• Preclinical research
• Clinical trials
• Long-term clinical research
All of which often feed back into ongoing genomics research and discovery.

The Industry’s Problem Too much unintegrated data: – from a variety of incompatible sources – no standard naming convention – each with a custom browsing and querying mechanism (no common interface) – and poor interaction with other data sources

What are the Data Sources?
• Flat Files
• URLs
• Proprietary Databases
• Public Databases
• Data Marts
• Spreadsheets
• Emails
• …

Sample Problem: Hyperprolactinemia
Overproduction of prolactin:
– prolactin stimulates mammary gland development and milk production
Hyperprolactinemia is characterized by:
– inappropriate milk production
– disruption of the menstrual cycle
– can lead to conception difficulty

Understanding transcription factors for prolactin production
"Show me all genes in the public literature that are putatively related to hyperprolactinemia, have more than 3-fold expression differential between hyperprolactinemic and normal pituitary cells, and are homologous to known transcription factors." (Q1 ∧ Q2 ∧ Q3)
• SEQUENCE: "Show me all genes that are homologous to known transcription factors"
• EXPRESSION: "Show me all genes that have more than 3-fold expression differential between hyperprolactinemic and normal pituitary cells"
• LITERATURE: "Show me all genes in the public literature that are putatively related to hyperprolactinemia"

Approaches to Integration In order to ask this type of question across multiple domains, data integration at some level is necessary. When discussing the different approaches to data integration, a number of key issues need to be addressed: • Accessing the original data sources • Handling redundant as well as missing data • Normalizing analytical data from different data sources • Conforming terminology to industry standards • Accessing the integrated data as a single logical repository • Metadata (used to traverse domains)

Approaches to Integration (cont.)
So if one agrees that the preceding issues are important, where are they addressed? In the client application, the middleware, or the database? Where they are addressed can make a huge difference in usability and performance. Currently there are a number of approaches for data integration:
• Federated Databases
• Data Warehousing
• Indexed Data Sources
• Memory-mapped Data Structures

Federated Database Approach
[Diagram] An Integrated Application poses the combined question (Q1 ∧ Q2 ∧ Q3) and dispatches its parts through middleware (CORBA, DCOM, etc.) to the underlying sources:
• SEQUENCE (SeqWeb, GenBank, proprietary) – "Show me all genes that are homologous to known transcription factors"
• EXPRESSION (TxP App, cDNA µArray DB, Oligo TxP DB) – "Show me all genes that have more than 3-fold expression differential between hyperprolactinemic and normal cells"
• LITERATURE (PubMed, proprietary app, Medline) – "Show me all genes in the public literature that are putatively related to hyperprolactinemia"

Advantages to Federated Database Approach
• quick to configure
• architecture is easy to understand - no knowledge of the domain is necessary
• achieves a basic level of integration with minimal effort
• can wrap and plug in new data sources as they come into existence

Problems with Federated Database Approach
• Integration of queries and query results occurs at the integrated application level, requiring complex low-level logic to be embedded at the highest level
• Naming conventions across systems must be adhered to or query results will be inaccurate - imposes constraints on original data sources
• Data sources are not necessarily clean; integrating dirty data makes integrated dirty data
• No query optimization across multiple systems can be performed
• If one source system goes down, the entire integrated application may fail
• Not readily suitable for data mining or generic visualization tools
• Relies on CORBA or other middleware technology, shown to have performance (and reliability?) problems

Solving Federated Database Problems
[Diagram] The same federated architecture, with a Semantic Cleaning Layer inserted between the Integrated Application and the middleware (CORBA, DCOM, etc.), and a Relationship Service alongside the SEQUENCE (SeqWeb, GenBank, proprietary), EXPRESSION (TxP App, cDNA µArray DB, Oligo TxP DB), and LITERATURE (PubMed, proprietary app, Medline) sources.

Data Warehousing for Integration
Data warehousing is a process as much as it is a repository. There are a couple of primary concepts behind data warehousing:
• ETL (Extraction, Transformation, Load)
• Component-based (datamarts)
• Typically utilizes a dimensional model
• Metadata-driven

Data Warehousing
[Diagram] Source systems feed the Data Warehouse (integrated datamarts) through E (Extraction), T (Transformation), and L (Load) steps.

Data-level Integration Through Data Warehousing
[Diagram] The SEQUENCE (SeqWeb, GenBank, proprietary), EXPRESSION (TxP App, cDNA µArray DB, Oligo TxP DB), and LITERATURE (PubMed, proprietary app, Medline) sources feed a Data Staging Layer (ETL), which loads the Data Warehouse. A metadata layer sits between the warehouse and the presentation applications, which can then ask the combined question directly (Q1 ∧ Q2 ∧ Q3): "Show me all genes in the public literature that are putatively related to hyperprolactinemia, have more than 3-fold expression differential between hyperprolactinemic and normal pituitary cells, and are homologous to known transcription factors."
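
To make the contrast with the federated approach concrete, once the three domains live in one schema the combined question can be expressed as a single query. The sketch below is purely illustrative; the table and column names (gene_dim, literature_fact, expression_fact, homology_fact and their columns) are assumptions, not the warehouse design presented in this tutorial.

-- Hypothetical warehouse query answering Q1 ∧ Q2 ∧ Q3 in one statement.
SELECT DISTINCT g.gene_name
FROM gene_dim g, literature_fact l, expression_fact e, homology_fact h
WHERE l.gene_key = g.gene_key
  AND e.gene_key = g.gene_key
  AND h.gene_key = g.gene_key
  AND l.disease_term = 'hyperprolactinemia'                      -- literature evidence
  AND e.comparison = 'hyperprolactinemic vs normal pituitary'
  AND e.fold_change > 3                                          -- expression differential
  AND h.target_class = 'transcription factor';                   -- homology to TFs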

Data Staging
Storage area and set of processes that:
• extracts source data
• transforms data
• cleans incorrect data, resolves missing elements, enforces standards conformance
• purges fields not needed
• combines data sources
• creates surrogate keys for data to avoid dependence on legacy keys
• builds aggregates where needed
• archives/logs
• loads and indexes data
Does not provide query or presentation services.
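
As one illustration of the kind of work done in the staging area, the sketch below conforms gene symbols against a synonym table and assigns warehouse surrogate keys before loading a dimension. All table, column, and sequence names (stage_expression, gene_synonym, gene_dim, gene_key_seq) are hypothetical, and the Oracle-style sequence is just one way to generate surrogate keys.

-- Conform source gene symbols to the standard symbol.
UPDATE stage_expression s
SET gene_symbol = (SELECT g.std_symbol FROM gene_synonym g WHERE g.synonym = s.gene_symbol)
WHERE EXISTS (SELECT 1 FROM gene_synonym g WHERE g.synonym = s.gene_symbol);

-- Load the gene dimension with a surrogate key, independent of legacy source keys.
INSERT INTO gene_dim (gene_key, gene_symbol, source_system, source_id)
SELECT gene_key_seq.NEXTVAL, s.gene_symbol, s.source_system, s.source_id
FROM (SELECT DISTINCT gene_symbol, source_system, source_id FROM stage_expression) s;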

Data Staging (cont.)
• Sixty to seventy percent of development is here
• Engineering is generally done using database automation and scripting technology
• Staging environment is often an RDBMS
• Generally done in a centralized fashion and as often as desired, having no effect on source systems
• Solves the integration problem once and for all, for most queries

Warehouse Development and Deployment
Two development paradigms:
• Top-down warehouse design: conceptualize the entire warehouse, then build; tends to take 2 years or more, and requirements change too quickly
• Bottom-up design and deployment: pivoted around completely functional subsections of the warehouse architecture; takes 2 months and enables modular development

Warehouse Development and Deployment (cont.)
The Data Mart: "A logical subset of the complete data warehouse"
• represents a completable project
• by itself is a fully functional data warehouse
• A Data Warehouse is the union of all constituent data marts
• Enables bottom-up development

Warehouse Development and Deployment (cont.)
Examples of data marts in Life Science:
– Sequence/Annotation - brings together sequence and annotation from public and proprietary dbs
– Expression Profiling datamart - integrates multiple TxP approaches (cDNA, oligo)
– High-throughput screening datamart - stores HTS information on proprietary high-throughput compound screens
– Clinical trial datamart - integrates clinical trial information from multiple trials
All of these data marts are pieced together along conformed entities as they are developed, bottom up.

Advantages of Data-level Integration Through Data Warehousing
• Integration of data occurs at the lowest level, eliminating the need for integration of queries and query results
• Run-time semantic cleaning services are no longer required; this work is performed in the data staging environment
• FAST!
• Original source systems are left completely untouched, and if they go down, the Data Warehouse still functions
• Query optimization across multiple systems’ data can be performed
• Readily suitable for data mining by generic visualization tools

Issues with Data-level Integration Through Data Warehousing • ETL process can take considerable time and effort • Requires an understanding of the domain to represent relationships among objects correctly • More scalable when accompanied by a Metadata repository which provides a layer of abstraction over the warehouse to be used by the application. Building this repository requires additional effort.

Indexing Data Sources
• Indexes and links a large number of data sources (e.g., files, URLs)
• Data integration takes place by using the results of one query to link and jump to a keyed record in another location
• Users have the ability to develop custom applications by using a vendor-specific language

Indexed Data Source Architecture
[Diagram] Sequence indexed data sources, GxP indexed data sources, and SNP information, each with its own index (I), tied together by an Index Traversal Support Mechanism.

Indexed Data Sources: Pros and Cons
Advantages
• quick to set up
• easy to understand
• achieves a basic level of integration with minimal effort
Disadvantages
• does not clean and normalize the data
• does not have a way to directly integrate data from relational DBMSs
• difficult to browse and mine
• sometimes requires knowledge of a vendor-specific language

Memory-mapped Integration • The idea behind this approach is to integrate the actual analytical data in memory and not in a relational database system • Performance is fast since the application retrieves the data from memory rather than disk • True data integration is achieved for the analytical data but the descriptive or complementary data resides in separate databases

Memory Map Architecture
[Diagram] Sample/source information and sequence databases (Sequence DB #1, Sequence DB #2) feed a Data Integration Layer that builds the memory-mapped integrated data; descriptive information remains in separate databases, reached via CORBA.

Memory Maps: Pros and Cons
Advantages
• true “analytical” data integration
• quick access
• cleans analytical data
• simple matrix representation
Disadvantages
• typically does not put non-analytical data (gene names, tissue types, etc.) through the ETL process
• not easily extensible when adding new databases with descriptive information
• performance hit when accessing anything outside of memory (tough to optimize)
• scalability restricted by memory limitations of machine
• difficult to mine due to complicated architecture

The Need for Metadata
For all of the previous approaches, one underlying concept plays a critical role in their success: metadata.
Metadata is a concept that many people still do not fully understand. Some common questions include:
• What is it?
• Where does it come from?
• Where do you keep it?
• How is it used?

Metadata “The data about the data…” • Describes data types, relationships, joins, histories, etc. • A layer of abstraction, much like a middle layer, except. . . • Stored in the same repository as the data, accessed in a consistent “database-like” way

Metadata (cont.)
Back-end metadata - supports the developers
• Source system metadata: versions, formats, access stats, verbose information
• Business metadata: schedules, logs, procedures, definitions, maps, security
• Database metadata: data models, indexes, physical & logical design, security
Front-end metadata - supports the scientist and application
• Nomenclature metadata: valid terms, mapping of DB field names to understandable names
• Query metadata: query templates, join specifications, views; can include back-end metadata
• Reporting/visualization metadata: template definitions, association maps, transformations
• Application security metadata: security profiles at the application level
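
For instance, nomenclature metadata is often nothing more than a table mapping physical column names to scientist-friendly names, which a generic browser or query tool can read at run time. The layout below is a hypothetical illustration, not a prescribed design; the table and column names are assumptions.

-- Hypothetical nomenclature metadata table.
CREATE TABLE nomenclature_metadata (
    table_name   VARCHAR(64),
    column_name  VARCHAR(64),
    display_name VARCHAR(128),   -- name shown to the scientist
    valid_terms  VARCHAR(512),   -- optional controlled vocabulary
    PRIMARY KEY (table_name, column_name)
);

-- A generic application can label any result set by looking up its columns:
SELECT column_name, display_name
FROM nomenclature_metadata
WHERE table_name = 'GENE_EXPRESSION_RESULT';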

Metadata Benefits • Enables the application designer to develop generic applications that grow as the data grows • Provides a repository for the scientist to become better informed on the nature of the information in the database • Is a high-performance alternative to developing an object-relational layer between the database and the application • Extends gracefully as the database extends

Integration Technologies • Technologies that support integration efforts • Data Interchange • Object Brokering • Modeling techniques

Data Interchange
• Standards for inter-process and inter-domain communication
• Two types of data:
– Data – the actual information that is being interchanged
– Metadata – the information on the structural and semantic aspects of the Data
• Examples: EMBL format, ASN.1, XML

XML Emerges
• Allows uniform description of data and metadata
– Metadata described through DTDs
– Data conforms to metadata description
• Provides open source solution for data integration between components
• Lots of support in Comp Sci community (proportional to cardinality of Perl modules developed)
– XML::CGI - a module to convert CGI parameters to and from XML
– XML::DOM - a Perl extension to XML::Parser. It adds a new 'Style' to XML::Parser, called 'Dom', that allows XML::Parser to build an Object Oriented data structure with a DOM Level 1 compliant interface.
– XML::Dumper - a simple package to experiment with converting Perl data structures to XML and converting XML to Perl data structures.
– XML::Encoding - a subclass of XML::Parser that parses encoding map XML files.
– XML::Generator - an extremely simple module to help in the generation of XML.
– XML::Grove - provides simple objects for parsed XML documents. The objects may be modified but no checking is performed.
– XML::Parser - a Perl extension interface to James Clark's XML parser, expat
– XML::QL - an early implementation of a note published by the W3C called "XML-QL: A Query Language for XML".
– XML::XQL - a Perl extension that allows you to perform XQL queries on XML object trees.

XML in Life Sciences
• Lots of momentum in Bio community
• GFF (Gene Finding Features)
• GAME (Genomic Annotation Markup Elements)
• BIOML (BioPolymer Markup Language)
• EBI’s XML format for gene expression data
• …
• Will be used to specify ontological descriptions of Biology data

XML – DTDs
• Interchange format defined through a DTD – Document Type Definition

<!ELEMENT bioxml-game:seq_relationship (bioxml-game:span, bioxml-game:alignment?)>
<!ATTLIST bioxml-game:seq_relationship
    seq  IDREF #IMPLIED
    type (query | subject | peer | subseq) #IMPLIED >

• And data conforms to the DTD

<seq_relationship seq="seq1" type="query">
  <span>
    <begin>10</begin>
    <end>15</end>
  </span>
</seq_relationship>
<seq_relationship seq="seq2" type="subject">
  <span>
    <begin>20</begin>
    <end>25</end>
  </span>
  <alignment>
    query:   atgccg
             ||| ||
    subject: atgacg
  </alignment>
</seq_relationship>

XML Summary
Benefits
• Metadata and data have same format
• HTML-like
• Broad support in Comp Sci and Biology
• Sufficiently flexible to represent any data model
• XSL style sheets map from one DTD to another
Drawbacks
• Doesn’t allow for abstraction or partial inheritance
• Interchange can be slow in certain data migration tasks

Object Brokering • The details of data can often be encapsulated in objects – Only the interfaces need definition – Forget DTDs and data description • Mechanisms for moving objects around based solely on their interfaces would allow for seamless integration

Enter CORBA • Common Object Request Broker Architecture • Applications have access to method calls through IDL stubs • Makes a method call which is transferred through an ORB to the Object implementation • Implementation returns result back through ORB

CORBA IDL • IDL – Interface Definition Language – Like C++/Java headers, but with slightly more type flexibility

CORBA Summary
Benefits
• Distributed
• Component-based architecture
• Promotes reuse
• Doesn’t require knowledge of implementation
• Platform independent
Drawbacks
• Level of abstraction is sometimes not useful
• Can be slow to broker objects
• Different ORBs do different things
• Unreliable?
• OMG website is brutal

Modeling Techniques
E-R Modeling
• Optimized for transactional data
• Eliminates redundant data
• Preserves dependencies in UPDATEs
• Doesn’t allow for inconsistent data
• Useful for transactional systems
Dimensional Modeling
• Optimized for queryability and performance
• Does not eliminate redundant data, where appropriate
• Constraints unenforced
• Models data as a hypercube
• Useful for analytical systems

Illustrating Dimensional Data Space
Sample problem: monitoring a temperature-sensitive room for fluctuations.
x, y, z, and time uniquely determine a temperature value: (x, y, z, t) → temperature
• Independent variables: x, y, z, t
• Dependent variable: temperature
Nomenclature: “x, y, z, and t are dimensions”; “temperature is a fact”; “the data space is a hypercube of size 4”

Dimensional Modeling Primer
• Represents the data domain as a collection of hypercubes that share dimensions
– Allows for highly understandable data spaces
– Direct optimizations for such configurations are provided through most DBMS frameworks
– Supports data mining and statistical methods such as multidimensional scaling, clustering, self-organizing maps
– Ties in directly with most generalized visualization tools
– Only two types of entities - dimensions and facts

Dimensional Modeling Primer: Relational Representation
• Contains a table for each dimension
• Contains one central table for all facts, with a multi-part key
• Each dimension table has a single-part primary key that corresponds to exactly one of the components of the multi-part key in the fact table
The Star Schema - the basic component of Dimensional Modeling: a central Temperature fact table whose foreign keys point to the X, Y, Z, and Time dimension tables, each with its own primary key.
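
In SQL terms, the temperature example might look like the sketch below. Table names, column names, and data types are illustrative assumptions, not part of the tutorial's own schema.

-- Hypothetical star schema for the temperature-monitoring example.
CREATE TABLE x_dimension    (x_key    INTEGER PRIMARY KEY, x_position   DECIMAL(8,2));
CREATE TABLE y_dimension    (y_key    INTEGER PRIMARY KEY, y_position   DECIMAL(8,2));
CREATE TABLE z_dimension    (z_key    INTEGER PRIMARY KEY, z_position   DECIMAL(8,2));
CREATE TABLE time_dimension (time_key INTEGER PRIMARY KEY, reading_time TIMESTAMP);

CREATE TABLE temperature_fact (
    x_key       INTEGER REFERENCES x_dimension,
    y_key       INTEGER REFERENCES y_dimension,
    z_key       INTEGER REFERENCES z_dimension,
    time_key    INTEGER REFERENCES time_dimension,
    temperature DECIMAL(5,2),                     -- the fact
    PRIMARY KEY (x_key, y_key, z_key, time_key)   -- multi-part key
);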

Dimensional Modeling Primer: Relational Representation
• Each dimension table most often contains descriptive textual information about a particular scientific object. Dimension tables are typically the entry points into a datamart. Examples: “Gene”, “Sample”, “Experiment”
• The fact table relates the dimensions that surround it, expressing a many-to-many relationship. The more useful fact tables also contain “facts” about the relationship -- additional information not stored in any of the dimension tables.

Dimensional Modeling Primer: Relational Representation
• Dimension tables are typically small, on the order of 100 to 100,000 records. Each record measures a physical or conceptual entity.
• The fact table is typically very large, on the order of 1,000 or more records. Each record measures a fact around a grouping of physical or conceptual entities.

Dimensional Modeling Primer: Relational Representation
Neither dimension tables nor fact tables are necessarily normalized!
• Normalization increases complexity of design and worsens performance with joins
• Non-normalized tables can easily be understood with SELECT and GROUP BY
• Database tablespace is therefore required to be larger to store the same data - the gain in overall performance and understandability outweighs the cost of extra disks!
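
Continuing the hypothetical temperature schema, and assuming the time dimension also carries a denormalized reading_hour attribute, a scientist can roll facts up with nothing more than a join and a GROUP BY:

-- Average temperature per hour: a typical analytical rollup over the star schema.
SELECT t.reading_hour, AVG(f.temperature) AS avg_temperature
FROM temperature_fact f, time_dimension t
WHERE f.time_key = t.time_key
GROUP BY t.reading_hour
ORDER BY t.reading_hour;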

Case in Point: Sequence Clustering
“Show me all sequences in the same cluster as sequence XA501 from my last run.”
[E-R diagram] Run (run_id, who, when, purpose), Result (runkey (fk), seqkey (fk)), Cluster (cluster_id), ParamSet (paramset_id), Sequence (seq_id, bases, length), Membership (start, length, orientation), Subcluster (subcluster_id), Parameters (param_name, param_value)
PROBLEMS
• not browsable (confusing)
• poor query performance
• little or no data mining support

SELECT SEQ_ID
FROM SEQUENCE, MEMBERSHIP, SUBCLUSTER
WHERE SEQUENCE.SEQKEY = MEMBERSHIP.SEQKEY
AND MEMBERSHIP.SUBCLUSTERKEY = SUBCLUSTERKEY
AND SUBCLUSTERKEY =
  ( SELECT CLUSTERKEY
    FROM SEQUENCE, MEMBERSHIP, SUBCLUSTER, RESULT, RUN
    WHERE SEQUENCE.RESULTKEY = RESULTKEY
    AND RESULT.RUNKEY = RUNKEY
    AND SEQUENCE.SEQKEY = MEMBERSHIP.SEQKEY
    AND MEMBERSHIP.SUBCLUSTERKEY = SUBCLUSTERKEY
    AND SUBCLUSTERKEY = CLUSTERKEY
    AND SEQUENCE.SEQID = 'XA501'
    AND RESULT.RUNID = 'my last run' )

Dimensionally Speaking… Sequence Clustering
CONCEPTUAL IDEA - The Star Schema: a historical, denormalized, subject-oriented view of scientific facts -- the data mart. A centralized fact table stores the single scientific fact of sequence membership in a cluster and a subcluster. Smaller dimensional tables around the fact table represent key scientific objects (e.g., sequence).
“Show me all sequences in the same cluster as sequence XA501 from my last run.”
[Star schema] Membership Facts (seq_id, cluster_id, subcluster_id, run_id, paramset_id, run_date, run_initiator, seq_start, seq_end, seq_orientation, cluster_size, subcluster_size); Sequence (seq_id, bases, length, type); Run (run_id, run_date, run_initiator, run_purpose, run_remarks); Parameters (paramset_id, param_name, param_value)
Benefits
• Highly browsable, understandable model for scientists
• Vastly improved query performance
• Immediate data mining support
• Extensible “database componentry” model

SELECT SEQ_ID
FROM MEMBERSHIP_FACTS
WHERE CLUSTER_ID IN
  ( SELECT CLUSTER_ID
    FROM MEMBERSHIP_FACTS
    WHERE SEQ_ID = 'XA501'
    AND RUN_ID = 'my last run' )

Dimensional Modeling Strengths • Predictable, standard framework allows database systems and end user query tools to make strong assumptions about the data • Star schemas withstand unexpected changes in user behavior -every dimension is equivalent: symmetrically equal entry points into the fact table. • Gracefully extensible to accommodate unexpected new data elements and design decisions • High performance, optimized for analytical queries

The Need for Standards In order for any integration effort to be successful, there needs to be agreement on certain topics: • Ontologies: concepts, objects, and their relationships • Object models: how are the ontologies represented as objects • Data models: how the objects and data are stored persistently

Standard Bio-Ontologies
Currently, there are efforts being undertaken to help identify a practical set of technologies that will aid in the knowledge management and exchange of concepts and representations in the life sciences.
GO Consortium: http://genome-www.stanford.edu/GO/
The third annual Bio-Ontologies meeting is being held after ISMB 2000 on August 24th.

Standard Object Models
Currently, there is an effort being undertaken to develop object models for the different domains in the Life Sciences. This is primarily being done by the Life Science Research (LSR) working group within the OMG (Object Management Group). Please see their homepage for further details: http://www.omg.org/homepages/lsr/index.html

In Conclusion
• Data integration is the problem to solve to support human and computer discovery in the Life Sciences.
• There are a number of approaches one can take to achieve data integration.
• Each approach has advantages and disadvantages associated with it. Particular problem spaces require particular solutions.
• Regardless of the approach, Metadata is a critical component for any integrated repository.
• Many technologies exist to support integration.
• Technologies do nothing without syntactic and semantic standards.

Accessing Integrated Data
Once you have an integrated repository of information, access tools enable future experimental design and discovery. They can be categorized into four types:
– browsing tools
– query tools
– visualization tools
– mining tools

Browsing
One of the most critical requirements that is overlooked is the ability to “browse” the integrated repository, since users typically do not know what is in it and are not familiar with other investigators’ projects. Requirements include:
• ability to view summary data
• ability to view high-level descriptive information on a variety of objects (projects, genes, tissues, etc.)
• ability to dynamically build queries while browsing (using a wizard or drag-and-drop mechanism)

Querying
Along with browsing, retrieving the data from the repository is one of the most underdeveloped areas in bioinformatics. All of the visualization tools that are currently available are great at visualizing data, but if users cannot get their data into these tools, how useful are they? Requirements include:
• ability to intelligently help the user build ad-hoc queries (wizard paradigm, dynamic filtering of values)
• provide a “power user” interface for analysts (query templates with the ability to edit the actual SQL, as sketched below)
• should allow users to iterate over the queries so they do not have to build them from scratch each time
• should be tightly integrated with the browser to allow for easier query construction
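
One way to realize the query-template idea is to store the SQL with named placeholders that the wizard fills in, or that a power user edits directly. The sketch below borrows the GENE_EXPRESSION_RESULT and SEQUENCE tables from the warehouse shown at the end of this tutorial; the placeholder names and thresholds are illustrative assumptions.

-- Hypothetical query template: genes on a chip above a fold-change threshold.
-- :chip_name and :min_fold_change are filled in by the wizard or the analyst.
SELECT s.gene_name, r.fold_change
FROM gene_expression_result r, sequence s
WHERE s.sequence_key = r.sequence_key
  AND r.chip_type = :chip_name
  AND r.fold_change >= :min_fold_change
ORDER BY r.fold_change DESC;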

Visualizing
There are a number of visualization tools currently available to help investigators analyze their data. Some are easier to use than others, and some are better suited for either smaller or larger data sets. Regardless, they should all:
• be easy to use
• save templates which can be used in future visualizations
• view different slices of the data simultaneously
• apply complex statistical rules and algorithms to the data to help elucidate associations and relationships

Data Mining
Life science has large volumes of data that, in its rawest form, is not easy to use to help drive new experimentation. Ideally, one would like to automate data mining tools to extract “information” by allowing them to take advantage of a predictable database architecture. This is more easily attainable using dimensional modeling (star schemas) than E-R modeling, since E-R schemas are very different from database to database and do not conform to any standard architecture.
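
Because every data mart follows the same fact-and-dimension pattern, a mining tool can generate exploratory queries mechanically, for example ranking genes by how often they show large expression changes. The sketch below assumes the gene-expression star schema shown on the following slides; the 3-fold threshold is an illustrative choice.

-- Hypothetical automatically generated mining query: which genes change most often?
SELECT s.gene_name, COUNT(*) AS experiments_with_large_change
FROM gene_expression_result r, sequence s
WHERE s.sequence_key = r.sequence_key
  AND ABS(r.fold_change) >= 3
GROUP BY s.gene_name
ORDER BY experiments_with_large_change DESC;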

Database Schemas for 3 Independent Genomics Systems
[Diagram: three independent E-R schemas, each with its own tables and keys]
• Homology Data: SEQUENCE, SEQUENCE_DATABASE, ORGANISM, MAP_POSITION, SCORE, ALIGNMENT, ALGORITHM, PARAMETER_SET
• Gene Expression: GE_RESULTS, QUALIFIER, CHIP, RNA_SOURCE, GENOTYPE, CELL_LINE, TISSUE, TREATMENT, DISEASE, ANALYSIS, PARAMETER_SET
• SNP Data: ALLELE, SNP_FREQUENCY, SNP_POPULATION, SNP_METHOD, STS_SOURCE, PCR_PROTOCOL, PCR_BUFFER

The Warehouse
Three star schemas of heterogeneous data joined through a conformed dimension:
[Diagram]
• Gene Expression: GENE_EXPRESSION_RESULT fact table (RNA_Source_Key_Exp, RNA_Source_Key_Bas, Sequence_Key, Parameter_Set_Key, Expression_Level_Exp, Expression_Level_Bas, Absent_Present_Exp, Absent_Present_Bas, Analysis_Decision, Chip_Type, Fold_Change) with RNA_SOURCE (Treatment, Disease, Tissue, Cell_Line, Genotype, Species) and GENECHIP_PARAMETER_SET dimensions
• Homology Data: SEQUENCE_HOMOLOGY_RESULT fact table (Query_Sequence_Key, Target_Sequence_Key, Parameter_Set_Key, Database_Key, Score, P_Value, Alignment, Percent_Homology) with HOMOLOGY_PARAMETER_SET (Algorithm_Name) and HOMOLOGY_DATABASE (Seq_DB_Name, Species, Last_Updated) dimensions
• SNP Data: SNP_RESULT fact table (Sequence_Key, STS_Source_Key, STS_Protocol_Key, Allele_Frequency, Sample_Size, Allele_Name, Base_Change, Disease_Linkage_Distance) with STS_SOURCE and STS_PROTOCOL (PCR_Protocol, PCR_Buffer) dimensions
• All three fact tables share the conformed SEQUENCE dimension (Sequence_Key, Sequence, Seq_Type, Seq_ID, Seq_Database, Map_Position, Species, Gene_Name, Description, Qualifier)
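
The payoff of the conformed SEQUENCE dimension is that facts from different marts can be combined with ordinary joins. A hedged sketch reusing the table names from the diagram; the numeric thresholds are illustrative assumptions.

-- Hypothetical drill-across query: expression changes for sequences whose SNPs
-- are closely linked to disease, joined through the conformed SEQUENCE dimension.
SELECT s.gene_name, e.fold_change, p.disease_linkage_distance
FROM sequence s, gene_expression_result e, snp_result p
WHERE e.sequence_key = s.sequence_key
  AND p.sequence_key = s.sequence_key
  AND e.fold_change >= 3
  AND p.disease_linkage_distance <= 5;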