Data integration architectures and methodologies for the Life Sciences Alexandra Poulovassilis, Birkbeck, U. of London October 2007
Outline of the talk
§ The problem and challenges faced
§ Historical background
§ Main Data Integration approaches in the Life Sciences
§ Our work
§ Materialised and Virtual DI
§ Future directions
  • ISPIDER Project
  • Bioinformatics service reconciliation
1. The Problem
§ Given a set of biological data sources, data integration (DI) is the process of creating an integrated resource which
  • combines data from each of the data sources
  • in order to support new queries and analyses
§ Biological data sources are characterised by their high degree of heterogeneity, in terms of:
  • data model, query interfaces, query processing capabilities, database schema or data exchange format, data types used, nomenclature adopted
§ Coupled with the variety, complexity and large volumes of biological data, this poses significant challenges, and has led to a range of methodologies, architectures and systems being developed
Challenges faced
§ Increasingly large volumes of complex, highly varying biological data are being made available
§ Data sources are developed by different people, in differing research environments, for differing purposes
§ Integrating them to meet the needs of new users and applications requires reconciliation of their heterogeneity w.r.t. content, data representation/exchange and querying
§ Data sources may freely change their format and content without considering the impact on any integrated resources derived from them
§ Integrated resources may themselves become data sources for higher-level integrations, resulting in a network of dependencies
Biological data: Genes, Proteins, Biological Function
§ Genome: DNA sequences of 4 bases (A, C, G, T); the permanent copy
§ RNA: a temporary copy of a DNA sequence
§ Protein: a sequence of 20 amino acids, the gene product (each triple of RNA bases encodes an amino acid)
§ Proteins carry out biological processes and functions
This slide is adapted from Nigel Martin's Lecture Notes on Bioinformatics
Varieties of Biological Data
§ genomic data
§ gene expression data (DNA to proteins) and gene function data
§ protein structure and function data
§ regulatory pathway data: how gene expression is regulated by proteins
§ cluster data: similarity-based clustering of genes or proteins
§ proteomics data: from experiments on separating proteins produced by organisms into peptides, and protein identification
§ phylogenetic data: evolution of genomic, protein, function data
§ data on genomic variations in species
§ semi-structured/unstructured data: medical abstracts
Some Key Application Areas for DI
§ Integrating, analysing and annotating genomic data
§ Predicting the functional role of genes and integrating function-specific information
§ Integrating organism-specific information
§ Integrating protein structure and pathway data with gene expression data, to support functional genomics analysis
§ Integrating, analysing and annotating proteomics data sources
§ Integrating phylogenetic data sources for genealogy research
§ Integrating data on genomic variations to analyse health impact
§ Integrating genomic, proteomic and clinical data for personalised medicine
2. Historical Background
§ One possible approach would be to encode the transformation/integration functionality in the application programs
§ However, this may be a complex and lengthy process, and may affect robustness, maintainability and extensibility
§ This has motivated the development of generic architectures and methodologies for DI, which abstract this functionality out of application programs into generic DI software
§ Much work has been done since the 1990s specifically in biological DI
§ Many systems have been developed, e.g. DiscoveryLink, Kleisli, TAMBIS, BioMart, SRS, Entrez, that aim to address some of the challenges faced
3. Main DI Approaches in the Life Sciences
§ Materialised
  • import data into a data warehouse (DW)
  • transform & aggregate the imported data
  • query the DW via the DBMS
§ Virtual
  • specify the integrated schema (IS)
  • "wrap" the data sources, using wrapper software
  • construct mappings between the data sources and the IS, using mediator software
  • query the integrated schema
  • the mediator software coordinates query evaluation, using the mappings and wrappers
Main DI Approaches in the Life Sciences
§ Link-based
  • no integrated schema
  • users submit simple queries to the integration software, e.g. via a web-based user interface
  • queries are formulated w.r.t. the data sources, as selected by the user
  • the integration software provides additional capabilities for:
    • facilitating query formulation, e.g. cross-references are maintained between different data sources and used to augment query results with links to other related data
    • speeding up query evaluation, e.g. indexes are maintained supporting efficient keyword-based search
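The link-based capabilities just described (cross-references between sources and a keyword index over them) can be sketched minimally as follows; all sources, records and identifiers here are invented for illustration:

```python
# Minimal sketch of link-based integration: no integrated schema; the
# integration layer keeps a keyword index and cross-references between
# sources, and augments query results with links to related entries.
# All records and identifiers here are hypothetical.

from collections import defaultdict

sources = {
    "proteins": {"P1": "kinase involved in signal transduction"},
    "genes":    {"G1": "gene encoding a kinase"},
}

# Cross-references maintained between the data sources.
xrefs = {("proteins", "P1"): [("genes", "G1")]}

# Keyword index over all sources, to speed up query evaluation.
index = defaultdict(list)
for src, records in sources.items():
    for rid, text in records.items():
        for word in text.split():
            index[word].append((src, rid))

def search(keyword):
    """Return matching entries, augmented with links to related data."""
    hits = index.get(keyword, [])
    return [(src, rid, xrefs.get((src, rid), [])) for src, rid in hits]
```

A query like `search("kinase")` then returns hits from both sources, each hit carrying any cross-reference links the integration layer knows about.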
4. Comparing the Main Approaches
§ Link-based integration is fine if its functionality meets users' needs
§ Otherwise materialised or virtual DI is indicated:
  • both allow the integrated resource to be queried as though it were a single data source; users/applications do not need to be aware of source schemas/formats/content
§ Materialised DI is generally adopted for:
  • better query performance
  • greater ease of data cleaning and annotation
§ Virtual DI is generally adopted for:
  • lower cost of storing and maintaining the integrated resource
  • greater currency of the integrated resource
5. Our work: AutoMed
The AutoMed Project at Birkbeck and Imperial:
§ is developing tools for the semi-automatic integration of heterogeneous information sources
§ can handle both structured and semi-structured data
§ provides a unifying graph-based metamodel (HDM) for specifying higher-level modelling languages
§ provides a single framework for expressing data cleansing, transformation and integration logic
§ the AutoMed toolkit is currently being used for biological data integration and P2P data integration
AutoMed Architecture
[Diagram: components include Schema Transformation and Integration Tools; Schema and Transformation Repository; Wrappers; Global Query Processor/Optimiser; Model Definition Tool; Model Definitions Repository; Schema Matching Tools; other tools, e.g. GUI, schema evolution, DLT]
AutoMed Features
§ Schema transformations are automatically reversible:
  • addT/deleteT(c, q) by deleteT/addT(c, q)
  • extendT(c, Range q1 q2) by contractT(c, Range q1 q2)
  • renameT(c, n, n') by renameT(c, n', n)
§ Hence bi-directional transformation pathways (more generally, transformation networks) are defined between schemas
§ The queries within transformations allow automatic data and query translation
§ Schemas may be expressed in a variety of modelling languages
§ Schemas may or may not have a data source associated with them
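The reversibility rules above can be modelled very simply: each primitive step has a dual, so a pathway is inverted by reversing it and inverting each step. This is a minimal sketch with a hypothetical tuple encoding of transformation steps, not AutoMed's actual API:

```python
# Sketch of AutoMed-style reversible primitive transformations.
# A step is encoded as a tuple: ("add", construct, query),
# ("rename", construct, old_name, new_name), etc. (hypothetical encoding).

INVERSE = {"add": "delete", "delete": "add",
           "extend": "contract", "contract": "extend",
           "rename": "rename"}

def invert_step(step):
    kind = step[0]
    if kind == "rename":                 # rename(c, n, n') -> rename(c, n', n)
        _, c, n, n2 = step
        return ("rename", c, n2, n)
    return (INVERSE[kind],) + step[1:]   # swap add/delete, extend/contract

def invert_pathway(pathway):
    """Reverse the pathway and invert each step, giving the inverse pathway."""
    return [invert_step(s) for s in reversed(pathway)]
```

Inverting twice yields the original pathway, which is what makes BAV pathways bi-directional.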
AutoMed vs the Common Data Model approach
[Diagram]
6. Materialised DI
Some characteristics of Biological DI
§ prevalence of automated and manual annotation of data
  • prior to, during and after its integration
  • e.g. DAS (Distributed Annotation Service); GUS data warehouse
§ importance of being able to trace the provenance of data
  • annotation of data origin and data derivation
§ wide variety of nomenclatures adopted
  • greatly increases the difficulty of data aggregation
  • has led to many standardised ontologies and taxonomies
§ inconsistencies in identification of biological entities
  • has led to standardisation efforts, e.g. LSID
  • but a legacy of non-standard identifiers is still present
The BioMap Data Warehouse
§ A data warehouse integrating:
  • gene expression data
  • protein structure data, including:
    • data from the Macromolecular Structure Database (MSD) of the European Bioinformatics Institute (EBI)
    • CATH structural classification data
  • functional data, including:
    • Gene Ontology; KEGG
  • hierarchical clustering data, derived from the above
§ Aiming to support mining, analysis and visualisation of gene expression data
BioMap integration approach
[Diagram]
BioMap architecture
[Diagram: source data (Structure Data: MSD, CATH, …; Function Data: GO, KEGG, …; Cluster Data; Microarray Data: ArrayExpress) feeds the Structure, Function, Cluster, Microarray, Metadata and Search Tables (materialised views) and Data Marts, which are accessed by Search, Analysis, Mining and Visualisation Tools and the MEditor]
Using AutoMed in the BioMap Project
§ Wrapping of data sources and the DW
§ Automatic translation of source and global schemas into AutoMed's XML schema language (XMLDSS)
§ Domain experts provide matchings between constructs in source and global schemas: rename transformations
§ Automatic schema restructuring and generation of transformation pathways
§ Pathways could subsequently be used for maintenance and evolution of the DW; also for data lineage tracing
§ See the DILS'05 paper for details of the architecture and clustering approach
[Diagram: RDB and XML data sources are wrapped into AutoMed relational and XMLDSS schemas, with transformation pathways into an AutoMed integrated schema over the integrated database]
7. Virtual DI
§ The integrated schema may be defined in a standard data modelling language
§ Or, more broadly, it may be a source-independent ontology:
  • defined in an ontology language
  • serving as a "global" schema for multiple potential data sources, beyond the ones being integrated, e.g. as in TAMBIS
§ The integrated schema may or may not encompass all of the data in the data sources:
  • it may be sufficient to capture just the data needed for answering key user queries/analyses
  • this avoids the possibly complex and lengthy process of creating a complete integrated schema and set of mappings
Virtual DI Architecture
[Diagram: wrappers around the data sources; Schema Integration Tools; a Metadata Repository holding data source schemas, integrated schemas and mappings; a Global Query Processor and Global Query Optimiser]
Degree of Data Source Overlap
§ different systems make different assumptions about this
§ some systems assume that each DS contributes a different part of the integrated virtual resource, e.g. K2/Kleisli
§ some systems relax this, but do not attempt any aggregation of duplicate or overlapping data from the DSs, e.g. TAMBIS
§ some systems support aggregation at both schema and data levels, e.g. DiscoveryLink, AutoMed
§ the degree of data source overlap impacts on the complexity of the mappings and the design effort involved in specifying them
§ the complexity of the mappings in turn impacts on the sophistication of the global query optimisation and evaluation mechanisms that will be needed
Virtual DI methodologies
§ Top-down
  • the integrated schema IS is first constructed
    • or it may already exist from previous integration or standardisation efforts
  • the set of mappings M is then defined between the IS and the DS schemas
Virtual DI methodologies
§ Bottom-up
  • an initial version of IS and M is constructed, e.g. from one DS
  • these are incrementally extended/refined by considering each of the other DSs in turn
  • for each object O in each DS, M is modified to encompass the mapping between O and IS, if possible
  • if not, IS is extended as necessary to encompass the information represented by O, and M is then modified accordingly
Virtual DI methodologies
§ Mixed Top-down and Bottom-up
  • an initial IS may exist
  • an initial set of mappings M is specified
  • IS and M may need to be extended/refined by considering additional data from the DSs that IS needs to capture
  • for each object O in each DS that IS needs to capture, M is modified to encompass the mapping between O and IS, if possible
  • if not, IS is extended as necessary to encompass the information represented by O, and M is then modified accordingly
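The incremental loop of the bottom-up and mixed methodologies can be sketched as follows; `integrate_bottom_up` and `try_map` are hypothetical names, and `try_map` stands in for the matching step that in practice involves domain knowledge:

```python
# Hedged sketch of the bottom-up loop: for each source object O, try to
# map it to the integrated schema IS; if no mapping is possible, extend
# IS first and then record the mapping.

def integrate_bottom_up(sources, initial_is, try_map):
    IS = set(initial_is)
    M = {}                         # mappings: source object -> IS object
    for ds in sources:
        for obj in ds:
            target = try_map(obj, IS)
            if target is None:     # IS cannot yet express obj:
                IS.add(obj)        # extend IS with a new object,
                target = obj       # then map obj to itself
            M[obj] = target
    return IS, M
```

The same loop covers the mixed methodology: the difference lies only in where the initial IS and mappings come from.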
Defining Mappings
§ Global-as-view (GAV)
  • each schema object in the IS is defined by a view over the DSs
  • simple global query reformulation, by query unfolding
  • view evolution problems if the DSs change
§ Local-as-view (LAV)
  • each schema object in a DS is defined by a view over the IS
  • harder global query reformulation, using answering-queries-using-views techniques
  • evolution problems if the IS changes
§ Global-local-as-view (GLAV)
  • views relate multiple schema objects in a DS with the IS
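A toy illustration of GAV query unfolding, with invented relations and data: each integrated-schema relation is defined as a view over the sources, and a global query is answered simply by substituting and evaluating the view definition:

```python
# Hedged sketch of GAV mappings: hypothetical source relations, keyed by
# protein identifier, and view definitions expressed as functions.

source_data = {
    "swissprot": [("P1", "kinase")],
    "pdb":       [("P1", "1ABC")],
}

def _join(a, b):
    # natural join on the first field (the protein identifier)
    return [(ida, x, y) for (ida, x) in a for (idb, y) in b if ida == idb]

# GAV: each integrated-schema relation is a view over the data sources.
gav_views = {
    "protein_function":  lambda db: db["swissprot"],
    "protein_structure": lambda db: db["pdb"],
    # IS relation joining function and structure data on protein id:
    "protein":           lambda db: _join(db["swissprot"], db["pdb"]),
}

def answer(global_relation):
    """Query unfolding: substitute the view definition, evaluate over sources."""
    return gav_views[global_relation](source_data)
```

LAV reverses the direction of the view definitions, which is why reformulation there needs the harder answering-queries-using-views machinery rather than plain unfolding.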
Both-As-View approach supported by AutoMed
§ not based on views between integrated and source schemas
§ instead, provides a set of primitive schema transformations, each adding, deleting or renaming just one schema object
§ relationships between source and integrated schema objects are thus represented by a pathway of primitive transformations
§ add, extend, delete and contract transformations are accompanied by a query defining the new/deleted object in terms of the other schema objects
§ from the pathways and queries it is possible to derive GAV, LAV and GLAV mappings
§ currently AutoMed supports GAV, LAV and combined GAV+LAV query processing
Typical BAV Integration Network
[Diagram: data source schemas DS1, …, DSn are transformed into union-compatible schemas US1, …, USn, which are linked to each other by id transformations and to the global schema GS]
Typical BAV Integration Network (cont'd)
§ On the previous slide:
  • GS is a global schema
  • DS1, …, DSn are data source schemas
  • US1, …, USn are union-compatible schemas
  • the transformation pathway between each pair DSi and USi may consist of add, delete, rename, extend and contract primitive transformations, operating on any modelling construct defined in the AutoMed Model Definitions Repository
  • the transformation pathway between USi and GS is similar
  • the transformation pathway between each pair of union-compatible schemas consists of id transformation steps
8. Schema Evolution
§ In biological DI, data sources may evolve their schemas to meet the needs of new experimental techniques or applications
§ Global schemas may similarly need to evolve to encompass new requirements
§ Supporting schema evolution in materialised DI is costly: it requires modifying the ETL and view materialisation processes, plus the processes maintaining any derived data marts
§ With view-based virtual DI approaches, the sets of views that may be affected need to be examined and redefined
Schema Evolution in BAV
§ BAV supports the evolution of both data source and global schemas
§ The evolution of any schema is specified by a transformation pathway from the old to the new schema
§ For example, the figure on the right shows transformation pathways, T, from an old global or data source schema S to a new schema S'
[Diagram: pathways T from Global Schema S to New Global Schema S', and from Data Source Schema S to New Data Source Schema S']
Global Schema Evolution
§ Each transformation step t in T: S → S' is considered in turn:
  • if t is an add, delete or rename, then schema equivalence is preserved and there is nothing further to do (except perhaps optimise the extended transformation pathway, using an AutoMed tool that does this); the extended pathway can be used to regenerate the necessary GAV or LAV views
  • if t is a contract, then there will be information present in S that is no longer available in S'; again, there is nothing further to do
  • if t is an extend, then domain knowledge is required to determine if, and how, the new construct in S' could be derived from existing constructs; if not, there is nothing further to do; if so, the extend step is replaced by an add step
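The case analysis above can be sketched as a small dispatch loop; `derive` is a hypothetical stand-in for the domain-knowledge check of whether an extended construct is derivable from existing ones:

```python
# Sketch of processing a global-schema evolution pathway T: S -> S'.
# Steps are hypothetical tuples; derive(construct) returns a defining
# query if the new construct is derivable, else None.

def process_evolution(pathway, derive):
    result = []
    for step in pathway:
        kind = step[0]
        if kind in ("add", "delete", "rename", "contract"):
            result.append(step)           # nothing further to do
        elif kind == "extend":
            query = derive(step[1])       # domain knowledge required
            if query is not None:         # derivable: replace by an add
                result.append(("add", step[1], query))
            else:
                result.append(step)       # not derivable: keep as extend
    return result
```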
Local Schema Evolution
§ This is a bit more complicated, as it may also require changes to be propagated to the global schema(s)
§ Again, each transformation step t in T: S → S' is considered in turn
§ If t is an add, delete, rename or contract step, the evolution can be carried out automatically
§ If it is an extend, then domain knowledge is required
§ See our CAiSE'02, ICDE'03 and ER'04 papers for more details
§ The last of these discusses a materialised DI scenario where the old/new global/source schemas have an extent
§ We are currently implementing this functionality within the AutoMed toolkit
9. Some Future Directions in Biological DI
§ Automatic or semi-automatic identification of correspondences between sources, or between sources and global schemas, e.g.:
  • name-based and structural comparisons of schema elements
  • instance-based matching at the data level
  • annotation of data sources with terms from ontologies, to facilitate automated reasoning
§ Capturing incomplete and uncertain information about the data sources within the integrated resource, e.g. using probabilistic or logic-based representations and reasoning
§ Automating information extraction from textual sources using grammar and rule-based approaches; integrating this with other related structured or semi-structured data
9.1 Harnessing Grid Technologies - ISPIDER
§ ISPIDER Project Partners: Birkbeck, EBI, Manchester, UCL
§ Motivation:
  • large volumes of heterogeneous proteomics data
  • need for interoperability
  • need for efficient processing
§ Aims: development of a Proteomics Grid Infrastructure; use of existing proteomics resources and development of new ones; development of new proteomics clients for querying, visualisation, workflow etc.
Project Aims
[Diagram]
myGrid / DQP / AutoMed
§ myGrid: a collection of services/components allowing high-level integration, via workflows, of data and applications
§ DQP:
  • uses OGSA-DAI (Open Grid Services Architecture Data Access and Integration) to access data sources
  • provides distributed query processing over OGSA-DAI enabled resources
§ Current research: AutoMed-DQP and AutoMed-myGrid workflow interoperation
  • see the DILS'06 and DILS'07 papers, respectively
AutoMed-DQP Interoperability
§ Data sources wrapped with OGSA-DAI
§ AutoMed-DAI wrappers extract the data sources' metadata
§ Semantic integration of data sources using AutoMed transformation pathways into an integrated AutoMed schema
§ IQL queries submitted to this integrated schema are:
  • reformulated to IQL queries on the data sources, using the AutoMed transformation pathways
  • submitted to DQP for evaluation, via the AutoMed-DQP Wrapper
9.2 Bioinformatics Service Reconciliation
§ A plethora of bioinformatics services is being made available
§ Semantically compatible services are often not able to interoperate automatically in workflows, due to:
  • different service technologies
  • differences in data model, data modelling, data types
§ Hence the need for service reconciliation
Previous Approaches
§ Shims: myGrid uses shims, i.e. services that act as intermediaries between specific pairs of services and reconcile their inputs and outputs
§ Bowers & Ludäscher (DILS'04) use 1-1 path correspondences to one or more ontologies for reconciling services; their sample implementation uses mappings to a single ontology and generates an XQuery query as the transformation program
§ Thakkar et al. use a mediator system, like us, but for service integration, i.e. for providing services that integrate other services, not for reconciling semantically compatible services that need to form a pipeline within a workflow
Our approach
§ XML as the common representation format
§ Assume availability of format converters to convert to/from XML, if the output/input of a service is not XML
Our approach
§ XMLDSS as the schema type
§ We use our XMLDSS schema type as the common schema type for XML:
  • can be automatically derived from a DTD/XML Schema, if available
  • or can be automatically extracted from an XML document
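The second option, extracting a schema from a document instance, can be sketched as follows; the output format here (a map from element names to their child elements and attributes) is a simplification for illustration, not AutoMed's actual XMLDSS representation:

```python
# Sketch: extract an XMLDSS-like structural summary (element tree plus
# attributes) from an XML document instance.

import xml.etree.ElementTree as ET

def extract_schema(xml_text):
    """Return {element: (sorted child elements, sorted attributes)}."""
    schema = {}
    def walk(elem):
        children, attrs = schema.setdefault(elem.tag, (set(), set()))
        attrs.update(elem.attrib)         # record attributes seen on this tag
        for child in elem:
            children.add(child.tag)       # record child element names
            walk(child)
    walk(ET.fromstring(xml_text))
    return {tag: (sorted(c), sorted(a)) for tag, (c, a) in schema.items()}
```

Repeated elements collapse to a single schema entry, which is the behaviour one would expect of a structural summary.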
Our approach
§ Correspondences to an ontology
§ Set of GLAV correspondences between each XMLDSS schema and a typed ontology:
  • an element maps to a concept/path in the ontology
  • an attribute maps to a literal-valued property/path
  • there may be multiple correspondences for elements/attributes in the ontology
Our approach
§ Schema and data transformation
§ A pathway is generated to transform X1 to X2:
  • the correspondences are used to create X1' and X2'
  • the XMLDSS restructuring algorithm creates the pathway X1' → X2'
  • hence the overall pathway X1 → X2
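The overall pathway is then just the composition of the segments X1 → X1', X1' → X2' and X2' → X2; a minimal sketch (names hypothetical), modelling each segment as a function on the data:

```python
# Sketch: a transformation pathway as the composition of its segments,
# applied left to right to the data flowing between two services.

def compose(*steps):
    def pathway(data):
        for step in steps:
            data = step(data)
        return data
    return pathway
```

For example, `compose(to_x1_prime, restructure, to_x2)` would yield a single function transforming the output of one service into the input of the next.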
Architecture
§ A workflow tool could use our approach either dynamically or statically:
§ Mediation service
  • the workflow tool invokes service S1 and receives its output
  • the workflow tool submits the output of S1, the schema of S2 and the two sets of correspondences to an AutoMed service
  • the AutoMed service transforms the output of S1 to a suitable input for consumption by S2
§ Shim generation
  • AutoMed is used to generate a shim for services S1 and S2
  • the XMLDSS schema transformation algorithm, currently tightly coupled with AutoMed functionality, can be exported as a single XQuery query able to materialise the input of S2 from the data output by S1
10. Conclusions
§ Integrating biological data sources is hard!
§ The overarching motivation is the potential to make scientific discoveries that can improve quality of life
§ The technical challenges faced can lead to new, more generally applicable, DI techniques
§ Thus, biological data integration continues to be a rich field for multi- and interdisciplinary research between clinicians, biologists, bioinformaticians and computer scientists