Integrated Microbial Genomes (IMG) System: A Case Study in Biological Data Management

Victor M. Markowitz, Frank Korzeniewski, Krishna Palaniappan, Ernest Szeto
Biological Data Management & Technology Center, Lawrence Berkeley National Lab

Nikos C. Kyrpides, Natalia N. Ivanova
Microbial Genome Analysis Program, Joint Genome Institute

Different views on biological data management (VLDB 2004 Panel on Biological Data Management)
• Computer Scientists
  o Source of problems for database research
  o Publication in database papers
  o Prototypes
• Biologists
  o Vehicle for rapid data analysis
  o Publication in biology papers
  o Immediate solutions

Biological Data Management Problem
Effective data analysis involves combining data from multiple sources in the context of inherently imprecise data
• Single data type
• Multiple data types
Data generation & collection; data association

Background: Microbial Genomes
• Jan 04: 532 microbial genome projects
• Mar 05: 847 microbial genome projects
Applications: healthcare, environmental cleanup, agriculture, industrial processes, alternative energy production

Microbial Genome Data Analysis Context

Data Analysis Example: Occurrence Profiles
[Figure: genes x1 to x4 of Genome X and y1 to y4 of Genome Y mapped to the reactions (R) and enzymes (e1 to e4) of a pathway, with some assignments marked "?"]
• Functionally related genes tend to cluster on the chromosome
• Proteins from the same cellular pathway are expected to co-occur in the majority of organisms from a phylogenetic branch
Key Challenges
  o Representing abstract concepts with experimental data
  o Specifying individual and composite operations
  o Data coherence, completeness, integration

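The co-occurrence expectation on this slide can be illustrated with a small comparison of occurrence profiles. This is a minimal sketch, not the IMG implementation: the family names, the genome sets, the Jaccard measure, and the 0.8 threshold are all assumptions chosen for illustration.

```python
from typing import Dict, FrozenSet, List

# Hypothetical toy data: the set of genomes in which each protein family occurs.
OCCURRENCE: Dict[str, FrozenSet[str]] = {
    "famA": frozenset({"G1", "G2", "G4", "G5"}),
    "famB": frozenset({"G1", "G2", "G4", "G5"}),
    "famC": frozenset({"G1", "G3"}),
}


def jaccard(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Genomes containing both families, divided by genomes containing either."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def co_occurring(query: str,
                 occurrence: Dict[str, FrozenSet[str]],
                 threshold: float = 0.8) -> List[str]:
    """Return families whose occurrence profile is close to the query's profile."""
    target = occurrence[query]
    return sorted(
        name
        for name, genomes in occurrence.items()
        if name != query and jaccard(target, genomes) >= threshold
    )


print(co_occurring("famA", OCCURRENCE))  # ['famB']
```

Families with near-identical profiles are only candidates for belonging to the same pathway. Because the underlying gene and function predictions are imprecise, the threshold would need tuning against curated data.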
Microbial Genomes: Data Generation & Collection
• Process
  o Raw data
    • Small DNA sequence fragments
    • Assembled sequence fragments (contigs)
    • Complete (one contiguous) sequence
  o Interpreted data
    • Gene prediction (models)
    • Functional prediction (annotations)
    • Expert data validation (cleaning)
    • Expert annotations
  Data processing & refinement
• Evolution & diversity of
  o Technology platforms
  o Algorithms & parameters
  o Experimental, data collection conditions
Key Challenges
  o Diversity of data sources
    • Differences in models, depth/breadth of annotations
  o Consistency of the data transformation process

Data Transformation Process Example
[Figure: flowchart in three stages: Microbial Genome Annotation Pipeline (ORNL); Microbial Genome Annotation Review & Correction (JGI); IMG Loading. Box labels: Sequence Data Files, Fetch, ORF Calling, Preliminary Functional Annotation, Post Annotation Data Files, Replace, Download Data For Review, Report, Download, IMG Load, Final Review & Lock, Revised Annotation Data Files, Data Cleansing, Data Review, Annotation Data Files, Reference Genes, NR, IMG]

Microbial Genomes: Data Association
[Figure: associations among Organisms, Predicted Genes, Functions]
Key Challenges
  o Data quality/precision for different types of data, sources
  o Transience of identifiers, relationships

Biological Data Management Problem Revisited
Effective data analysis involves combining data from multiple sources in the context of inherently imprecise data while addressing
• Data quality
  – Data semantics, precision, integrity, provenance
• System quality
  – Comprehensibility, performance, reliability, scalability
• Development strategy (challenging in academic settings)
  – Choice of technologies
  – Devising (cost, time) effective solutions

Needed: System Development Framework
[Figure: staged system development process under time/cost constraints. Labels: Stages, Requirements Analysis, Requirement Examples, Use Scenarios, Case Studies, Requirements Specification, Data Model Abstraction, Definitions Document, Design & Planning, Plans & Schedules, Prototype Database, Tools, Docs, Develop System*, Program, Test, Preliminary Release, Development Documents, Revise & Refine, Deploy System, Final Release. Footnote: * System Development]

Requirement Analysis Example: IMG Data Analysis
Find “unique” genes in a genome of interest Ψ0 with respect to related genomes Ψ1, …, Ψk
Iterate:
• Query construction
• Query results
• Collect genes of interest
• “Similar” gene analysis
• Chromosomal neighborhood analysis

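One way to read the "unique" gene requirement is as a filter over precomputed similarity results. The sketch below assumes a hypothetical `hits` mapping from each gene of Ψ0 to the set of genomes in which a "similar" gene was found. How similarity is computed, and with which cutoffs, is left open here, as it is on the slide.

```python
from typing import Dict, Iterable, List, Set


def unique_genes(genes_of_interest: Iterable[str],
                 hits: Dict[str, Set[str]],
                 related_genomes: Iterable[str]) -> List[str]:
    """Return genes that have no similar gene in any of the related genomes.

    `hits` is a hypothetical, precomputed mapping: gene id -> genomes in which
    a similar gene was found. Genes missing from `hits` count as having no hits.
    """
    related = set(related_genomes)
    return sorted(
        gene for gene in genes_of_interest
        if not (hits.get(gene, set()) & related)
    )


# Toy example: genome "Psi0" has genes a, b, c; "Psi1" and "Psi2" are related genomes.
toy_hits = {
    "a": {"Psi1"},          # has a similar gene in Psi1, so not unique
    "b": {"Other"},         # only similar to a gene in an unrelated genome
    # "c" has no hits at all
}
print(unique_genes(["a", "b", "c"], toy_hits, ["Psi1", "Psi2"]))  # ['b', 'c']
```

The resulting list corresponds to the "collect genes of interest" step. Each collected gene would then go through similar-gene and chromosomal neighborhood analysis, and the query would be refined on the next iteration.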
Data Model Abstraction
• Motivation
  o Adds precision
  o Allows reasoning in an established framework
    • Analogies to traditional data domain
• Biological data modeling
  o Data warehouse concepts
    • Proven technology for large scale biological data management applications
  o Data structure
    • Multidimensional data space
      – Gene, genome, function/pathway
  o Operations
    • Multidimensional space selections, projections, aggregations
      – Slice & dice, roll up, drill down… analogies

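The warehouse analogy can be made concrete with a toy fact table over the three dimensions named above. This is a minimal sketch under assumptions: the `Fact` type, the sample rows, and the `slice_by` and `roll_up` helpers are hypothetical and do not describe the IMG schema.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class Fact(NamedTuple):
    """One cell of the (gene, genome, pathway) data space. Hypothetical layout."""
    gene: str
    genome: str
    pathway: str


FACTS: List[Fact] = [
    Fact("g1", "G1", "P1"),
    Fact("g2", "G1", "P1"),
    Fact("g3", "G1", "P2"),
    Fact("g4", "G2", "P1"),
]


def slice_by(facts: List[Fact], **fixed: str) -> List[Fact]:
    """Slice: keep only facts whose named dimensions equal the given values."""
    return [f for f in facts
            if all(getattr(f, dim) == value for dim, value in fixed.items())]


def roll_up(facts: List[Fact], *dims: str) -> Dict[Tuple[str, ...], int]:
    """Roll up: count facts per combination of the chosen dimensions."""
    counts: Dict[Tuple[str, ...], int] = defaultdict(int)
    for f in facts:
        counts[tuple(getattr(f, d) for d in dims)] += 1
    return dict(counts)


# Genes per (genome, pathway), then the coarser genes-per-genome view.
print(roll_up(FACTS, "genome", "pathway"))
print(roll_up(FACTS, "genome"))
# Drill into one genome with a slice.
print(slice_by(FACTS, genome="G1"))
```

Moving from the per-genome counts back to the (genome, pathway) counts plays the role of drill down. Fixing several dimensions in one `slice_by` call corresponds to dicing.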
Data Model Abstraction Example: IMG Operations
[Figure: data space with dimensions Genes, Genomes (G1 to G5), Functions/Pathways; occurrence rows as extracted: g1: + + +; g2: + + - + +; g3: + - -]
• Gene occurrence profiles across pathways
• Gene occurrence profile across genomes: “in” G1, “in” G2, “not in” G3, “in” G4, “in” G5
• Pathways shared by genomes

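The "in"/"not in" profile on this slide can be read as a selection over the gene and genome dimensions. The sketch below is illustrative only: the `PROFILES` data is hypothetical (the slide's rows for g1 and g3 are incomplete, so the values here are invented), and `match_profile` is an assumed helper name.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical occurrence data: the genomes in which each gene (or its homolog) occurs.
PROFILES: Dict[str, Set[str]] = {
    "g1": {"G1", "G2", "G3", "G4", "G5"},
    "g2": {"G1", "G2", "G4", "G5"},
    "g3": {"G1"},
}


def match_profile(profiles: Dict[str, Set[str]],
                  present: Iterable[str],
                  absent: Iterable[str]) -> List[str]:
    """Select genes found in every `present` genome and in no `absent` genome."""
    must_have = set(present)
    must_lack = set(absent)
    return sorted(
        gene for gene, genomes in profiles.items()
        if must_have <= genomes and not (must_lack & genomes)
    )


# "in" G1, "in" G2, "not in" G3, "in" G4, "in" G5
print(match_profile(PROFILES, ["G1", "G2", "G4", "G5"], ["G3"]))  # ['g2']
```

Genomes listed in neither argument are left unconstrained, so the same function also covers partial profiles.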
Data Analysis Example: Searching for Unique Genes
[Screenshot annotations]
• Parasite in horses
• Causes human disease in tropical areas (melioidosis)

Identifying Unique Genes of Interest
Genes involved in adherence and invasion

Exploring Unique Gene Details

Summary
• Needed: effective solutions for academic biological data management
  o Employing appropriate technologies and methods
  o Developed within (time, cost) constraints
• IMG Case Study
  o System development process framework essential for
    • Continuously evolving content, aiming at coherence and completeness
    • Developing meaningful data analysis tools
    • Clarity of methods, parameters, results
  o Metric for success
    • Community adoption and support
    • Increase in analysis productivity and value
- Slides: 17