Micro B3 Information System: Bringing sequence data into environmental context. Microbial Genomics and Bioinformatics Research Group. Renzo Kottmann, rkottman@mpi-bremen.de, @renzokott. Hinxton, 2014-03-27
Ecosystem Perspective
Data Perspective. Omics Data: genomes, metagenomes, transcriptomes, marker genes, proteomes. Environmental Data: latitude, longitude, collection date, depth, water currents, temperature.
Data Perspective. Omics Data: genomes, metagenomes, transcriptomes, marker genes, proteomes. Environmental Data: latitude, longitude, collection date, depth, water currents, temperature. Result: Relationship.
Data Flow Perspective. Diagram labels: Study, Field, Laboratory, Computing, Archival, Integration, Web Access, Knowledge. Omics Data: genomes, metagenomes, transcriptomes, marker genes, proteomes. Environmental Data: latitude, longitude, collection date, depth, water currents, temperature. Result: Relationship.
Data Flow Perspective: Issues. Diagram labels: Study, Field, Laboratory, Computing, Archival, Integration, Web Access, Knowledge. Omics Data: genomes, metagenomes, transcriptomes, marker genes, proteomes. Environmental Data: latitude, longitude, collection date, depth, water currents, temperature. Issues: Quantity, Heterogeneity, Complexity.
Data Integration. Diagram labels: Study, Field, Laboratory, Computing, Archival, Integration, Web Access, Knowledge. Omics Data: genomes, metagenomes, transcriptomes, marker genes, proteomes. Environmental Data: latitude, longitude, collection date, depth, water currents, temperature. Integration + Analysis. Result: Relationship.
Data Integration: Geo-referencing. Diagram labels: Study, Field, Laboratory, Computing, Archival, Integration, Web Access, Knowledge. Omics Data: genomes, metagenomes, transcriptomes, marker genes, proteomes. Environmental Data: water currents, temperature. Geo-referencing: x = longitude, y = latitude, z = depth, t = collection date. Integration + Analysis. Result: Relationship.
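The geo-referencing idea on this slide can be sketched in a few lines of Python: omics and environmental records are related through a shared (x, y, z, t) key. This is a minimal illustration, not the Micro B3 implementation. The class name `GeoReference`, the helper `join_on_georeference`, the depth unit, and all sample values are assumptions made up for the example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class GeoReference:
    """Hypothetical (x, y, z, t) key shared by omics and environmental records."""
    longitude: float  # x, decimal degrees, -180..180
    latitude: float   # y, decimal degrees, -90..90
    depth_m: float    # z, metres below the sea surface (assumed unit)
    collected: date   # t, collection date

    def __post_init__(self):
        # Reject coordinates that cannot exist before they enter the system.
        if not -180.0 <= self.longitude <= 180.0:
            raise ValueError(f"longitude out of range: {self.longitude}")
        if not -90.0 <= self.latitude <= 90.0:
            raise ValueError(f"latitude out of range: {self.latitude}")
        if self.depth_m < 0:
            raise ValueError(f"depth must be non-negative: {self.depth_m}")


def join_on_georeference(omics, environment):
    """Pair records from two dicts that share exactly the same GeoReference key."""
    shared = omics.keys() & environment.keys()
    return {key: (omics[key], environment[key]) for key in shared}


# Made-up sample identifier and measurement, for illustration only.
site = GeoReference(longitude=8.58, latitude=53.54, depth_m=2.0,
                    collected=date(2014, 6, 21))
omics = {site: "metagenome-001"}
environment = {site: {"temperature_c": 16.5}}
print(join_on_georeference(omics, environment))
```

Exact key matching is the simplest possible join. Real oceanographic data would more likely need matching within a tolerance in space and time, which this sketch leaves out.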
Micro B3: Biodiversity, Bioinformatics, Biotechnology. Diagram labels: Study, Field, Laboratory, Computing, Archival, Integration, Web Access, Knowledge.
Micro B3: Biodiversity, Bioinformatics, Biotechnology. Micro B3 Information System
Definition: Information System
• Information system: an integrated set of components for collecting, storing, and processing data and for delivering information, knowledge, and digital products. (http://www.britannica.com/EBchecked/topic/287895/information-system, last visit 2013-03-13)
Information System: Logic View. Collecting, storing, and processing data and delivering information. Modified from http://martinfowler.com/articles/bigData/
Information System: Process View. Modified from http://martinfowler.com/articles/bigData/
Information System: Process View – Data Convergence How to combine heterogeneous data? How to gain useful data? How to gather data? How to find relevant data?
Information System: Process View – Data Divergence How to enhance data? How to find relevant patterns? How to visualize and operationalize information for knowledge creation?
Information System: Science driven. Which data? How to process and analyze? What is the geographic and environmental distribution of my gene? How to visualize and operationalize information for knowledge creation? (Diagram labels: "+", Generate, Scientists, "= knowledge")
So why all that?
• To paraphrase Captain Kirk in the Star Trek episode “A Taste of Armageddon”: “Data is a messy business, a very, very messy business.”
• “… as much as 60 percent of the time I spend on data analysis is focused on preparing the data for analysis.” (R in Action: Data analysis and graphics with R, by Robert I. Kabacoff)
Gathering & Services
• Data Tracking: How to track the geographic and environmental origin of DNA sequence data?
• Data Services: How to analyze, visualize and interpret the sequence data in an environmental context?
Information System: Science driven. Which data? How to process and analyze? What is the geographic and environmental distribution of my gene?
• Data Tracking: OSD App, OSD Server
• Data Services: Workflows, EATME, ProX
(Diagram labels: "+", Generate, Scientists, "= knowledge")
Part I: Data Tracking. Generate, Harvest and Filter
Generate
Global Sampling Event, Fixed in Time, Orchestrated: June 21st 2014, www.oceansamplingday.org. Standardized Protocols. Contextual Data. Microbial Diversity & Function. Legal Framework: ABS, MTA, DTA.
Ocean Sampling Day
• Global
• Standardized
• Orchestrated
• Sampling event fixed in time
  - June 21st 2014, www.oceansamplingday.org
Information System: Process View (diagram labels: "+", Generate, Scientists, "= knowledge")
Harvest
Ocean Sampling Day App. Early, consistent, digital acquisition of environmental data. https://itunes.apple.com/us/app/osd-citizen/id834353532?mt=8 https://play.google.com/store/apps/details?id=com.iw.esa
Features
• Allows data to be recorded in the field
  - No internet connection needed
  - GSC standards compliant
Entering Data
OSD-App-Server
Login: Please use Twitter, Facebook, or Google
• Advantage
  - You do not need another password
  - We do not get your password
(Screenshot labels: "Out of order", "Just works")
Information System: Process View (diagram labels: "+", Generate, Scientists, "= knowledge")
Filter
Data Analysis in Micro B3. Frank Oliver
Frank Oliver
www.arb-silva.de/ngs
Information System: Process View (diagram labels: "+", Generate, Scientists, "= knowledge")
Integrate
Heterogeneity: Oceanographic Data
ELT
Database Development
• PostBIS (Hamburg University)
  - Efficient storage and retrieval of DNA sequence data
  - <2 bits per nucleotide base
  - 500x faster substring operation
• rasdaman (Jacobs University)
  - Store and retrieve multi-dimensional raster data of unlimited size
  - Enhancements to SQL interface
  - http://rasdaman.eecs.jacobsuniversity.de/trac/rasdaman
• PANGAEA (MARUM / University Bremen)
  - Lucene-based search index
Information System: Process View (diagram labels: "+", Generate, Scientists, "= knowledge")
Part II: Data Services. Augment, Analyze and Interpret (Act)
Augment
Information System: Process View (diagram labels: "+", Generate, Scientists, "= knowledge")
Analyse (ecologically)
FUNCTIONAL TRAIT-BASED ANALYSIS OF AQUATIC MICROBIAL COMMUNITIES
Functional Traits
A functional trait is a well-defined, measurable property of organisms that strongly influences performance.
• Direct link to ecosystem functioning
• Ecological tradeoffs
• What organisms do, and how many types are needed to maintain ecosystem functioning
Reiss et al. (2009)
Examples of Metagenomic Traits
• GC (Guanine-Cytosine) content (mean and variance)
  - Related to genome size, environmental complexity and community composition.
• Functional and phylogenetic diversity
  - Related to metabolic potential, community composition and environmental biogeochemistry.
• Dinucleotide frequency
  - Related to phylogenetic composition.
Explore community traits as ecological markers in microbial metagenomes (Barberan, Fernandez et al. 2012).
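Two of the traits listed above are simple enough to sketch directly. The Python below is an illustrative assumption of how GC content (mean and variance across reads) and dinucleotide frequency could be computed; it is not the code of the traits-analysis workflow. The function names are hypothetical, ambiguous bases are ignored, and the use of the population variance is a choice made for the example.

```python
from collections import Counter
from statistics import mean, pvariance


def gc_content(seq):
    """Fraction of G and C among the unambiguous bases (A, C, G, T) of one read."""
    seq = seq.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        raise ValueError("sequence contains no unambiguous bases")
    return (seq.count("G") + seq.count("C")) / counted


def gc_trait(reads):
    """Mean and population variance of GC content over a collection of reads."""
    values = [gc_content(read) for read in reads]
    return mean(values), pvariance(values)


def dinucleotide_frequencies(seq):
    """Relative frequency of each overlapping dinucleotide on the given strand."""
    seq = seq.upper()
    pairs = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    # Drop pairs that contain ambiguity codes such as N.
    pairs = {pair: n for pair, n in pairs.items() if set(pair) <= set("ACGT")}
    total = sum(pairs.values())
    return {pair: n / total for pair, n in pairs.items()} if total else {}


reads = ["GGCC", "AATT"]  # made-up reads for illustration
print(gc_trait(reads))
print(dinucleotide_frequencies("ACAC"))
```

Counting is done on the given strand only. Whether the original workflow also counts the reverse complement is not stated on the slide.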
The Metagenomic Trait Workflow(s)
• Upstream
  - Calculating traits (traits-analysis workflow)
• Downstream
  - Calculating statistics (traits-statistics workflow)
R scripts perform multivariate statistical analyses using the vegan package and plot the results using ggplot2.
What is a Workflow?
• Describes what you want to do, rather than how you want to do it
• Simple language specifies how processes fit together
(Example diagram labels: Sequence, RepeatMasker Web Service, GenScan Web Service, Blast Web Service, Predicted Genes out)
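The "what, not how" idea can be sketched as a workflow that is just an ordered list of steps, with a small generic runner passing data from one step to the next. This is a toy Python illustration and not Taverna's workflow language or API. The three step functions are hypothetical stand-ins named after the services in the diagram; they do not reproduce what RepeatMasker, GenScan or BLAST actually compute.

```python
from functools import reduce


def mask_repeats(sequence):
    """Toy stand-in: treat lowercase (soft-masked) bases as repeats, replace with N."""
    return "".join(base if base.isupper() else "N" for base in sequence)


def predict_genes(masked):
    """Toy stand-in: call every unmasked stretch of at least 3 bases a 'gene'."""
    return [part for part in masked.split("N") if len(part) >= 3]


def annotate(genes):
    """Toy stand-in: attach the length of each predicted gene as its annotation."""
    return {gene: len(gene) for gene in genes}


# The workflow states WHAT happens and in which order, nothing more.
WORKFLOW = [mask_repeats, predict_genes, annotate]


def run_workflow(steps, data):
    """Generic runner: feed the output of each step into the next one."""
    return reduce(lambda result, step: step(result), steps, data)


print(run_workflow(WORKFLOW, "ATGaaaCCCGt"))
```

Because the runner knows nothing about the individual steps, a step can be swapped for a remote service call without changing the workflow description.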
What is Taverna?
• Workflow management system
  - Sophisticated analysis pipelines
  - A set of services to analyse or manage data (either local or remote)
• Data flow through services
• Control of service invocation
Taverna Workflows
• Enhance
  - Interoperability
  - Integration
  - Collaboration
• Ease
  - Access to distributed and local resources
  - Automation of data flow
  - Provenance
• Function
  - Experimental protocols
Workflows can be good for…
• High-throughput analysis
  - Transcriptomics, proteomics, next-generation sequencing
• Data integration, data interoperation
• Data management
  - Model construction
  - Data format manipulation
  - Database population
Taverna Workbench
• Workflow engine to run workflows
• List of services
• Construct and visualise workflows
• Web Services (e.g. KEGG), Scripts (e.g. beanshell, R), Programming libraries (e.g. libSBML)
“Thanks to the workflow now everybody can do it.” http://portal.biovel.eu/ Antonio Fernàndez-Guerra
Discovery: knowns, known unknowns and unknowns
• Cluster 1800572: Unknown unknown
• SAR11_0487: Tryptophan synthase
• SAR11_1266: hypothetical protein
• SAR11_0686: hypothetical protein
• SAR11_1277: aspartate racemase
Pelagibacter ubique proteome centered subnetwork. Antonio Fernandez, submitted
Information System: Process View (diagram labels: "+", Generate, Scientists, "= knowledge")
Act Interpret
Complexity The real world is complex. Data reflects the real world and we have to deal with it.
Data Access: Software Services
Ecological Analysis Tools for Microbial Ecology (EATME)
Metagenomic Network Analysis. Enable the community of scientists to interact with the data.
• Cluster 1800572: Unknown unknown
• SAR11_0487: Tryptophan synthase
• SAR11_1266: hypothetical protein
• SAR11_0686: hypothetical protein
• SAR11_1277: aspartate racemase
Data Access: Visualization of unknown networks
ProX
• Master Thesis: Matthias Stock (Hochschule Bremen)
• Efficient web-based and large-scale visualization of networks
  - Outperforms state-of-the-art web tools
Information System: Process View. EATME (diagram labels: "+", Generate, Scientists, "= knowledge")
Information System: Process View. Which data? How to process and analyze? What is the geographic and environmental distribution of my gene? EATME
• Data Tracking: OSD App, OSD Server
• Data Services: Workflows, EATME, ProX
(Diagram labels: "+", Generate, Scientists, "= knowledge")
Take home messages
• Information Systems
  - Integrated set of tools
  - Keep the data flowing
  - Added value services
  - Cut down data preparation time and costs
Outro
Megx.net / Micro B3 is Open Source
• Subversion: https://projects.mpi-bremen.de/micro-b3/svn/
• Source Code Browser: https://colab.mpi-bremen.de/source/
• Wiki: https://colab.mpi-bremen.de/wiki
• Issue Tracker: https://colab.mpi-bremen.de/its/
Thanks for your attention. http://www.microb3.eu http://www.oceansamplingday.org http://twitter.com/Micro_B3 1st Marine Board Forum: Marine Data Challenges: from Observation to Information
IV. Proof of concept: Global Ocean Sampling Expedition metagenomes
• Knowns: PFAM annotation of 53 GOS sampling sites (7523471 reads); 5653491 reads could have a PFAM assigned (15528086 hits)
• Unknowns: 6-frame translation of 1869980 unknown reads (8884278 translated reads > 60 aa)
  - Hierarchical clustering: 90%: 7681220; 60%: 6689553
  - 5759646 singletons removed: 929907 unknowns
• 16S rDNA: 9190 16S rDNA (7119 @ 97%)
• PFAM: 6903 (13672); Unknowns: 9925 (929907); 16S rDNA: 347 (7119)
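The 6-frame translation step mentioned above can be sketched as follows. This is a minimal Python illustration assuming the standard genetic code; the tool and parameters actually used for the GOS reads are not given on the slide. The function names are hypothetical, and `long_peptides` is only one possible reading of the "> 60 aa" filter.

```python
# Standard genetic code (NCBI translation table 1), codons ordered over T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an unambiguous DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons; unknown codons become X, '*' marks a stop."""
    codons = (seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translation(seq):
    """Three forward frames followed by three reverse-complement frames."""
    seq = seq.upper()
    strands = (seq, reverse_complement(seq))
    return [translate(strand[offset:]) for strand in strands for offset in range(3)]


def long_peptides(seq, min_length=60):
    """Stop-free stretches longer than min_length residues, over all six frames."""
    return [
        peptide
        for frame in six_frame_translation(seq)
        for peptide in frame.split("*")
        if len(peptide) > min_length
    ]


print(six_frame_translation("ATGGCCTAA"))
```

Splitting each frame at stop codons treats every stop-free stretch as a candidate peptide; a stricter definition would also require a start codon.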
Network Analysis
• Graphical Gaussian Model
  - Co-occurrence of unknown and known genes
  - Techniques similar to Web 2.0 social network analysis
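In a Graphical Gaussian Model an edge reflects partial correlation, that is, the association between two variables that remains after the others are accounted for. The Python sketch below shows this for the three-variable case only, using the textbook first-order partial correlation formula. It is an illustration and not the analysis code behind the slide; the abundance values are made up so that two "genes" co-occur solely because both follow a third variable.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def partial_correlation(x, y, z):
    """Correlation of x and y after removing the linear effect of z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Made-up abundances across four samples: both genes track the driver z.
z = [1, 2, 3, 4]
known_gene = [3, 3, 5, 9]
unknown_gene = [1, 7, 3, 9]

print(pearson(known_gene, unknown_gene))                 # clearly positive
print(partial_correlation(known_gene, unknown_gene, z))  # approximately zero
```

With many genes the same quantity is obtained from the inverse of the covariance matrix, which needs a linear algebra library and usually regularisation; that part is left out here.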
OSGi framework
• Bundles (modules)
• Execution environment
• Application life cycle
• Services
  - Service registry
• Applications share the same JVM
  - Isolation/security
Components
• ~20 components
• >50 OSGi bundles
  - Should be divided into >100
Guiding basic ecological questions
• “Who is out there and where?”
  - In terms of sequenced genomes and key genes
  - In terms of gene profiles
• “What can they do?”
  - In terms of gene functions
• “Under which environmental conditions?”
Information system: an integrated set of components for collecting, storing, and processing data and for delivering information, knowledge, and digital products. (http://www.britannica.com/EBchecked/topic/287895/information-system, last visit 2013-03-13)
Megx.net: Data Portal for Microbial Ecological GenomiX
• Integrates geo-referenced data on
  - Bacterial, archaeal and phage genomes
  - Metagenomes
  - 16S rDNA based diversity data
• Offers web-based tools for visualization and analysis
• http://www.megx.net
Kottmann et al., NAR 2010
Who is out there and where? (in terms of sequenced genomes, metagenomes and key genes) Kottmann et al. NAR 2010
Micro B3 Information System
Contextual Data Flow – Mobile App
Exploring Ecosystems Biology. Diagram labels: Knowledge; x, y, z, t; Key parameters; Statistics; Modelling; Predictions.
Acknowledgements
• Micro B3 Partners
  - Bremen: MPI, AWI, Marum, University Bremen, Jacobs University
  - WP Bioinformatics: EBI, Interworks, CNRS
• Microbial Genomics Group
  - Frank Oliver Glöckner
  - Julia Schnetzer, Antonio Fernandez-Guerra, Michael Schneider
  - Pelin Yilmaz, Pier Luigi Buttigieg, Ivaylo Kostadinov
• Genomic Standards Consortium
Micro B3: Connected
Challenges in Environmental Bioinformatics
• Data
  - Quantity
  - Complexity
  - Heterogeneity
Problems
• Data processing
• Data management / Standardisation
• Quality management
• Data integration / Modelling / Prediction
• Access / Visualization
Data Integration: Marine Ecological Genomics Database (MegDb)
• Genomic Databases: EMBL, GenBank, DDBJ, Gold, CAMERA, NCBI Genome Projects, RefSeq, Moore Genomes
• Environmental Databases: World Ocean Atlas, World Ocean Database, SeaWiFS, Others
• Extract, Transform, Load
• Geo-referencing: x = longitude, y = latitude, z = depth, t = time
Types of Sequence Data
• Genomic DNA
  - Stores hereditary information
  - Encodes information as a sequence of 4 different bases: Adenine, Thymine, Cytosine, Guanine
  - Example: ACGATCGACTGAC
  - Alphabet size = 4, up to 15
  - Lengths between a few thousand and billions
  - Genomic DNA can be repetitive
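An alphabet of size 4 means each base fits into 2 bits. The Python sketch below shows that fixed-width packing as a baseline, assuming only the unambiguous bases A, C, G and T. It is not the PostBIS encoding: the function names are hypothetical, and sequences that use the larger alphabet of up to 15 symbols would need 4 bits per symbol with this approach.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(seq):
    """Pack an A/C/G/T string into bytes at 2 bits per base.

    Returns the packed bytes together with the number of bases, which is
    needed to restore leading A's (they encode as zero bits).
    """
    value = 0
    for base in seq:
        value = (value << 2) | CODE[base]
    n_bytes = (2 * len(seq) + 7) // 8
    return value.to_bytes(n_bytes, "big"), len(seq)


def unpack(data, length):
    """Inverse of pack: restore the original string from bytes and base count."""
    value = int.from_bytes(data, "big")
    return "".join(
        BASES[(value >> (2 * (length - 1 - i))) & 3] for i in range(length)
    )


packed, n_bases = pack("ACGATCGACTGAC")  # the example sequence from the slide
print(len(packed), "bytes for", n_bases, "bases")
print(unpack(packed, n_bases))
```

Going below 2 bits per base requires exploiting redundancy in the sequence, for example its repetitiveness, which fixed-width packing cannot do.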
Types of Sequence Data
• Short Sequences
  - Short read DNA: from 50 to 10,000 bases long
  - RNA: similar to short read DNA
  - Protein: alphabet of 20 to 23; at maximum thousands long
Kilobyte per Day per Machine
PostBIS: Sequence Data Compression
• Master Thesis: Michael Schneider
• PostgreSQL extension
  - In-database sequence compression
  - Special data types
  - Special functions
PostBIS Performance. Chart labels: Short again, Genomic DNA, Short, Alignments
PostBIS Performance
PostBIS Performance
Substring Performance
Substring Performance