Micro B3 Information System: Bringing sequence data into environmental context. Microbial Genomics and Bioinformatics Research Group. Renzo Kottmann, rkottman@mpi-bremen.de, @renzokott. Hinxton, 2014-03-27

Ecosystem Perspective

Data Perspective. Omics Data: genomes, metagenomes, transcriptomes, marker genes, proteomes. Environmental Data: latitude, longitude, collection date, depth, water currents, temperature.

Data Perspective. Omics Data: genomes, metagenomes, transcriptomes, marker genes, proteomes. Environmental Data: latitude, longitude, collection date, depth, water currents, temperature. Result: Relationship.

Data Flow Perspective (figure). Stage labels: Field, Laboratory, Computing, Archival, Integration, Web Access, Study, Knowledge. Omics Data: genomes, metagenomes, transcriptomes, marker genes, proteomes. Environmental Data: latitude, longitude, collection date, depth, water currents, temperature. Result: Relationship.

Data Flow Perspective: Issues (figure). Stage labels: Field, Laboratory, Computing, Archival, Integration, Web Access, Study, Knowledge. Omics Data: genomes, metagenomes, transcriptomes, marker genes, proteomes. Environmental Data: latitude, longitude, collection date, depth, water currents, temperature. Issues: Quantity, Heterogeneity, Complexity.

Data Integration (figure). Stage labels: Field, Laboratory, Computing, Archival, Integration, Web Access, Study, Knowledge. Omics Data: genomes, metagenomes, transcriptomes, marker genes, proteomes. Environmental Data: latitude, longitude, collection date, depth, water currents, temperature. Integration + Analysis. Result: Relationship.

Data Integration: Geo-referencing (figure). x = longitude, y = latitude, z = depth, t = collection date. Stage labels: Field, Laboratory, Computing, Archival, Integration, Web Access, Study, Knowledge. Omics Data: genomes, metagenomes, transcriptomes, marker genes, proteomes. Environmental Data: water currents, temperature. Integration + Analysis. Result: Relationship.
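
The geo-referencing idea on this slide, keying every record by x = longitude, y = latitude, z = depth and t = collection date, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `GeoRef`, `surface_distance_km` and `nearest`, and all sample values, are hypothetical and not taken from the Micro B3 code base.

```python
from dataclasses import dataclass
from datetime import date
from math import asin, cos, radians, sin, sqrt


@dataclass(frozen=True)
class GeoRef:
    """Hypothetical (x, y, z, t) key shared by omics and environmental records."""
    x: float  # longitude in decimal degrees, -180..180
    y: float  # latitude in decimal degrees, -90..90
    z: float  # depth in metres below the surface, >= 0
    t: date   # collection date

    def __post_init__(self):
        # Reject coordinates that cannot be a valid position.
        if not -180.0 <= self.x <= 180.0:
            raise ValueError(f"longitude out of range: {self.x}")
        if not -90.0 <= self.y <= 90.0:
            raise ValueError(f"latitude out of range: {self.y}")
        if self.z < 0:
            raise ValueError(f"depth must be non-negative: {self.z}")


def surface_distance_km(a: GeoRef, b: GeoRef) -> float:
    """Great-circle (haversine) distance between two positions, ignoring depth."""
    earth_radius_km = 6371.0
    dlat = radians(b.y - a.y)
    dlon = radians(b.x - a.x)
    h = sin(dlat / 2) ** 2 + cos(radians(a.y)) * cos(radians(b.y)) * sin(dlon / 2) ** 2
    return 2 * earth_radius_km * asin(sqrt(h))


def nearest(sample: GeoRef, measurements: list) -> tuple:
    """Return the (GeoRef, value) pair whose position is closest to the sample."""
    return min(measurements, key=lambda m: surface_distance_km(sample, m[0]))


# Illustrative values only: one sequence sample and two temperature readings.
sample = GeoRef(x=7.9, y=54.18, z=1.0, t=date(2014, 6, 21))
temperatures = [
    (GeoRef(x=8.0, y=54.2, z=1.0, t=date(2014, 6, 21)), 14.2),
    (GeoRef(x=-70.0, y=41.5, z=1.0, t=date(2014, 6, 21)), 18.9),
]
closest_ref, closest_temp = nearest(sample, temperatures)
```

The sketch matches on surface position only; a real integration would presumably also constrain depth and time, which is left out here to keep the example short.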

Micro B3: Biodiversity, Bioinformatics, Biotechnology (figure labels: Field, Laboratory, Computing, Archival, Integration, Web Access, Study, Knowledge)

Micro B3: Biodiversity, Bioinformatics, Biotechnology. Micro B3 Information System

Definition: Information System
• information system, an integrated set of components for collecting, storing, and processing data and for delivering information, knowledge, and digital products. (http://www.britannica.com/EBchecked/topic/287895/information-system, last visit 2013-03-13)

Information System: Logic View. Collecting, storing, and processing data and delivering information (modified from http://martinfowler.com/articles/bigData/)

Information System: Process View (modified from http://martinfowler.com/articles/bigData/)

Information System: Process View – Data Convergence
• How to combine heterogeneous data?
• How to gain useful data?
• How to gather data?
• How to find relevant data?

Information System: Process View – Data Divergence
• How to enhance data?
• How to find relevant patterns?
• How to visualize and operationalize information for knowledge creation?

Information System: Science driven (figure: Scientists + Generate = knowledge)
• Which data?
• How to process and analyze?
• What is the geographic and environmental distribution of my gene?
• How to visualize and operationalize information for knowledge creation?

So why all that?
• To paraphrase Captain Kirk in the Star Trek episode “A Taste of Armageddon”: “Data is a messy business, a very, very messy business.”
• “… as much as 60 percent of the time I spend on data analysis is focused on preparing the data for analysis.”
  - R in Action: Data analysis and graphics with R, by Robert I. Kabacoff

Gathering & Services
• Data Tracking: How to track the geographic and environmental origin of DNA sequence data?
• Data Services: How to analyze, visualize and interpret the sequence data in an environmental context?

Information System: Science driven (figure: Scientists + Generate = knowledge)
• Which data?
• How to process and analyze?
• What is the geographic and environmental distribution of my gene?
• Data Tracking: OSD App, OSD Server
• Data Services: Workflows, EATME, ProX

Part I: Data tracking Generate, Harvest and Filter

Generate

Global Sampling Event: fixed in time, orchestrated. June 21st 2014, www.oceansamplingday.org. Standardized protocols; contextual data; microbial diversity & function; legal framework (ABS, MTA, DTA).

Ocean Sampling Day
• Global
• Standardized
• Orchestrated
• Sampling event fixed in time
  - June 21st 2014, www.oceansamplingday.org

Information System: Process View (figure: Scientists + Generate = knowledge)

Harvest

Ocean Sampling Day App
Early, consistent, digital acquisition of environmental data
https://itunes.apple.com/us/app/osd-citizen/id834353532?mt=8
https://play.google.com/store/apps/details?id=com.iw.esa

Features
• Allows data to be recorded in the field
  - No internet connection needed
  - GSC standards compliant

Entering Data

OSD-App-Server

Login: Please use Twitter, Facebook, or Google
• Advantage
  - You do not need another password
  - We do not get your password
(Screenshot labels: Out of order; Just works)

Filter

Data Analysis in Micro B3 (Frank Oliver)

www.arb-silva.de/ngs

Integrate

Heterogeneity: Oceanographic Data

ELT

Database Development
• PostBIS (Hamburg University)
  - Efficient storage and retrieval of DNA sequence data
  - <2 bits per nucleotide base
  - 500x faster substring operation
• rasdaman (Jacobs University)
  - Store and retrieve multi-dimensional raster data of unlimited size
  - Enhancements to SQL interface
  - http://rasdaman.eecs.jacobsuniversity.de/trac/rasdaman
• PANGAEA (MARUM / University Bremen)
  - Lucene-based search index
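
To give a feel for the storage figure quoted for PostBIS, here is a minimal Python sketch of plain 2-bit packing of A/C/G/T, four bases per byte. It is an assumption-laden baseline, not PostBIS code: it uses exactly 2 bits per base, so going below 2 bits as the slide claims would need additional compression that is not shown here. The function names are hypothetical.

```python
# Plain 2-bit encoding of the four-letter DNA alphabet (illustrative baseline only).
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(seq: str) -> bytes:
    """Pack an A/C/G/T string into bytes, four bases per byte."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        out[i // 4] |= CODE[base] << (2 * (i % 4))
    return bytes(out)


def substring(data: bytes, start: int, stop: int) -> str:
    """Decode only bases start..stop-1, without unpacking the whole sequence."""
    return "".join(
        BASES[(data[i // 4] >> (2 * (i % 4))) & 0b11] for i in range(start, stop)
    )


def unpack(data: bytes, length: int) -> str:
    """Recover the first `length` bases from the packed bytes."""
    return substring(data, 0, length)


packed = pack("ACGATCGACTGAC")  # 13 bases fit into 4 bytes
```

Because every base sits at a fixed bit offset, a substring can be decoded by touching only the bytes it spans, which is one plausible reason a packed representation can make substring operations cheap.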

Part II: Data Services Augment, Analyze and Interpret (Act)

Augment

Analyse (ecologically)

FUNCTIONAL TRAIT-BASED ANALYSIS OF AQUATIC MICROBIAL COMMUNITIES

Functional Traits
A functional trait is a well-defined, measurable property of organisms that strongly influences performance.
• Direct link to ecosystem functioning
• Ecological tradeoffs
• What organisms do, and how many types are needed to maintain ecosystem functioning
Reiss et al. (2009)

Examples of Metagenomic Traits
• GC (guanine-cytosine) content (mean and variance)
  - Related to genome size, environmental complexity and community composition.
• Functional and phylogenetic diversity
  - Related to metabolic potential, community composition and environmental biogeochemistry.
• Dinucleotide frequency
  - Related to phylogenetic composition.
Explore community traits as ecological markers in microbial metagenomes (Barberan, Fernandez et al. 2012).
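
Two of the traits listed above, GC content (mean and variance) and dinucleotide frequency, are simple enough to sketch directly. The Python below is an illustration only, not the traits-analysis workflow or the method of Barberan, Fernandez et al.; it assumes GC content is computed per read and summarised across reads, and it does not treat ambiguous bases specially. The function names and the toy reads are hypothetical.

```python
from collections import Counter
from statistics import mean, pvariance


def gc_content(read: str) -> float:
    """Fraction of G and C bases in one read (0.0 for an empty read)."""
    if not read:
        return 0.0
    return sum(base in "GC" for base in read.upper()) / len(read)


def gc_mean_variance(reads):
    """Mean and population variance of per-read GC content."""
    values = [gc_content(r) for r in reads]
    return mean(values), pvariance(values)


def dinucleotide_frequencies(reads):
    """Relative frequency of each overlapping dinucleotide across all reads."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        counts.update(read[i:i + 2] for i in range(len(read) - 1))
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()} if total else {}


# Toy reads, made up for the example.
reads = ["ACGT", "GGCC", "ATAT"]
gc_mean, gc_var = gc_mean_variance(reads)
dinuc = dinucleotide_frequencies(reads)
```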

The Metagenomic Trait Workflow(s)
• Upstream: calculating traits (traits-analysis workflow)
• Downstream: calculating statistics (traits-statistics workflow)
R scripts perform multivariate statistical analyses using the vegan package and plot the results using ggplot2.

What is a Workflow?
• Describes what you want to do, rather than how you want to do it
• Simple language specifies how processes fit together
(Figure labels: Sequence; RepeatMasker web service; GenScan web service; BLAST web service; predicted genes out)
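
The "what, not how" idea can be sketched as a list of steps whose outputs feed the next step. The Python below is a toy under stated assumptions: the three step functions are stand-ins that only mimic the shape of a mask, predict, annotate chain and do not call RepeatMasker, GenScan or BLAST, and `run_workflow` is a hypothetical helper, not a Taverna API.

```python
from functools import reduce
from typing import Callable, List

Step = Callable[[dict], dict]


def mask_repeats(data: dict) -> dict:
    """Placeholder for a repeat-masking service: lower-case bases become N."""
    masked = "".join("N" if c.islower() else c for c in data["sequence"])
    return {**data, "masked": masked}


def predict_genes(data: dict) -> dict:
    """Placeholder for a gene-prediction service: unmasked stretches count as 'genes'."""
    genes = [g for g in data["masked"].split("N") if g]
    return {**data, "genes": genes}


def annotate(data: dict) -> dict:
    """Placeholder for a similarity-search service: records each gene's length."""
    return {**data, "annotations": {g: len(g) for g in data["genes"]}}


def run_workflow(steps: List[Step], data: dict) -> dict:
    """Feed the output of each step into the next, in the declared order."""
    return reduce(lambda acc, step: step(acc), steps, data)


# The workflow itself is just the declared order of steps.
workflow = [mask_repeats, predict_genes, annotate]
result = run_workflow(workflow, {"sequence": "ACGTacgtGGCC"})
```

Swapping or inserting a step changes only the `workflow` list, which is the property a workflow system builds on.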

What is Taverna?
• Workflow management system
  - Sophisticated analysis pipelines
  - A set of services to analyse or manage data (either local or remote)
• Data flow through services
• Control of service invocation

Taverna Workflows
• Enhance
  - Interoperability
  - Integration
  - Collaboration
• Ease
  - Access to distributed and local resources
  - Automation of data flow
  - Provenance
• Function
  - Experimental protocols

Workflows can be good for…
• High-throughput analysis
  - Transcriptomics, proteomics, next-generation sequencing
• Data integration, data interoperation
• Data management
  - Model construction
  - Data format manipulation
  - Database population

Taverna Workbench
• Workflow engine to run workflows
• List of services
• Construct and visualise workflows
• Web services (e.g. KEGG)
• Scripts (e.g. Beanshell, R)
• Programming libraries (e.g. libSBML)

“Thanks to the workflow now everybody can do it.” http://portal.biovel.eu/ Antonio Fernàndez-Guerra

Discovery: knowns, known unknowns and unknowns
Pelagibacter ubique proteome-centered subnetwork (figure labels):
• Cluster 1800572: unknown unknown
• SAR11_0487: tryptophan synthase
• SAR11_1266: hypothetical protein
• SAR11_0686: hypothetical protein
• SAR11_1277: aspartate racemase
Antonio Fernandez, submitted

Interpret (Act)

Complexity The real world is complex. Data reflects the real world and we have to deal with it.

Data Access: Software Services

Ecological Analysis Tools for Microbial Ecology (EATME)

Metagenomic Network Analysis
Enable the community of scientists to interact with the data (figure labels):
• Cluster 1800572: unknown unknown
• SAR11_0487: tryptophan synthase
• SAR11_1266: hypothetical protein
• SAR11_0686: hypothetical protein
• SAR11_1277: aspartate racemase

Data Access: Visualization of unknown networks

ProX
• Master thesis: Matthias Stock (Hochschule Bremen)
• Efficient web-based, large-scale visualization of networks
  - Outperforms state-of-the-art web tools

Information System: Process View (figure: Scientists + Generate = knowledge; EATME)

Information System: Process View (figure: Scientists + Generate = knowledge; EATME)
• Which data?
• How to process and analyze?
• What is the geographic and environmental distribution of my gene?
• Data Tracking: OSD App, OSD Server
• Data Services: Workflows, EATME, ProX

Take home messages
• Information Systems
  - Integrated set of tools
  - Added value services
  - Cut down data preparation time and costs
Keep the data flowing

Outro

Megx.net / Micro B3 is Open Source
• Subversion: https://projects.mpi-bremen.de/micro-b3/svn/
• Source Code Browser: https://colab.mpi-bremen.de/source/
• Wiki: https://colab.mpi-bremen.de/wiki
• Issue Tracker: https://colab.mpi-bremen.de/its/

Thanks for your attention
http://www.microb3.eu
http://www.oceansamplingday.org
http://twitter.com/Micro_B3
1st Marine Board Forum: Marine Data Challenges: from Observation to Information


IV. Proof of concept: Global Ocean Sampling Expedition metagenomes
• knowns: PFAM annotation of 53 GOS sampling sites (7523471 reads); 5653491 reads could have a PFAM assigned (15528086 hits)
• unknowns: 6-frame translation of 1869980 unknown reads (8884278 translated reads > 60 aa)
  - Hierarchical clustering: 90%: 7681220; 60%: 6689553
  - 5759646 singletons removed
  - 929907 unknowns
• 16S rDNA: 9190 16S rDNA (7119 @ 97%)
• PFAM: 6903 (13672); Unknowns: 9925 (929907); 16S rDNA: 347 (7119)
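
The counts on this slide appear to be linked by simple subtraction. The short Python check below reproduces that arithmetic; reading the figures this way is an inference from the numbers matching, not something the slide states, and the variable names are hypothetical.

```python
# Figures copied from the slide; the relationships between them are inferred.
total_reads = 7_523_471    # reads from 53 GOS sampling sites
pfam_reads = 5_653_491     # reads with a PFAM assignment
unknown_reads = total_reads - pfam_reads

count_at_60 = 6_689_553    # count reported at the 60% clustering level
singletons = 5_759_646     # singletons removed
unknowns_kept = count_at_60 - singletons

pfam_fraction = pfam_reads / total_reads  # share of reads with a PFAM hit
```

Under this reading, roughly three quarters of the reads receive a PFAM assignment and the remaining quarter feeds the unknowns branch.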

Network Analysis
• Graphical Gaussian Model
  - Co-occurrence of unknown and known genes
  - Techniques similar to Web 2.0 social network analysis
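
In a Graphical Gaussian Model, an edge between two variables corresponds to a non-zero partial correlation, that is, a correlation that remains after conditioning on the other variables. The Python sketch below shows the three-variable special case only; it is not the analysis used for the networks in these slides, and the profiles are made-up, already centred numbers.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / sqrt(va * vb)


def partial_correlation(x, y, z):
    """Correlation of x and y after removing the linear effect of z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Toy centred profiles across four samples: both genes follow a shared driver.
driver = [1, 1, -1, -1]
gene_a = [2, 0, 0, -2]
gene_b = [2, 0, -2, 0]
plain = pearson(gene_a, gene_b)
conditioned = partial_correlation(gene_a, gene_b, driver)
```

Here the two genes look correlated on their own, but the association vanishes once the shared driver is conditioned on, so a model of this kind would draw no direct edge between them.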

OSGi framework
• Bundles (modules)
• Execution environment
• Application life cycle
• Services
  - Service registry
• Applications share the same JVM
  - Isolation/security

Components
• ~20 components
• >50 OSGi bundles
  - Should be divided into >100

Guiding basic ecological questions
• “Who is out there and where?”
  - In terms of sequenced genomes and key genes
  - In terms of gene profiles
• “What can they do?”
  - In terms of gene functions
• “Under which environmental conditions?”

Megx.net: Data Portal for Microbial Ecological GenomiX
• Integrates geo-referenced data on
  - Bacterial, archaeal and phage genomes
  - Metagenomes, and
  - 16S rDNA-based diversity data
• Offers web-based tools for visualization and analysis
• http://www.megx.net
Kottmann et al. NAR 2010

Who is out there and where? (in terms of sequenced genomes, metagenomes and key genes) Kottmann et al. NAR 2010

Micro B3 Information System

Contextual Data Flow – Mobile App

Exploring Ecosystems Biology (figure labels: Knowledge; x, y, z, t; key parameters; statistics; modelling; predictions)

Acknowledgements
• Micro B3 Partners
  - Bremen: MPI, AWI, MARUM, University Bremen, Jacobs University
  - WP Bioinformatics: EBI, Interworks, CNRS
• Microbial Genomics Group
  - Frank Oliver Glöckner
  - Julia Schnetzer, Antonio Fernandez-Guerra, Michael Schneider
  - Pelin Yilmaz, Pier Luigi Buttigieg, Ivaylo Kostadinov
• Genomic Standards Consortium

Micro B3: Connected

Challenges in Environmental Bioinformatics
• Data
  - Quantity
  - Complexity
  - Heterogeneity

Problems
• Data processing
• Data management / standardisation
• Quality management
• Data integration / modelling / prediction
• Access / visualization

Data Integration: Marine Ecological Genomics Database (MegDb)
• Genomic databases: EMBL, GenBank, DDBJ, GOLD, CAMERA, NCBI Genome Projects, RefSeq, Moore Genomes
• Environmental databases: World Ocean Atlas, World Ocean Database, SeaWiFS, others
• Extract, Transform, Load
• Geo-referencing: x = longitude, y = latitude, z = depth, t = time
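
The Extract, Transform, Load step with geo-referencing can be sketched as mapping differently shaped source records onto one shared (x, y, z, t) key. The Python below is a minimal illustration under stated assumptions: the field names, the two transform functions and the sample records are hypothetical and do not reflect the actual MegDb schema or any of the listed source databases.

```python
from datetime import date


def from_genomic(record: dict) -> dict:
    """Transform a hypothetical genomic-database record into the shared schema."""
    return {
        "source": "genomic",
        "x": float(record["lon"]),
        "y": float(record["lat"]),
        "z": float(record.get("depth_m", 0.0)),
        "t": date.fromisoformat(record["collection_date"]),
        "payload": record["accession"],
    }


def from_environmental(record: dict) -> dict:
    """Transform a hypothetical environmental-database record into the shared schema."""
    return {
        "source": "environmental",
        "x": float(record["longitude"]),
        "y": float(record["latitude"]),
        "z": float(record["depth"]),
        "t": date(record["year"], record["month"], record["day"]),
        "payload": record["temperature_c"],
    }


def load(rows, table: dict) -> dict:
    """Load rows into an in-memory table keyed by (x, y, z, t)."""
    for row in rows:
        table.setdefault((row["x"], row["y"], row["z"], row["t"]), []).append(row)
    return table


# Extract: two made-up source records that describe the same place and day.
genomic = [{"accession": "SAMPLE_1", "lon": "7.9", "lat": "54.18",
            "collection_date": "2014-06-21"}]
environmental = [{"longitude": 7.9, "latitude": 54.18, "depth": 0,
                  "year": 2014, "month": 6, "day": 21, "temperature_c": 14.2}]

table = {}
load((from_genomic(r) for r in genomic), table)
load((from_environmental(r) for r in environmental), table)
```

Exact key equality is the simplest possible join; real coordinates rarely match to the last digit, so a practical integration would presumably match within a tolerance in space and time.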

Types of Sequence Data
• Genomic DNA
  - Stores hereditary information
  - Encodes information as a sequence of 4 different bases: Adenine, Thymine, Cytosine, Guanine
  - Example: ACGATCGACTGAC
  - Alphabet size = 4, up to 15
  - Lengths between a few thousand and billions
  - Genomic DNA can be repetitive

Types of Sequence Data
• Short sequences
  - Short-read DNA: from 50 to 10,000 bases long
  - RNA: similar to short-read DNA
  - Protein: alphabet of 20 to 23; at most thousands long

Kilobyte per Day per Machine

PostBIS: Sequence Data Compression
• Master thesis: Michael Schneider
• PostgreSQL extension
  - In-database sequence compression
  - Special data types
  - Special functions

PostBIS Performance (chart labels: Short again Genomic DNA Short Alignments)

Substring Performance