From Biological Data to Biological Knowledge
Volker Stümpflen
Group for Biological Information Systems
MIPS / Institute for Bioinformatics
GSF – National Research Center for Environment and Health
TMRA 06
Something About Our Problem
- For a long time we focused on individual genes / proteins …
- … but humans, for example, do not have many more genes than “simple” organisms …
- … because complexity arises at the level of biological networks
We cannot understand anything without understanding the context
Small-Scale “Knowledge Generation”
- Accessing some of the several hundred (web) resources (publicly available data > 2 petabytes)
=> The required knowledge is compiled by hand
Large-Scale Assessment of Information and Knowledge
- R. Shamir et al., “Revealing modularity and organization in the yeast molecular network by integrated analysis of highly heterogeneous genomewide data”, PNAS, Vol. 101, No. 9, 2004, pp. 2981-2986
  - “To gain deeper understanding of the [biological] systems, it is pertinent to analyze heterogeneous data sources in a truly integrated fashion and shape the analysis results into one body of knowledge.”
  - “By integrating experimental data of heterogeneous sources and types, we are able to perform analysis on a much broader scope than previous studies.”
Technical Problems
- Information integration from heterogeneous and distributed data sources (databases AND applications)
- Solvable with n-tier architectures
  - E.g. GenRE at MIPS
  - J2EE-based middleware
  - Enterprise Java Beans (EJBs) and Web Services (WS)
Semantic Problems
- Sloppy definitions, e.g. “Gene has Function”
- Homonym / synonym problems, e.g. gene identifiers
- Ambiguity of terms
- Differences in the meaning of terms between biological communities
- Results of in-vitro experiments often differ within the experimental scope (e.g. protein interactions)
- …
Strategies
- Complete semantic annotation of (all) resources
  - Funding?
  - Data models?
- Modeling of individual domains
  - Suited for biologists (Topic Maps)
  - Access to relevant data sources
  - Merging of individual domains to obtain the “complete picture”
Static Generation of Topic Maps
[Diagram: Extract -> TM4J -> XTM file -> Omnigator]
Advantages:
- Highly flexible data model
- Straightforward process
- Intuitive user interface
- Finding the right information is easy
Drawbacks:
- Topic maps tend to be very large
- Redundant information in DBs and topic map files
- Update problems
=> Dynamic generation of topic maps
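The static route on this slide can be illustrated with a minimal sketch that writes extracted records into an XTM 1.0 document. It is an assumption-laden stand-in: the slide's pipeline uses TM4J (a Java library), whereas this sketch uses only Python's standard `xml.etree.ElementTree`. The function name `build_xtm`, the record layout and the example rows are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespaces defined by the XTM 1.0 syntax.
XTM_NS = "http://www.topicmaps.org/xtm/1.0/"
XLINK_NS = "http://www.w3.org/1999/xlink"


def build_xtm(records):
    """Serialize (topic_id, type_id, name) tuples as an XTM 1.0 document.

    Hypothetical helper: every extracted row becomes one <topic>, which is
    why a statically generated map grows with the size of the database.
    """
    ET.register_namespace("", XTM_NS)
    ET.register_namespace("xlink", XLINK_NS)
    root = ET.Element(f"{{{XTM_NS}}}topicMap")
    for topic_id, type_id, name in records:
        topic = ET.SubElement(root, f"{{{XTM_NS}}}topic", {"id": topic_id})
        instance_of = ET.SubElement(topic, f"{{{XTM_NS}}}instanceOf")
        ET.SubElement(
            instance_of,
            f"{{{XTM_NS}}}topicRef",
            {f"{{{XLINK_NS}}}href": f"#{type_id}"},
        )
        base_name = ET.SubElement(topic, f"{{{XTM_NS}}}baseName")
        ET.SubElement(base_name, f"{{{XTM_NS}}}baseNameString").text = name
    return ET.tostring(root, encoding="unicode")


# Made-up rows standing in for a database extract.
records = [
    ("protein-1", "protein", "Example protein 1"),
    ("protein-2", "protein", "Example protein 2"),
]
xtm_text = build_xtm(records)
```

Because the whole extract is materialized in one file, any change in the source database requires regenerating the file, which is the update problem listed among the drawbacks.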
Dynamic Topic Map Generation
- Dynamic information retrieval via EJBs / Web Services
  - Each topic type is mapped to an EJB / Web Service
  - Each association is also represented by an EJB / Web Service
- Straightforward extension of the data model
  - User adjustments are possible afterwards
- Intuitive navigation of related information
[Diagram: the topics “Protein” and “EC Number” are linked by the association “is associated to”; they are served by the Protein Web Services, the EC Number Web Services and the Protein – ECNum Association Web Services]
Interface Definition
- Information retrieval via EJBs (Web Services)
  - Each topic type is mapped to an EJB / WS
  - Each association type is also represented by an EJB / WS
- Straightforward extension of the data model
  - User adjustments are possible afterwards
… …
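The mapping of topic types and association types to services can be sketched as a small registry. This is a language-neutral illustration in Python, not the EJB / Web Service interfaces of the framework, which the slides do not show. All names (`Topic`, `Association`, `FragmentBuilder`, the stub services and their placeholder values) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Topic:
    topic_type: str
    topic_id: str
    properties: Dict[str, str] = field(default_factory=dict)


@dataclass
class Association:
    assoc_type: str
    source_id: str
    target_id: str


class FragmentBuilder:
    """Hypothetical registry: one service per topic type and per association type."""

    def __init__(self) -> None:
        self._topic_services: Dict[str, Callable[[str], Topic]] = {}
        # association type -> (source topic type, service)
        self._assoc_services: Dict[str, Tuple[str, Callable[[str], List[Association]]]] = {}

    def register_topic_type(self, topic_type, service):
        self._topic_services[topic_type] = service

    def register_association_type(self, assoc_type, source_type, service):
        self._assoc_services[assoc_type] = (source_type, service)

    def fragment(self, topic_type, topic_id):
        """Build a topic map fragment for one topic on demand."""
        topic = self._topic_services[topic_type](topic_id)
        associations: List[Association] = []
        for source_type, service in self._assoc_services.values():
            if source_type == topic_type:
                associations.extend(service(topic_id))
        return topic, associations


# Stub services standing in for remote calls; the returned values are placeholders.
def protein_service(topic_id):
    return Topic("Protein", topic_id, {"Description": "placeholder"})


def protein_ecnum_service(protein_id):
    return [Association("is associated to", protein_id, "ec-placeholder")]


builder = FragmentBuilder()
builder.register_topic_type("Protein", protein_service)
builder.register_association_type("Protein-ECNum", "Protein", protein_ecnum_service)
topic, associations = builder.fragment("Protein", "protein-1")
```

Under this design, extending the data model amounts to registering one more service; nothing is stored in a topic map file, so the fragment always reflects the current state of the sources.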
DTMG Architecture (Extension of GenRE)
[Architecture diagram by K. Nenova.
Tiers: Web Presentation Tier, Semantic Tier, Syntax Tier, Integration Tier, Enterprise Information System Tier.
Components: XSL Transformation, other types of clients, TopicMap Manager, ProteinTopicType EJB, ProteinPfam AssType EJB, Protein Extractor, ProteinPfam Extractor, Resource Manager, GISE EJBs, SIMAP Access EJB.
Data sources: PEDANT DBs (e.g. Arabidopsis thaliana), SIMAP, FunCat.]
“Worst Case” Example
- Combination of two large resources at MIPS
  - Annotated proteins: calculated properties of genes / proteins from various organisms
  - Orthologs: calculated similarities of proteins (all against all)
K. Nenova / R. Gregory
Large-Scale Annotation with PEDANT (Protein Extraction, Description and Analysis Tool)
- Currently covers > 400 genomes
- ~ 1000 by the end of this year
SIMAP: Precalculated Sequence Homologies
[Diagram: SIMAP database and file server, NFS server, grid master and grid execution hosts on the LAN; web server for external users (MIPS + WWW users); SIMAP clients on Linux, Windows and Mac connected over the Internet via BOINC daemons / BOINC core]
- 450 proteomes
- 4 sequence collections
- 7.5 million protein entries
- 3.5 million sequences
- 8 billion FASTA hits
- BOINC: 12600 hosts, 2.3 TeraFLOPS
R. Arnold, T. Rattei, P. Tischler, V. Stümpflen, M-D. Truong and HW. Mewes; Bioinformatics, in press
Topic Map Schema
[Schema diagram.
Topic types: Protein, Genome, Contig, PFAM Domain, FunCat, EC Number.
Association labels: “is represented by”, “contains”, “belongs”, “has”, “has orthologs”, “is associated to”.
Other labels: Description, Name, Length, Molecular Weight, Sequence, Classification, Domain, Status, Strain, Taxonomy Id, Pedant URL, Pfam URL, KEGG URL, FunCat URL.]
Some Screenshots
[Screenshots; label: Context]
Improvements
- Parallel searches based on Message Driven Beans
[Diagram by R. Gregory, with steps numbered 1 to 4. Elements: Stateless Session Bean, Search Queue, Request Message, one Message Driven Bean per Pedant DB (Pedant DB 1, 2, 3, … n), Database Connection, Response Message, Response Queue]
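The fan-out idea behind the parallel search can be sketched as follows. This is not JMS or EJB code: a thread pool from Python's standard `concurrent.futures` stands in for the search queue and the message-driven beans, and the merged list stands in for the response queue. The function names and the toy database contents are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def search_database(db_name, rows, term):
    """Search one database; stands in for one message-driven bean."""
    return [(db_name, row) for row in rows if term.lower() in row.lower()]


def parallel_search(databases, term, max_workers=4):
    """Send the same request to every database and merge the responses."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(search_database, name, rows, term)
            for name, rows in databases.items()
        ]
        hits = []
        # Collect in submission order so the merged result is deterministic.
        for future in futures:
            hits.extend(future.result())
    return hits


# Toy stand-ins for the individual Pedant databases.
databases = {
    "db1": ["kinase alpha", "transporter"],
    "db2": ["Kinase beta"],
    "db3": ["hydrolase"],
}
hits = parallel_search(databases, "kinase")
```

The total latency is then bounded by the slowest database rather than by the sum of all searches, which is the motivation for distributing the request.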
Further Improvements
- More maps
  - Diseases, metabolism
- Combination with text mining
- Inference engines, reasoners
- …
“Computer: Show me all proteins in Mus musculus involved in transmembrane signal transduction and show me the orthologs in Rattus norvegicus”
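The envisioned query combines three kinds of links: protein to genome, protein to functional category, and protein to orthologs. A minimal sketch over an in-memory toy structure shows the traversal; the function name, the dictionary layout and every identifier in the data are made up for illustration and do not come from PEDANT or SIMAP.

```python
def proteins_with_orthologs(proteins, orthologs, organism, category, target_organism):
    """Return {protein_id: [ortholog ids in target_organism]} for matching proteins."""
    result = {}
    for protein_id, info in proteins.items():
        if info["organism"] == organism and category in info["categories"]:
            result[protein_id] = sorted(
                ortholog_id
                for ortholog_id in orthologs.get(protein_id, ())
                if proteins[ortholog_id]["organism"] == target_organism
            )
    return result


# Toy data with invented identifiers.
proteins = {
    "m1": {"organism": "Mus musculus", "categories": {"transmembrane signal transduction"}},
    "m2": {"organism": "Mus musculus", "categories": {"metabolism"}},
    "r1": {"organism": "Rattus norvegicus", "categories": {"transmembrane signal transduction"}},
    "h1": {"organism": "Homo sapiens", "categories": set()},
}
orthologs = {"m1": ["r1", "h1"], "m2": []}

answer = proteins_with_orthologs(
    proteins,
    orthologs,
    organism="Mus musculus",
    category="transmembrane signal transduction",
    target_organism="Rattus norvegicus",
)
```

In a dynamically generated topic map, each of the three lookups would be answered by the service behind the corresponding association type instead of a local dictionary.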
Conclusion
- Topic Maps are suitable for semantic information integration
- Development of a Dynamic Topic Map Generation (DTMG) framework
- Generation of fragments based on component- and service-oriented architectures
- Capable of yielding a deeper understanding of biological entities and systems in a truly integrated fashion
Acknowledgements
- Filka Nenova, Richard Gregory, Matthias Oesterheld, Roland Arnold, Octave Noubibou, Marisa Thoma, Konrad Schreiber, …
- Thomas Rattei
- Ulrich Güldener, Martin Münsterkötter
- Funding: Impuls- und Vernetzungsfonds der Helmholtz-Gemeinschaft Deutscher Forschungszentren e.V.