Semantic Web and Cloud Computing for Integrative Data Management and Analysis Infrastructure

Jonas Almeida
Professor @ Dept Bioinformatics and Computational Biology
Laboratory of Integrative Bioinformatics

RDF metadata linking URIs of raw data, processed data and processing services

[Figure: Computational Ecosystem. Labels: Data acquisition, Processing, Data analysis, Stores, REST protocol]

Jonas Almeida @ Univ Texas MDAnderson Cancer Ctr

RDF metadata linking URIs of raw data, processed data and processing services

Computational Ecosystem

Syntactic interoperability: REST, S(W)OA and Cloud computing
• Organic development of analytical software applications integrated with other initiatives/resources.
• Programmatic interoperability by exposing an API through REST.

Semantic interoperability
• Interoperability with legacy systems because they are special realizations of more generic RDF-based abstractions.

[Figure labels: REST protocol; RDF (Resource Description Framework) bus; merged representation of data structures and workflows]
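The idea of RDF metadata linking the URIs of raw data, processed data and processing services can be sketched as a small set of triples plus two lookups. This is only an illustration: every URI and every predicate name (`ex:derivedFrom`, `ex:producedBy`, `ex:acquiredBy`) is a hypothetical placeholder, not something taken from the slides.

```python
# All URIs and predicate names below are hypothetical placeholders.
PROCESSED = "http://example.org/data/processed/42"
RAW = "http://example.org/data/raw/7"

METADATA = [
    (PROCESSED, "ex:derivedFrom", RAW),
    (PROCESSED, "ex:producedBy", "http://example.org/services/normalize"),
    (RAW, "ex:acquiredBy", "http://example.org/services/scanner"),
]


def describe(uri, triples):
    """Collect every (predicate, object) pair stated about `uri`."""
    return [(p, o) for s, p, o in triples if s == uri]


def lineage(uri, triples, predicate="ex:derivedFrom"):
    """Follow `predicate` links from a processed resource back to raw data."""
    chain = [uri]
    seen = {uri}
    while True:
        parents = [o for s, p, o in triples if s == chain[-1] and p == predicate]
        # Stop when there is no parent, or when a cycle would be entered.
        if not parents or parents[0] in seen:
            return chain
        chain.append(parents[0])
        seen.add(parents[0])


provenance = lineage(PROCESSED, METADATA)
```

Because each resource is named by a URI, a REST client could in principle dereference any element of `provenance` to retrieve the data or call the service it names.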



[Figure: graph relating Rules (T-Box) and Statements (A-Box); edges labelled rel 0 through rel 6]

RDF - everything is a resource

Wang X, R Gorlitsky, and JS Almeida (2005) From XML to RDF: How Semantic Web Technologies Will Change the Design of ‘Omic’ Standards. Nature Biotechnology, Sep; 23(9): 1099-103 [PMID: 16151403].

A brief history of data

Flat text file (TXT) → XML structure (XML) → RDF triples (RDF)
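As a rough illustration of this progression, the sketch below carries one record through the three representations using only the Python standard library. The record, the element names and the `http://example.org/` base URI are invented for the example.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical record used only for illustration.
FLAT_TEXT = "sample_id\ttissue\tage_at_diagnosis\nS1\tlung\t61\n"


def parse_flat(text):
    """Flat text: meaning lives in the column order and a header row."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def to_xml(rows):
    """XML: the structure is explicit, but element names are still local."""
    root = ET.Element("samples")
    for row in rows:
        sample = ET.SubElement(root, "sample", id=row["sample_id"])
        for key, value in row.items():
            if key != "sample_id":
                ET.SubElement(sample, key).text = value
    return root


def to_triples(rows, base="http://example.org/"):
    """RDF-style triples: every subject and predicate is named by a URI."""
    triples = []
    for row in rows:
        subject = base + "sample/" + row["sample_id"]
        for key, value in row.items():
            if key != "sample_id":
                triples.append((subject, base + "vocab/" + key, value))
    return triples


rows = parse_flat(FLAT_TEXT)
xml_text = ET.tostring(to_xml(rows), encoding="unicode")
triples = to_triples(rows)
```

In the triple form each statement stands on its own, so statements from different sources can be merged by simple concatenation, which is not true of the flat file or the XML tree.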


[Figure: pyramid with levels Data, Information, Knowledge, Wisdom. Technologies shown alongside, from bottom to top: text files; XML; W3C RDF and SPARQL; OWL and OBO; N3 logic]

SPARQL
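SPARQL queries work by matching triple patterns that contain variables against an RDF graph. The sketch below shows that matching idea in plain Python; it is not a SPARQL engine, and the graph, the `ex:` names and the helper functions are all hypothetical.

```python
def match_pattern(pattern, triple, binding):
    """Try to extend `binding` so that `pattern` matches `triple`."""
    new_binding = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # Variable: bind it, or check it against an earlier binding.
            if new_binding.setdefault(term, value) != value:
                return None
        elif term != value:
            # Constant: must match exactly.
            return None
    return new_binding


def query(graph, patterns):
    """Return every variable binding that satisfies all patterns (a join)."""
    bindings = [{}]
    for pattern in patterns:
        next_bindings = []
        for binding in bindings:
            for triple in graph:
                extended = match_pattern(pattern, triple, binding)
                if extended is not None:
                    next_bindings.append(extended)
        bindings = next_bindings
    return bindings


GRAPH = [
    ("ex:array1", "ex:assayedSample", "ex:sample1"),
    ("ex:array2", "ex:assayedSample", "ex:sample2"),
    ("ex:sample1", "ex:tissue", "lung"),
    ("ex:sample2", "ex:tissue", "brain"),
]

# Roughly: SELECT ?array WHERE { ?array ex:assayedSample ?s . ?s ex:tissue "lung" }
results = query(GRAPH, [("?array", "ex:assayedSample", "?s"),
                        ("?s", "ex:tissue", "lung")])
```

The shared variable `?s` is what joins the two patterns, so the query can filter arrays by a property of a different resource without any schema-specific join code.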

RESTful web services; distinction of domain from instantiation

CLOUD COMPUTING

[Figure: CLOUD COMPUTING, showing S3DB and several nodes labelled S]

[Figure: RDF-bus. Labels: client-side applications; client-side services; server-side; http; RDF; SPARQL; reasoner]

[Figure: S3DB core model. Entities: Deployment, Project, Collection, Item, Rule, Statement, User. Relations: s3db:DP, s3db:PC, s3db:PR, s3db:CI, s3db:DU, s3db:UU, s3db:operator, s3db:Rsubject, s3db:Rpredicate, s3db:Robject, s3db:Ssubject, s3db:Spredicate, s3db:Sobject]

Rule (Attribute): [Csubj] [Ipred] [Cobj or Literal]
Statement (Value): [Isubj] [Rpred] [I or Literal]

S3db:Rule: Csubj Ipred {Cobj or Literal}.
S3db:Statement: Isubj {Csubj Ipred {Cobj or Literal}.} {Iobj or Literal}.
s3db:CI links each Item to its Collection.

[Figure: S3DB core model (Deployment, Project, Collection, Item, Rule, Statement, User and the s3db: relations among them), with the annotation: "Needed only if sharing with Project that is hosted by a distinct S3DB Deployment."]

Rule: [Csubj] [Ipred] [Cobj or Literal]
Statement: [Isubj] [Rpred] [I or Literal]
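The two triple templates on this slide, Rule as [Csubj] [Ipred] [Cobj or Literal] and Statement as [Isubj] [Rpred] [I or Literal], can be sketched with a few data classes. This is a simplified reading of the slide and not S3DB's implementation: the class and field names are hypothetical, the Rule predicate is reduced to a string, and `obj=None` is an assumed convention meaning "any literal".

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class Collection:
    name: str


@dataclass(frozen=True)
class Item:
    name: str
    collection: Collection  # the Collection-Item link (s3db:CI on the slide)


@dataclass(frozen=True)
class Rule:
    """Domain level: [Csubj] [Ipred] [Cobj or Literal]."""
    subj: Collection
    pred: str                   # simplification: the slide uses an Item here
    obj: Optional[Collection]   # None is taken to mean "a literal value"


@dataclass(frozen=True)
class Statement:
    """Instance level: [Isubj] [Rpred] [I or Literal]."""
    subj: Item
    pred: Rule
    value: Union[Item, str]

    def __post_init__(self):
        rule = self.pred
        if self.subj.collection != rule.subj:
            raise ValueError("subject Item is not in the Rule's subject Collection")
        if rule.obj is None:
            if isinstance(self.value, Item):
                raise ValueError("Rule expects a literal value")
        elif not isinstance(self.value, Item) or self.value.collection != rule.obj:
            raise ValueError("value must be an Item of the Rule's object Collection")


samples = Collection("Sample")
clinical = Collection("Clinical Information")
s1 = Item("sample-1", samples)
c1 = Item("clinical-1", clinical)

has_clinical = Rule(samples, "hasClinicalInfo", clinical)
age = Rule(clinical, "ageAtDiagnosis", None)

statements = [Statement(s1, has_clinical, c1), Statement(c1, age, "61")]
```

The point of the split is that Rules describe the domain once, while Statements are instances checked against those Rules, which matches the distinction of domain from instantiation made earlier in the deck.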


S3DB

S3DB: SPARQL, S3QL

[Figure: S3DB architecture. Labels: GUI, API, SPARQL, S3QL, index, http(s), DB]

Almeida JS, C Chen, R Gorlitsky, R Stanislaus, M Aires-de-Sousa, P Eleutério, JA Carriço, A Maretzek, A Bohn, A Chang, F Zhang, R Mitra, GB Mills, X Wang, HF Deus (2006) Data integration gets 'Sloppy'. Nature Biotechnology 24(9): 1070-1071. [PMID: 16964209].

[Figure labels: GUI, API, API]

Almeida JS et al. (2006) Nature Biotechnology 24(9): 1070-1071.

Snapshots of interfaces using S3DB’s API (Application Programming Interface). These applications exemplify why semantic web designs can be particularly effective at enabling generic tools to assist users in exploring data that document very specific and very complex relationships. Snapshot A was taken from S3DB’s web interface, which is included in the downloadable package. This interface was developed to assist in managing the database model and is therefore centered on the visualization and manipulation of the domain of discourse: its Collections of Items and the Rules defining the documentation of their relations. Snapshots B-D depict a document management tool, S3DBdoc, freely available as a Bioinformatics Station module (see Figure 6). Navigation starts from the Project (C), proceeds to the Collection (B), and ends with the editing of the Statements about an Item (D). Snapshot B illustrates an intermediate step in the navigation, where the list of Items (in this case samples assayed by tissue arrays, for which there is clinical information about the donor) is being trimmed according to the properties of a distant entity, Age at Diagnosis, which is a property of the Clinical Information Collection associated with the sample that originated the array results. This interaction would have been difficult and computationally intensive to manage using a relational architecture. The RDF-formatted query result produced by the API was also visualized using a commercial tool, Sentient Knowledge Explorer (IO-Informatics Inc), shown in snapshot E, and by Welkin (F), developed by the SIMILE digital interoperability project at the Massachusetts Institute of Technology. See text for discussion of graphic representations by these tools. To protect patient confidentiality, some values in snapshots B and D are scrambled, and numeric sample and patient identifiers elsewhere are altered.

PLoS ONE. Aug 13; 3(8): e2946


#111. Freire P, M Vilela, HF Deus, YW Kim, D Koul, H Colman, KD Aldape, O Bogler, WKA Yung, K Coombes, GB Mills, AT Vasconcelos, JS Almeida (2008) Exploratory Analysis of the Copy Number Alterations in Glioblastoma Multiforme. PLoS ONE 3(12): e4076. doi: 10.1371/journal.pone.0004076. [PMID 19115005].

[Figure labels: External Resources; Data Analysis Applications; Web Services; caBIG interoperable initiatives; BiS; S3DB; caIntegrator; QAQC + Assembly of data structures for efficient support of asynchronous GUIs]

BioinformaticStation.org (BiS)

The path backwards.

• Model ID: models, transfer functions [ y = f(x) ]
• Variable Selection: boosting, evolutionary algorithms, exhaustive search [ x ∈ X ]
• Discovery: self-described structures, ontologies, RDF, Description Logic, S3DB [ x ∈ [X, Z] ]

Models → Tools → Software Environment

2/5 Model ID

Lesson learned: predictive independent variables are a needle in the haystack.

#14. Almeida, J. S., M. A. M. Reis, M. J. T. Carrondo (1997) A Novel Unifying Kinetic Model of Denitrification. J. Theor. Biol. 186: 241-249. [doi: 10.1006/jtbi.1996.0352]
#31. Wolf G, Almeida JS, Pinheiro C, Correia V, Rodrigues C, Reis MAM, Crespo JG (2001) Two-dimensional fluorometry coupled with artificial neural networks: A novel method for on-line monitoring of complex biological processes. Biotechnology & Bioengineering 72(3): 297-306. [PMID: 11135199]
#36. Almeida, JS (2002) Predictive non-linear modeling of complex data by artificial neural networks. Curr. Op. Biotech. 13(1): 72-76. [PMID: 11849962]
#68. Mikhitarian, K., Gillanders, W. E., Almeida, J. S., Hebert Martin R., Varela J. C., Metcalf, J. S., Cole, D. J., and Mitas, M. (2005) An innovative microarray strategy identifies informative molecular markers for the detection of micrometastatic breast cancer. Clinical Cancer Research 11(10): 3697-704. [PMID: 15897566]
#72. Almeida JS, DJ McKillen, YA Chen, PS Gross, RW Chapman, G Warr (2005) Design and Calibration of Microarrays as Universal Transcriptomic Environmental Biosensors. Comparative and Functional Genomics, 6(3): 132-137(6). [doi: 10.1002/cfg.466].
#77. Garcia SP, Almeida JS (2005) Multivariate phase space reconstruction by nearest neighbor embedding with different time delays. Physical Review E 72, 027205. [PMID: 16196759].
#78. Oates JC, Varghese S, Bland AM, Taylor TP, Self SE, Stanislaus R, Almeida JS, Arthur JM (2005) Prediction of urinary protein markers in lupus nephritis. Kidney Int. Dec; 68(6): 2588-92 [PMID: 16316334].
#86. Geli P, P Rolghamre, JS Almeida, K Ekdahl (2006) Modeling Pneumococcal Resistance to Penicillin in Southern Sweden Using Artificial Neural Networks. Microbial Drug Resistance 12(3): 149-157. [PMID: 17002540]
#95. Wolf G, JS Almeida, JG Crespo, MA Reis (2007) An improved method for two-dimensional fluorescence monitoring of complex bioreactors. J Biotechnol. 128(4): 801-12. [PMID: 17291616].
#103. Sá-Leão R, Nunes S, Brito-Avô A, Alves CR, Carriço JA, Saldanha J, Almeida JS, Santos-Sanches I, de Lencastre H. (2008) High rates of transmission of and colonization by Streptococcus pneumoniae and Haemophilus influenzae within a day care center revealed in a longitudinal study. J Clin Microbiol. Jan; 46(1): 225-34. [PMID: 18003797]

3/5 Model ID → Variable Selection

#63. Almeida JS, R Stanislaus, E Krug, J Arthur (2005) Normalization and Analysis of residual variation in 2D Gel Electrophoresis for quantitative differential proteomics. Proteomics 5(5): 1242-9 [PMID: 15732138].
#64. Mitas M, JS Almeida, K Mikhitarian, WE Gillanders, DN Lewin, DD Spyropoulos, L Hoover, A Graham, T Glenn, P King, DJ Cole, R Hawes, CE Reed, BJ Hoffman (2005) Accurate discrimination of Barrett’s esophagus and esophageal adenocarcinoma using a quantitative three-tiered algorithm and multi-marker real-time RT-PCR. Clin Cancer Res. 2005 Mar 15; 11(6): 2205-14 [PMID: 15788668].
#83. Mueller M, Wagner CL, Annibale DJ, Knapp RG, Hulsey TC, Almeida JS (2006) Parameter selection for and implementation of a web-based decision-support tool to predict extubation outcome in premature infants. BMC Medical Informatics and Decision Making 6: 11 [PMID: 16509967].
#87. Almeida JS, Oates JC, Arthur JM. (2006) The need for concurrent calibration and discrimination statistics in predictive models. Kidney Int. 70(1): 231-2. [doi: 10.1038/sj.ki.5001519].
#89. Carrico JA, Silva-Costa C, Melo-Cristino J, Pinto FR, de Lencastre H, Almeida JS, Ramirez M. (2006) Illustration of a common framework for relating multiple typing methods by application to macrolide-resistant Streptococcus pyogenes. J Clin Microbiol. 44(7): 2524-32. [PMID: 16825375].
#91. Almeida, J. S., S. Vinga (2006) Computing distribution of scale independent motifs in biological sequences. Algorithms for Molecular Biology. 1: 18. [PMID: 17049089].
#96. Pinto FR, Carrico JA, Ramirez M, Almeida JS. (2007) Ranked Adjusted Rand: integrating distance and partition information in a measure of clustering agreement. BMC Bioinformatics 8(1): 44. [PMID: 17286861].
#102. Vinga S, Almeida JS. (2007) Local Renyi entropic profiles of DNA sequences. BMC Bioinformatics. 2007 Oct 16; 8(1): 393. [PMID: 17939871]

Lesson learned: critical co-variables are often found in other haystacks.

4/5 Model ID → Variable Selection → Discovery

#72. Almeida JS, DJ McKillen, YA Chen, PS Gross, RW Chapman, G Warr (2005) Design and Calibration of Microarrays as Universal Transcriptomic Environmental Biosensors. Comparative and Functional Genomics, 6(3): 132-137(6). [doi: 10.1002/cfg.466].
#76. Wang X, R Gorlitsky, and JS Almeida (2005) From XML to RDF: How Semantic Web Technologies Will Change the Design of ‘Omic’ Standards. Nature Biotechnology, Sep; 23(9): 1099-103 [PMID: 16151403].
#90. Almeida JS, C Chen, R Gorlitsky, R Stanislaus, M Aires-de-Sousa, P Eleutério, JA Carriço, A Maretzek, A Bohn, A Chang, F Zhang, R Mitra, GB Mills, X Wang, HF Deus (2006) Data integration gets 'Sloppy'. Nature Biotechnology 24(9): 1070-1071. [PMID: 16964209].
#104. Stanislaus R, JM Arthur, B Rajagopalan, R Moerschell, B McGlothlen, JS Almeida (2008) An open-source representation for 2-DE-centric proteomics and support infrastructure for data storage and analysis. BMC Bioinformatics. Jan 7; 9: 4. [PMID: 18179696]
#108. Deus FH, R Stanislaus, DF Veiga, C Behrens, II Wistuba, JD Minna, HR Garner, SG Swisher, JA Roth, AM Correa, B Broom, K Coombes, A Chang, LH Vogel, JS Almeida (2008) A Semantic Web management model for integrative biomedical informatics. PLoS ONE. Aug 13; 3(8): e2946 [PMID: 18698353].
#118. Diogo F T Veiga, Helena F Deus, Caner Akdemir, Ana T R Vasconcelos and Jonas S Almeida (2009) DASMiner: discovering and integrating data from DAS sources. BMC Syst Biol. 2009 Nov 17; 3: 109. [PMID 19919683].

Lesson learned: more than domain-specific models or tools, integrative research requires a Knowledge Engineering environment. The critical characteristic of that environment is semantic interoperability for both data and tools.

This outcome was anticipated right at the onset of the Web [recall Tim Berners-Lee, “Weaving the Web”].

Key features of a web-based environment:
1. Syntactic interoperability: ability to get the data once told where it is.
2. Semantic interoperability: ability to use the data for a different purpose than the one that dictated its generation.