From Informatics to Bioinformatics
Limsoon Wong
Laboratories for Information Technology, Singapore

What is Bioinformatics?

Themes of Bioinformatics
Bioinformatics = Data Mgmt + Knowledge Discovery
Data Mgmt = Integration + Transformation + Cleansing
Knowledge Discovery = Statistics + Algorithms + Databases

Benefits of Bioinformatics
• To the patient: Better drug, better treatment
• To the pharma: Save time, save cost, make more $
• To the scientist: Better science

From Informatics to Bioinformatics
8 years of bioinformatics R&D in Singapore
[Timeline figure. Axis: 1994 (ISS), 1996, 1998 (KRDL), 2000, 2002 (LIT). Projects shown:]
• Integration Technology (Kleisli)
• MHC-Peptide Binding (PREDICT)
• Protein Interactions Extraction (PIES)
• Gene Expression & Medical Record Datamining (PCL)
• Cleansing & Warehousing (FIMM)
• Gene Feature Recognition (Dragon)
• Venom Informatics

Data Integration A DOE “impossible query”: For each gene on a given cytogenetic band, find its non-human homologs.

Data Integration Results
• Using Kleisli:
    • Clear
    • Succinct
    • Efficient
• Handles
    • heterogeneity
    • complexity

    sybase-add (#name: "GDB", ...);
    create view L from locus_cyto_location using GDB;
    create view E from object_genbank_eref using GDB;
    select #accn: g.#genbank_ref, #nonhuman-homologs: H
    from L as c, E as g,
         {select u
          from g.#genbank_ref.na-get-homolog-summary as u
          where not(u.#title string-islike "%Human%")
            andalso not(u.#title string-islike "%H. sapien%")} as H
    where c.#chrom_num = "22"
      andalso g.#object_id = c.#locus_id
      andalso not (H = { });

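To make the shape of the query above easier to follow, here is a minimal Python sketch of the same logic: join loci to GenBank references, look up homologs per reference, drop human entries, and keep only genes with at least one remaining homolog. This is not Kleisli; the toy tables, the accession strings, and the `get_homolog_summary` function are hypothetical stand-ins for the GDB tables and the remote homolog lookup.

```python
# Toy stand-ins for the two GDB tables used in the query (hypothetical data).
locus_cyto_location = [
    {"locus_id": 1, "chrom_num": "22"},
    {"locus_id": 2, "chrom_num": "22"},
    {"locus_id": 3, "chrom_num": "7"},
]
object_genbank_eref = [
    {"object_id": 1, "genbank_ref": "ACC001"},
    {"object_id": 2, "genbank_ref": "ACC002"},
    {"object_id": 3, "genbank_ref": "ACC003"},
]

# Hypothetical replacement for the remote homolog-summary lookup.
HOMOLOG_SUMMARIES = {
    "ACC001": [{"title": "Mus musculus gene X"}, {"title": "Human gene X"}],
    "ACC002": [{"title": "H. sapiens gene Y"}],
    "ACC003": [{"title": "Rattus norvegicus gene Z"}],
}


def get_homolog_summary(accession):
    """Return the homolog summaries for one accession (empty if unknown)."""
    return HOMOLOG_SUMMARIES.get(accession, [])


def nonhuman_homologs(chrom_num):
    """For each gene on the given chromosome, collect its non-human homologs."""
    results = []
    for c in locus_cyto_location:
        if c["chrom_num"] != chrom_num:
            continue
        for g in object_genbank_eref:
            if g["object_id"] != c["locus_id"]:
                continue
            # Inner "select": filter out human entries by title.
            homologs = [
                u for u in get_homolog_summary(g["genbank_ref"])
                if "Human" not in u["title"] and "H. sapien" not in u["title"]
            ]
            # Mirrors "not (H = { })": skip genes with no non-human homolog.
            if homologs:
                results.append(
                    {"accn": g["genbank_ref"], "nonhuman_homologs": homologs}
                )
    return results
```

Calling `nonhuman_homologs("22")` on the toy data keeps only the first accession, since the second has human homologs only. The nested result (a list of homologs inside each row) is the part that a flat relational query cannot return directly.
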
Data Warehousing
Motivation
• efficiency
• availability
• “denial of service”
• data cleansing
Requirements
• efficient to query
• easy to update
• model data naturally

    {(#uid: 6138971,
      #title: "Homo sapiens adrenergic...",
      #accession: "NM_001619",
      #organism: "Homo sapiens",
      #taxon: 9606,
      #lineage: ["Eukaryota", "Metazoa", …],
      #seq: "CTCGGCCTCGGGCGCGGC...",
      #feature: {
        (#name: "source",
         #continuous: true,
         #position: [(#accn: "NM_001619", #start: 0, #end: 3602, #negative: false)],
         #anno: [(#anno_name: "organism", #descr: "Homo sapiens"), …]),
        …})}

Data Warehousing Results
• Relational DBMS is insufficient because it forces us to fragment data into 3NF.
• Kleisli turns flat relational DBMS into nested relational DBMS.
• It can use flat relational DBMS such as Sybase, Oracle, MySQL, etc. as its updatable complex object store.

    ! Log in
    oracle-cplobj-add (#name: "db", ...);
    ! Define table
    create table GP (#uid: "NUMBER", #detail: "LONG") using db;
    ! Populate table with GenPept reports
    select #uid: x.#uid, #detail: x into GP
    from aa-get-seqfeat-general "PTP" as x using db;
    ! Map GP to that table
    create view GP from GP using db;
    ! Run a query to get title of 131470
    select x.#detail.#title from GP as x where x.#uid = 131470;

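The idea of keeping a whole nested report in one column of a flat table can be sketched in Python with the standard `sqlite3` and `json` modules. This is only an illustration of the storage pattern, not how Kleisli implements it: the JSON encoding, the in-memory SQLite database, and the sample report contents are all assumptions made for the example.

```python
import json
import sqlite3

# In-memory flat relational store with the same two columns as the GP table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE GP (uid INTEGER PRIMARY KEY, detail TEXT)")

# Hypothetical nested report; the field values are made up for illustration.
sample_report = {
    "uid": 131470,
    "title": "example protein report (made-up)",
    "feature": [
        {"name": "source", "position": [{"start": 0, "end": 100}]},
    ],
}


def store_report(connection, report):
    """Store the whole nested report as one serialized value, keyed by uid."""
    connection.execute(
        "INSERT INTO GP (uid, detail) VALUES (?, ?)",
        (report["uid"], json.dumps(report)),
    )


def get_title(connection, uid):
    """Fetch one report by uid and navigate into the nested object."""
    row = connection.execute(
        "SELECT detail FROM GP WHERE uid = ?", (uid,)
    ).fetchone()
    if row is None:
        return None
    return json.loads(row[0])["title"]


store_report(conn, sample_report)
```

Here `get_title(conn, 131470)` plays the role of the final query on the slide: the flat table is only used to locate the row, and the nested structure is recovered after retrieval, so the report never has to be fragmented into normalized tables.
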
Epitope Prediction
TRAP-559 AA

    MNHLGNVKYLVIVFLIFFDLFLVNGRDVQNNIVDEIKYSE
    EVCNDQVDLYLLMDCSGSIRRHNWVNHAVPLAMKLIQQLN
    LNDNAIHLYVNVFSNNAKEIIRLHSDASKNKEKALIIIRS
    LLSTNLPYGRTNLTDALLQVRKHLNDRINRENANQLVVIL
    TDGIPDSIQDSLKESRKLSDRGVKIAVFGIGQGINVAFNR
    FLVGCHPSDGKCNLYADSAWENVKNVIGPFMKAVCVEVEK
    TASCGVWDEWSPCSVTCGKGTRSRKREILHEGCTSEIQEQ
    CEEERCPPKWEPLDVPDEPEDDQPRPRGDNSSVQKPEENI
    IDNNPQEPSPNPEEGKDENPNGFDLDENPENPPNPDIPEQ
    KPNIPEDSEKEVPSDVPKNPEDDREENFDIPKKPENKHDN
    QNNLPNDKSDRNIPYSPLPPKVLDNERKQSDPQSQDNNGN
    RHVPNSEDRETRPHGRNNENRSYNRKYNDTPKHPEREEHE
    KPDNNKKKGESDNKYKIAGGLALLACAGLAYKFVVP
    GAATPYAGEPAPFDETLGEEDKDLDEPEQFRLPEENEWN

Epitope Prediction Results
• Prediction by our ANN model for HLA-A11
    • 29 predictions
    • 22 epitopes
    • 76% specificity
• Prediction by BIMAS matrix for HLA-A*1101
[Figure: number of experimental binders by rank by BIMAS; axis ticks 1, 66, 100; bar labels 19 (52.8%), 5 (13.9%), 12 (33.3%)]

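As a rough illustration of how a matrix-based predictor ranks candidate peptides, the sketch below slides a 9-residue window over a protein sequence and scores each window by multiplying per-position coefficients. It is a generic example of the technique under stated assumptions: the coefficient values, the default of 1.0 for unlisted residues, and the function names are invented here and are not BIMAS values or the ANN model from the slides.

```python
def score_peptide(peptide, matrix, default=1.0):
    """Multiply the coefficient for each (position, residue) pair.

    Pairs missing from the matrix contribute the neutral default.
    """
    score = 1.0
    for pos, residue in enumerate(peptide):
        score *= matrix.get((pos, residue), default)
    return score


def rank_peptides(sequence, matrix, length=9):
    """Score every window of the given length and rank best-first.

    Returns (score, start_index, peptide) tuples; ties keep sequence order.
    """
    scored = []
    for start in range(len(sequence) - length + 1):
        peptide = sequence[start:start + length]
        scored.append((score_peptide(peptide, matrix), start, peptide))
    return sorted(scored, key=lambda item: (-item[0], item[1]))


# Hypothetical coefficients: favour L at position 2 and K at position 9
# (0-based indices 1 and 8). Real matrices cover all 20 residues per position.
TOY_MATRIX = {(1, "L"): 2.0, (8, "K"): 10.0}
```

With the toy matrix, a window such as `ALAAAAAAK` scores 2.0 × 10.0 = 20.0, while windows without the favoured residues stay at 1.0. Comparing such a ranking with experimentally confirmed binders is what the rank-by-BIMAS figure summarizes.
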
Transcription Start Prediction
Transcription Start Prediction Results

Medical Record Analysis
• Looking for patterns that are
    • valid
    • novel
    • useful
    • understandable

Gene Expression Analysis
• Classifying gene expression profiles
    • find stable differentially expressed genes
    • find significant gene groups
    • derive coordinated gene expression

Medical Record & Gene Expression Analysis Results
• PCL, a novel “emerging pattern” method
• Beats C4.5, CBA, LB, NB, TAN in 21 out of 32 UCI benchmarks
• Works well for gene expressions
Cancer Cell, March 2002, 1(2)

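For readers unfamiliar with emerging patterns: an emerging pattern is an itemset whose support grows sharply from one class of records to another. The sketch below computes that growth rate by brute force on toy data. It illustrates the concept only and is not the PCL classifier itself; the item names, the threshold, and the size limit are assumptions chosen for the example.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def growth_rate(itemset, background, target):
    """Ratio of support in the target class to support in the background."""
    s_background = support(itemset, background)
    s_target = support(itemset, target)
    if s_background == 0.0:
        return float("inf") if s_target > 0.0 else 0.0
    return s_target / s_background


def emerging_patterns(background, target, min_growth=2.0, max_size=2):
    """Enumerate small itemsets whose growth rate meets the threshold."""
    items = sorted(set().union(*target))
    found = []
    for size in range(1, max_size + 1):
        for combo in combinations(items, size):
            rate = growth_rate(frozenset(combo), background, target)
            if rate >= min_growth:
                found.append((combo, rate))
    return found


# Hypothetical discretized expression profiles, one set of items per sample.
NORMAL = [{"g1_high"}, {"g1_high", "g2_low"}, {"g2_low"}, {"g3_high"}]
TUMOR = [
    {"g1_high", "g2_low"},
    {"g1_high", "g2_low"},
    {"g1_high", "g2_low", "g3_high"},
    {"g3_high"},
]
```

In the toy data, `g1_high` and `g2_low` are each only 1.5 times more frequent in the tumor samples, but the pair together is 3 times more frequent, which is the kind of combined signal an emerging-pattern method looks for. Brute-force enumeration is used here for clarity and would not scale to real expression data.
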
Protein Interaction Extraction
“What are the protein-protein interaction pathways from the latest reported discoveries?”

Protein Interaction Extraction Results
• Rule-based system for processing free texts in scientific abstracts
• Specialized in
    • extracting protein names
    • extracting protein interactions

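A very small sketch of what rule-based extraction from free text can look like is given below, using one regular-expression rule of the form "protein, interaction verb, protein". It is not the system described on the slide: the verb list and the naive protein-name pattern (letters followed by at least one digit) are assumptions made for illustration and would miss many real protein names.

```python
import re

# Hypothetical list of interaction verbs recognised by the single rule.
INTERACTION_VERBS = (
    "binds",
    "interacts with",
    "phosphorylates",
    "activates",
    "inhibits",
)

# Naive name heuristic: letters followed by a digit, e.g. "p53" or "Cdc42".
PROTEIN = r"[A-Za-z]+\d+[A-Za-z0-9]*"
VERBS = "|".join(re.escape(verb) for verb in INTERACTION_VERBS)
PATTERN = re.compile(rf"\b({PROTEIN})\s+({VERBS})\s+({PROTEIN})\b")


def extract_interactions(text):
    """Return (protein, verb, protein) triples matched by the rule."""
    return [
        (m.group(1), m.group(2), m.group(3))
        for m in PATTERN.finditer(text)
    ]
```

For example, `extract_interactions("Mdm2 binds p53 and promotes its degradation.")` yields one triple linking Mdm2 and p53. A practical system would need a dictionary or a trained recognizer for protein names and many more sentence patterns, including passive and negated forms.
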
Behind the Scene
• Vladimir Bajic
• Vladimir Brusic
• Jinyan Li
• See-Kiong Ng
• Limsoon Wong
• Louxin Zhang
• Allen Chong
• Judice Koh
• SPT Krishnan
• Huiqing Liu
• Seng Hong Seah
• Soon Heng Tan
• Guanglan Zhang
• Zhuo Zhang
and many more: students, folks from geneticXchange, MolecularConnections, and other collaborators…