Data integration for candidate gene prioritization Yves Moreau

  • Slides: 81
Data integration for candidate gene prioritization
Yves Moreau, Computational Systems Biology

Beyond the hairball
  • Networks have become a central concept in biology
  • Initial top-down analyses of omics data resulted in "hairball" descriptions of gene or protein networks
    • High-level properties
    • Scale-free network
  • But what do we do with this?
  • Which methods are available to get actual biological predictions from these multiple sources of data?
Figure: Yeast protein-protein interaction network (Jeong H. et al., Nature, 2001)

Omics data
  • Many other sources of omics information and data are available to help us identify the most interesting candidates for further study
    • ChIP-chip
    • Regulatory motifs
    • Protein motifs
    • Microarray compendia (Oncomine, ArrayExpress, GEO)
    • Protein-protein interaction
    • Gene Ontology
    • KEGG

Genome browsers
  • UCSC genome browser: genome.ucsc.edu
  • Ensembl: www.ensembl.org
  • Federate many other information sources

Gene Ontology
  • Gene Ontology: www.geneontology.org

Pathways
  • Many databases of pathways: KEGG, GenMAPP, aMAZE, etc.

Protein-protein interaction
  • Large databases of protein-protein interactions are becoming available
    • Yeast two-hybrid
    • Coimmunoprecipitation
  • Data is getting cleaned and merged across organisms
    • Ulysses: www.cisreg.ca
    • HiMAP: www.himap.org

Microarray compendia
  • Multiple large microarray data sets (compendia) are available that give a broad overview of general biological processes in different organisms
    • Su et al., Son et al.: human and mouse tissues
    • Hughes et al.: yeast mutants
    • Gasch et al.: yeast stress
    • AtGenExpress, CAGE: Arabidopsis
  • Available through microarray repositories
    • ArrayExpress
    • Gene Expression Omnibus

Literature abstracts
  • PubMed
  • Entrez GeneRIF: www.ncbi.nlm.nih.gov/entrez/
  • PubGene: www.pubgene.org

Multisource networks
  • Some tools integrate multiple types of data to browse a network of genes
  • bioPIXIE (yeast): pixie.princeton.edu
  • STRING: string.embl.de

So much data... So little time...

Candidate gene prioritization

Human genetics identifies key genes in monogenic and multifactorial diseases
Diagram labels: Patients with congenital & acquired disorders; CGH microarrays; Molecular karyotyping; Databasing; Statistical analysis; Location of chromosomal imbalances; Prioritized candidate genes
  • Map chromosomal abnormalities
  • Improved diagnosis
  • Discover new disease-causing genes and explain their function

Candidate gene prioritization
Diagram labels: High-throughput genomics; Data analysis; Information sources; Candidate genes; Candidate prioritization; Validation
  • Identify key genes and their function
  • Emerging method
  • Integration of multiple types of information

Prioritization by text mining

Prioritization by text mining
Candidate genes: ABLIM1, ACSL5, ADD3, ADRA2A, ADRB1, CASP7, CSPG6, DCLRE1A, DUSP5, GFRA1, GPAM, GSTO1, HABP2, HSPA12A, MXI1, NHLRC2, NRAP, PDCD4, PNLIPRP1, RBM20, SHOC2, SLK, SMNDC1, SORCS1, TCF7L2, TDRD1, TECTB, TRUB1, VTI1A, VWA2, XPNPEP1, ZDHHC6
Phenotype features: Microcephaly, Micrognathia, Low-set ears, Microphthalmia, Downslanting palpebral fissures, Hypertelorism, Long philtrum, Cleft lip, Short neck, Pectus excavatum, Syndactyly, Heart defects, Cryptorchidism, Mental retardation

Gene to concept association
Genes: ENSG000001, ENSG000002, ..., ENSG00000109685, ..., ENSG00000024999, ENSG00000025000
Concept: Microcephaly
  • Microcephaly is overrepresented in the document set for the WHSC1 gene

Prioritization by virtual pulldown

Prioritization by virtual protein-protein interaction pulldown and text mining

Can the candidate be assigned to a protein complex?

Are there any proteins involved in diseases similar to the patient phenotype in the complex?

How many? How similar?

Prioritization by example

Prioritization by example
  • Several cardiac abnormalities mapped to 3p22-25
    • Atrioventricular septal defect
    • Dilated cardiomyopathy
    • Brugada syndrome
  • Candidate genes (“test set”)
    • 3p22-25, 210 genes
  • Known genes (“training set”)
    • 10-15 genes: NKX2.5, GATA4, TBX5, TBX1, JAG1, THRAP, CFC1, ZFPM2, PTPN11, SEMA3E
    • Congenital heart defects (CHD)
  • High-scoring genes
    • ACVR2, SHOX2: linked to heterotaxy and Turner syndrome (often associated with CHD)
    • Plexin-A1: reported as essential for chick cardiac morphogenesis
    • Wnt5A, Wnt7A: neural crest guidance

Multiple sources of information
Diagram labels: Annotations; A priori; Data fusion; Interactions; Vectors

Endeavour
http://www.esat.kuleuven.ac.be/endeavour

Endeavour architecture
Diagram labels: Java client & Java Web Start; SOAP/XML; Web server (Apache & Tomcat & Axis); Java MySQL driver; DB; Perl MySQL driver; Java RMI; Linux cluster (Perl scripts)

Data fusion with order statistics
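The slide gives only the title, so the following is a minimal sketch of one common way to fuse rankings with order statistics, not necessarily the exact Endeavour implementation. It assumes each data source yields a rank ratio in [0, 1] for a gene (rank divided by the number of ranked genes) and combines them into a Q statistic, the joint cumulative distribution of the order statistics of independent uniform variables. The function name `order_statistic_q` is hypothetical.

```python
from math import factorial


def order_statistic_q(rank_ratios):
    """Combine per-source rank ratios (each in [0, 1]) into one Q statistic.

    Q is the probability that N independent uniform rank ratios would all be
    at least as good as the observed ones; smaller Q means a better gene.
    """
    r = sorted(rank_ratios)          # r[0] <= r[1] <= ... <= r[n-1]
    n = len(r)
    v = [1.0]                        # V_0 = 1
    for k in range(1, n + 1):
        x = r[n - k]                 # r_{N-k+1} in 1-based notation
        # V_k = sum_{i=1..k} (-1)^(i-1) * V_{k-i} / i! * x^i
        v.append(sum((-1) ** (i - 1) * v[k - i] / factorial(i) * x ** i
                     for i in range(1, k + 1)))
    return factorial(n) * v[n]


# A gene ranked near the top in every source gets a much smaller Q
# than a gene ranked in the middle of every list.
good = order_statistic_q([0.01, 0.02, 0.05])
average = order_statistic_q([0.5, 0.6, 0.4])
print(good < average)
```

Genes would then be re-ranked by Q across all sources. Because the statistic works on rank ratios only, sources with very different score scales can be combined without normalizing the scores themselves.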

Training of an attribute submodel
Diagram: annotation terms (Term 1 ... Term t) of the training genes (Training gene 1 ... Training gene n)

Annotations   p-value
Term 1        0.00054
Term 4        0.00072
…             …
Term t        0.00457

  • A term is over-represented if its frequency inside the training set is significantly larger than its frequency over the genome
  • Gene Ontology, InterPro, KEGG & EST submodels
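The over-representation test above can be sketched as follows. The slide does not name the statistical test, so the hypergeometric upper tail used here is an assumption, and the function name and the counts in the usage example are hypothetical.

```python
from math import comb


def overrepresentation_p_value(genome_size, genome_with_term,
                               training_size, training_with_term):
    """Upper-tail hypergeometric p-value for one annotation term.

    Probability of seeing at least `training_with_term` annotated genes when
    drawing `training_size` genes at random from a genome of `genome_size`
    genes, `genome_with_term` of which carry the term.
    """
    upper = min(training_size, genome_with_term)
    total = comb(genome_size, training_size)
    tail = sum(
        comb(genome_with_term, k)
        * comb(genome_size - genome_with_term, training_size - k)
        for k in range(training_with_term, upper + 1)
    )
    return tail / total


# Hypothetical counts: 200 of 20000 genes carry the term genome-wide,
# and 5 of 12 training genes carry it.
print(overrepresentation_p_value(20000, 200, 12, 5))
```

Terms with a small p-value would form the submodel, and a candidate gene could then be scored by how many of those terms it carries.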

Training of a vector submodel
  • Vectors
  • A collection of profiles (here numerical vectors) can be represented by the average profile
  • Microarray, motif & text submodels
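A minimal sketch of the average-profile idea: the training profiles are averaged component by component, and a candidate is compared with that average. The slide does not say which similarity measure is used, so cosine similarity here is an assumption, and all names and values are illustrative.

```python
from math import sqrt


def average_profile(profiles):
    """Component-wise mean of equally long numerical profiles."""
    n = len(profiles)
    return [sum(column) / n for column in zip(*profiles)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy expression profiles for three training genes (hypothetical values).
training = [[2.0, 0.5, 1.0], [1.8, 0.7, 1.2], [2.2, 0.3, 0.8]]
model = average_profile(training)
candidate = [1.9, 0.6, 1.1]
print(cosine_similarity(model, candidate))
```

Candidates with a higher similarity to the average profile would be ranked higher by this submodel.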

OMIM & GO cross-validation
  • Diseases: Alzheimer’s disease, amyotrophic lateral sclerosis (ALS), anemia, breast cancer, cardiomyopathy, cataract, Charcot-Marie-Tooth disease, colorectal cancer, deafness, diabetes, dystonia, Ehlers-Danlos, epilepsy, hemolytic anemia, ichthyosis, leukemia, lymphoma, mental retardation, muscular dystrophy, myopathy, neuropathy, obesity, Parkinson’s disease, retinitis pigmentosa, spastic paraplegia, spinocerebellar ataxia, Usher syndrome, xeroderma pigmentosum, Zellweger syndrome
  • Pathways
    • Wnt pathway members (GO:0016055: Wnt receptor signaling pathway)
    • Notch pathway members (GO:0007219: Notch signaling pathway)
    • EGFR pathway members (GO:0007173: epidermal growth factor receptor signaling pathway)

Cross-validation
Repeat
  • For each gene
  • For each disease or pathway
Compute average rank
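The loop above can be sketched as a leave-one-out procedure: each known gene is hidden in turn, the model is trained on the remaining genes, and the hidden gene is ranked among random genes. The number of random genes (99), the `score` callback and the toy data are assumptions made for illustration; the slides do not specify them.

```python
import random


def leave_one_out_ranks(known_genes, genome, score, n_random=99, seed=0):
    """Rank each left-out known gene among `n_random` random genome genes.

    `score(training, gene)` must return a number that is higher for genes
    that look more like the training genes.
    """
    rng = random.Random(seed)
    known = set(known_genes)
    background = [g for g in genome if g not in known]
    ranks = []
    for left_out in known_genes:
        training = [g for g in known_genes if g != left_out]
        test_set = rng.sample(background, n_random) + [left_out]
        ordered = sorted(test_set, key=lambda g: score(training, g), reverse=True)
        ranks.append(ordered.index(left_out) + 1)   # rank 1 = best
    return ranks


# Toy example: genes are integers and "similar" means numerically close.
def toy_score(training, gene):
    return -min(abs(gene - t) for t in training)


genome = [0, 1, 2, 3] + list(range(10, 210))
ranks = leave_one_out_ranks([0, 1, 2, 3], genome, toy_score)
print(sum(ranks) / len(ranks))   # average rank of the left-out genes
```

Running this once per disease or pathway and averaging the resulting ranks gives the summary figure named on the slide.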

Rank ROC curves

Novel DiGeorge candidate
  • D. Lambrechts, P. Carmeliet, KUL Cardiovascular Biol.
  • TBX1 critical gene in typical 3 Mb aberration
  • Atypical 2 Mb deletion (58 candidates)

YPEL1
  • YPEL1 is expressed in the pharyngeal arches during arch development
  • YPEL1 knockdown (KD) zebrafish embryos exhibit typical DGS-like features

Congenital heart disease genes
  • B. Thienpont, K. Devriendt, J. Vermeesch, KUL CME
  • 60 patients without diagnosis
    • Congenital heart defect & chromosomal phenotype
      • 2nd major congenital anomaly
      • Or mental retardation/special education
      • Or > 3 minor anomalies
  • Array Comparative Genomic Hybridization
    • 1 Mb resolution
  • 11 anomalies detected
    • 5 deletions
    • 2 duplications
    • 3 complex rearrangements
    • 1 mosaic monosomy 7

Candidate regions
  • 4 regions with known critical genes, 6 new regions, 80 candidate genes

aberration                                             gene
del(5)(q23)                                            ?
del(5)(q35.1)                                          NKX2.5
del(5)(q35.2qter)                                      NSD1
del(14)(q22.1q23.1)                                    ?
del(22)(q12.2)                                         ?
dup(22)(q11)                                           TBX1
dup(19)(p13.12p13.11)                                  ?
del(9)(q34.3qter), dup(20)(q13.33qter)                 NOTCH1, EHMT1
del(13)(q31.1q31.3), dup(13)(q31.3q33.2), inv(13)      ?
del(4)(q34.3q35.1), dup(4)(q34), inv(4)                ?

Gene prioritization
Region: del(14)(q22.1q23.1), critical gene: ?
Table: candidate genes ranked (1 to 80) separately for each data source
Data sources (columns): Expression; KEGG pathways; PubMed text mining; Protein domains; Cis-regulatory module; BLAST; Protein interactions
Genes appearing in the rankings: ACTR10, ARID4A, BMP4, CDKN3, CGRRF1, CNIH, DAAM1, DACT1, DDHD1, DLG7, ERO1L, EXOC5, FBXO34, GCH1, GMFB, GNPNAT1, GPR135, KIAA0586, KIAA1344, KTN1, OTX2, PLEKHC1, PSMA3, PSMC6, PTGDR, PTGER2, RTN1, SAMD4, SOCS4, STYX, TBPL2, TIMM9, WDHD1

Biological validation
  • Candidates currently being validated in zebrafish
    • Screen about 50 candidates for heart expression at different developmental stages
    • Morpholino knockdowns of candidates expressed in hearts
      • Screen for heart phenotypes

Putting it all together...

Integrating gene prioritization into daily biological work
  • Gene prioritization is “interesting”...
    • Needs also to be integrated with the “network” view of systems biology
    • Still left with a large number of candidates
  • How can we bring it closer to the daily routine of the wet bench?
    • Bioinformatics tools should not be trusted blindly
    • Need for reinterpretation and “ownership”
  • “Wikis” can be used as “collaborative electronic notebooks”
    • Same technology as Wikipedia
    • Addition of database back-end for structured information
  • http://homes.esat.kuleuven.be/~rbarriot/genewiki/index.php/CHD:Home
  • http://homes.esat.kuleuven.be/~rbarriot/genewiki/index.php/CHDGene:YM70

K.U.Leuven ESAT-SCD-Bioi
K.U.Leuven, ESAT-SCD • Bert Coessens • Leo Tranchevent • Yu Shi • Tijl De Bie • Roland Barriot • Liesbeth Van Oeffelen • Bart De Moor • Yves Moreau
K.U.Leuven, CME-UZ • Joris Vermeesch • Bernard Thienpont • Jeroen Breckpot • Koen Devriendt
K.U.Leuven, VIB4 • Stein Aerts • Bassem Hassan
K.U.Leuven, DME-HGL • Peter Van Loo • Peter Marynen
K.U.Leuven, VIB • Diether Lambrechts • Sunit Maity • Frederik De Smet • Peter Carmeliet
T.U. Denmark, CBS • Kasper Lage • Olof Karlsberg • Soeren Brunak
http://www.esat.kuleuven.ac.be/endeavour

Offline demo
  • Chediak-Higashi syndrome (OMIM: 214500)
    • Psychomotor retardation
    • Caused by mutation in LYST gene
  • Syndrome mapped to 1q42-qter
  • Gene prioritization
    • Candidates from 1q42-qter (353 candidates)
    • Training genes: Gene Ontology category
      • Brain development GO:0007420 (60 genes)
    • LYST gene ranks 9/353

Development track
  • 24/7 availability
  • Whole-genome scoring
  • Selection of relevant submodels
  • Support for more organisms
    • Currently separate Drosophila and Arabidopsis versions
    • Mouse, fly, zebrafish, worm, yeast
  • Cross-species data fusion
    • Ortholog mapping
    • Meta-genes
  • Flexible statistical back-end (R)

Cardiac abnormalities
  • Several cardiac abnormalities mapped to 3p22-25
    • Atrioventricular septal defect
    • Dilated cardiomyopathy
    • Brugada syndrome
  • Training set (10-15 genes)
    • Congenital heart defects (CHD)
  • Test set
    • 3p22-25, 210 genes
  • High-scoring genes
    • ACVR2, SHOX2: linked to heterotaxy and Turner syndrome (often associated with CHD)
    • Plexin-A1: reported as essential for chick cardiac morphogenesis
    • Wnt5A, Wnt7A: neural crest guidance

Cross-validation
  • Area Under the Curve (AUC) is a measure of performance
  • Plot on ROC curve (axes: sensitivity against 1 - specificity, each from 0 to 1)
    • Sensitivity = % of left-out genes above threshold
    • Specificity = % of genome below threshold
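Using the two definitions above, a rank ROC curve and its AUC can be sketched from the ranks of the left-out genes. The function names are hypothetical, and the AUC is approximated with the trapezoidal rule, which is an assumption about how the area is computed.

```python
def rank_roc_points(ranks, list_size):
    """ROC points (1 - specificity, sensitivity), one per rank threshold.

    `ranks` holds the rank (1 = best) of each left-out gene in a ranked
    list of `list_size` genes.
    """
    points = []
    for threshold in range(list_size + 1):
        # Sensitivity: fraction of left-out genes ranked above the threshold.
        sensitivity = sum(r <= threshold for r in ranks) / len(ranks)
        # 1 - specificity: fraction of the list ranked above the threshold.
        one_minus_specificity = threshold / list_size
        points.append((one_minus_specificity, sensitivity))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as (x, y) points sorted by x."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical ranks of five left-out genes in lists of 100 genes.
points = rank_roc_points([1, 3, 2, 10, 40], 100)
print(area_under_curve(points))
```

An AUC close to 1 means the left-out genes are consistently ranked near the top, while 0.5 corresponds to random ranking.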

CGH microarrays

CGH microarrays
  • Comparative Genomic Hybridization microarray
  • Genomic DNA, NOT messenger RNA!
Figure: intensity ratio plotted against position along the chromosome, showing a duplication and a deletion
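As a rough illustration of how the intensity-ratio plot is read, the sketch below labels each clone along a chromosome from its test/reference intensity ratio and groups consecutive clones with the same label. The log2 thresholds of +0.3 and -0.3 are illustrative assumptions, not values from the slides, and real analyses typically use dedicated segmentation methods.

```python
from math import log2


def call_clones(ratios, gain=0.3, loss=-0.3):
    """Label each clone by its log2(test / reference) intensity ratio."""
    calls = []
    for ratio in ratios:
        value = log2(ratio)
        if value >= gain:
            calls.append("duplication")
        elif value <= loss:
            calls.append("deletion")
        else:
            calls.append("normal")
    return calls


def group_segments(calls):
    """Merge consecutive identical calls into (first_index, last_index, call)."""
    segments = []
    for index, call in enumerate(calls):
        if segments and segments[-1][2] == call:
            segments[-1] = (segments[-1][0], index, call)
        else:
            segments.append((index, index, call))
    return segments


# Hypothetical ratios for five clones ordered by chromosomal position.
calls = call_clones([1.0, 1.5, 1.5, 1.0, 0.5])
print(group_segments(calls))
```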

CGHGate Demo

Functional genomics with array CGH

Bioinformatics research at ESAT-SCD
  • Microarray data analysis
    • Bioconductor, MIAME, MAGE-ML
      • RMAGEML and biomaRt packages
    • Clustering
      • Adaptive quality-based clustering
      • Gibbs biclustering and query-driven biclustering
    • Predictive models
      • Ovarian cancer
      • Least-squares support vector machine
      • M@cBeth
    • Analysis of compendia
      • Compendium of Arabidopsis Gene Expression
  • Array CGH
    • New experimental designs and statistical analysis
  • Cis-regulatory sequence analysis
    • Gibbs sampling for motif finding
      • MotifSampler
    • Module discovery
      • TOUCAN (genetic algorithms, branch-and-bound)

Interest and expertise at ESAT-SCD
  • Gene prioritization
    • Genomic data fusion
    • Endeavour
  • Aberration maps from array CGH
  • Text mining
    • Bag of words
    • Chromosomal aberration maps from literature
    • ABandApart
  • Probabilistic graphical models
    • Applications of Gibbs sampling
    • Bayesian networks and knowledge incorporation

Second SymBioSys Workshop
Leuven, May 29, 2007
Computational Systems Biology

Program
  • 10:00-10:15: Welcome
  • 10:15-10:45: Tutorial 1: Microarray preprocessing, Kristof Engelen, CMPG
  • 10:45-11:15: Coffee break
  • 11:15-12:00: Tutorial 2: Bioinformatic analysis of microRNAs, Stefan Lehnert, GEU
  • 12:00-13:00: Lunch
  • 13:00-14:00: Keynote: Genomic profiling of structural variation in health and disease, Joris Veltman, Department of Human Genetics, Radboud University Nijmegen Medical Centre, The Netherlands
  • 14:00-14:45: Tutorial 3: Proteomic Analysis from Gel to Mass Spectrometry: Possibilities, Limitations, and Recent Developments, Wannes D’Hertog, LEGENDO & Raf Van de Plas, ESAT-SCD
  • 14:45-15:15: Coffee break
  • 15:15-16:00: Tutorial 4: Data integration for candidate gene prioritization, Yves Moreau, ESAT-SCD
  • 16:00-17:00: Reception

Announcements
  • SymBioSys group picture @ 12:55
    • BE THERE!
  • Lunch upstairs on 6th floor
  • Tutorial 3 by Wannes D’Hertog and Raf Van de Plas
  • If you are not a SymBioSys member and want to be informed of SymBioSys seminars, please mail Edwin Walsh: edwin.walsh@esat.kuleuven.be