Mining your Personal Genome Jieming Chen Yale University

Mining your Personal Genome Jieming Chen Yale University CBB 752 a 12

What is Personal Genomics?
• "Personal genomics is the branch of genomics concerned with the sequencing and analysis of the genome of an individual" -- Wikipedia
• Was this not possible before?
  - Genetics vs. genomics
  - Post-Human-Genome-Project (HGP) genomics

[Figure: timeline with marks at 2000, 2003, 2006, 2008 and 2010. Source: Nature (2010)]

Personal Genomics
1. From basic research, to clinic, then to the masses
2. Tools to mine your own genome
3. Ethics and Privacy

Increasingly “personalized” genomics… GENOMICS IN BASIC RESEARCH

Before mass sequencing - mass genotyping
• Genotyping
  - Determination of the genotypes of parts (usually genetic variations) of an individual’s genome using biological assays
• SNP arrays, hybrid arrays
  - SNP (single nucleotide polymorphism) genotyping
  - Main players: Affymetrix vs. Illumina
Affymetrix: http://www.affymetrix.com/
Illumina: http://www.illumina.com/

Before mass sequencing - mass genotyping
• Array CGH (comparative genomic hybridization)
  - CNV (copy number variation) genotyping
  - Main players: Agilent vs. NimbleGen
  - Main applications: detection of genomic abnormalities in cancer; detection of large structural aberrations (especially at the chromosomal level)

SNP arrays
• Affymetrix
• Illumina
• Probes on microarray technology
[Figure: successive array generations with growing SNP counts. Size labels as extracted: 1K, 10K, 50K, 100K, 240K, 250K, 300K, 500K, 550K, 610K, 650K, 1M. Product and enzyme labels as extracted: Xba, Hind, Nsp, Sty, SNP 5.0, SNP 6.0, Axiom, Omni]
Affymetrix Axiom Solutions: http://www.affymetrix.com/

SNP selection in array design
1) SNP quantity
  - limited by microarray technology
2) SNP content
  - random probes or probes for ‘tag’ SNPs
  - random probes are produced by specific enzymes in some array technologies
  - a ‘tag’ SNP is one that represents a group of SNPs in a genomic region, owing to a phenomenon called linkage disequilibrium (LD)
  - LD refers to the non-random association of alleles at 2 or more loci
  - a haplotype is a configuration of alleles that are transmitted together (or assumed to be)
  - one can, in theory, predict the larger group of SNPs from a smaller set of tag SNPs

Linkage Disequilibrium
Parent 1 (A B) x Parent 2 (a b)
• High LD -> no recombination (r^2 = 1)
  - transmitted haplotypes: A B or a b
  - SNP 1 “tags” SNP 2
• Low LD -> recombination
  - many possibilities: A B, a b, A b, a B, etc.
ASHG 2008 HapMap Tutorial: http://hapmap.ncbi.nlm.nih.gov/tutorials.html.en
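
The r^2 value quoted on the LD slide can be computed from haplotype counts. The sketch below is illustrative only: the function name `ld_r2` and the counts are hypothetical, and it uses the standard definitions D = f(AB) - f(A)f(B) and r^2 = D^2 / (f(A)(1-f(A))f(B)(1-f(B))).

```python
def ld_r2(hap_counts):
    """Return (D, r2) for two biallelic SNPs from haplotype counts.

    hap_counts maps two-letter haplotypes ("AB", "Ab", "aB", "ab") to counts.
    """
    total = sum(hap_counts.values())
    if total == 0:
        raise ValueError("no haplotypes supplied")
    freq = {h: hap_counts.get(h, 0) / total for h in ("AB", "Ab", "aB", "ab")}
    p_a = freq["AB"] + freq["Ab"]   # frequency of allele A at SNP 1
    p_b = freq["AB"] + freq["aB"]   # frequency of allele B at SNP 2
    d = freq["AB"] - p_a * p_b      # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both SNPs must be polymorphic")
    return d, d * d / denom


# Hypothetical counts: only A B and a b are seen (no recombinants).
high_ld = ld_r2({"AB": 50, "ab": 50})
# Hypothetical counts: all four haplotypes are equally common.
low_ld = ld_r2({"AB": 25, "Ab": 25, "aB": 25, "ab": 25})
```

With only the two parental haplotypes present, r^2 is 1 and either SNP predicts the other; with all four haplotypes equally frequent, r^2 is 0 and a tag SNP carries no information about its neighbour.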

The International HapMap Project
• Largely exploited the idea of haplotypes and LD
  - reduce cost (sequencing is expensive)
  - capitalize on microarray technology
• Involved Illumina, Affymetrix, >20 institutions worldwide
• HapMap 1 (2003) and HapMap 2 (2005)
  - 4 populations (270 individuals): CEU (NW European from Utah), CHB (Han Chinese from Beijing), JPT (Japanese from Tokyo), YRI (Yoruban from Nigeria)
• HapMap 3 (2010)
  - 11 populations (4+7, 1301 individuals)

The International HapMap Project
www.hapmap.org
• Provided the foundation for future human genomic projects:
  - maturation of the microarray technology
  - tool development from industry and academia
  - the use of common variations in disease studies and genome-wide association studies (GWAS)
  - population-specific genetic differences
  - samples
  - consent and ethical issues
• Major limitations:
  1) coverage (the entire genome is not covered)
  2) rare variants are unlikely to be uncovered
  3) population-based genome-wide studies

Even with limited information, genomics is getting “personalized”…
Basic
• Human reference genome refinement
• Human evolution and natural selection
• Comparative genomics
Ancestry of individuals
• Population structure
• Human migration route
• Haplotyping
• Linguistics
Clinical applications
• Pharmacogenetics/genomics
• Disease associations
ETC……
HUGO PASNP Consortium (2009), Science

Heralding the personal genomes
• HapMap 3 draft 1 came out in 2009; the paper was published in 2010
• Venter genome (2007) and Watson genome (2008)
• Faster, cheaper and more accurate sequencing technologies
Transitioning into personal genomes
• 2009-2011: the 1000 Genomes Project sequenced 1092 genomes from 14 different populations

Further into the personal genome
• Beyond simply sequencing the personal genome
• If a family trio is sequenced (mum, dad, child), one can potentially phase the variations of the child into its maternal and paternal alleles.*
• Phasing refers to the determination of the haplotype of an individual’s sequence.
• It can be done experimentally (not feasible for large-scale phasing) or computationally.
• Typical computational phasing algorithms use HMMs (e.g. BEAGLE, Browning & Browning 2007, AJHG) or EM (e.g. fastPHASE, Scheet & Stephens 2006, AJHG).
*Note that phasing can also be done with unrelated individuals, but you won’t know which chromosomes are maternal and which are paternal.

Phasing
Simple example of phased sequence of the child (as opposed to ‘unphased’, highlighted black)
[Figure: example haplotypes for father, mother and child. Sequences as extracted: A B c D; a B c d; A B C d; child genotypes A/a B/B c/C d/d; a B c d]

Parent 1        Parent 2        Child           Informative to phase child’s genome?
Any             Any             Homozygous      Yes
Homozygous      Any             Heterozygous    Yes
Heterozygous    Homozygous      Heterozygous    Yes
Heterozygous    Heterozygous    Heterozygous    No

Adapted from: http://www.chromosomechronicles.com/2009/09/30/use-familysnp-data-to-phase-your-own-genome/
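
The trio rule on this slide can be sketched per site as follows. This is a minimal illustration, not the method of BEAGLE or fastPHASE: the function name `phase_site` and the example genotypes are hypothetical, and each genotype is assumed to be a two-character string at a biallelic site.

```python
def phase_site(father, mother, child):
    """Phase one biallelic site of a child from trio genotypes.

    Each genotype is a 2-character string such as "Aa". Returns
    (paternal_allele, maternal_allele), or None when the site is
    uninformative or the genotypes are inconsistent with inheritance.
    """
    first, second = child
    candidates = set()
    # Try both assignments of the child's two alleles to the two parents.
    for paternal, maternal in ((first, second), (second, first)):
        if paternal in father and maternal in mother:
            candidates.add((paternal, maternal))
    if len(candidates) == 1:
        return candidates.pop()
    return None


# Hypothetical sites: (father, mother, child)
homozygous_child = phase_site("Aa", "Aa", "AA")
one_parent_homozygous = phase_site("AA", "Aa", "Aa")
all_heterozygous = phase_site("Aa", "Aa", "Aa")
```

A site stays unphased only when both assignments are possible, which for a heterozygous child happens exactly when both parents are heterozygous as well.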

Allele-specific binding (ASB) and expression (ASE)
Possible causes for ASB/ASE
1) Epigenetic effects, e.g. imprinting, where methylation silences a maternal/paternal gene
2) Genetic variations (such as SNPs) disrupting a binding motif or modifying a gene on a single parental haplotype
3) Random mono-allelic expression/binding
Clinical examples
1) Angelman Syndrome – maternal gene(s) on chromosome 15 inactivated or deleted, paternal gene imprinted
2) Prader-Willi Syndrome – paternal gene(s) on chromosome 15 inactivated or deleted, maternal gene imprinted
Using a phased genome to study ASB and ASE
• Integrate phased sequence with ChIP-seq (binding) and RNA-seq (expression) data to obtain allele-specific information in binding and expression (Rozowsky J et al. 2011)
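
One simple way to flag a candidate allele-specific site is to ask whether reads covering a heterozygous SNP depart from the 50:50 split expected when both alleles are equally active. The sketch below assumes an exact two-sided binomial test; the slide does not say which test the cited study uses, and the function name and read counts are hypothetical.

```python
from math import comb


def binomial_two_sided_p(ref_reads, alt_reads, p=0.5):
    """Exact two-sided binomial p-value for allelic imbalance at one site.

    Sums the probabilities of all outcomes that are no more likely than
    the observed split, under a null reference-allele fraction p.
    """
    n = ref_reads + alt_reads
    if n == 0:
        raise ValueError("site has no reads")
    probs = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    observed = probs[ref_reads]
    # Small relative tolerance so floating-point ties count as equal.
    tail = sum(q for q in probs if q <= observed * (1 + 1e-9))
    return min(1.0, tail)


# Hypothetical read counts at two heterozygous SNPs.
balanced_p = binomial_two_sided_p(10, 10)
skewed_p = binomial_two_sided_p(18, 2)
```

A real analysis would also need to handle reference-mapping bias and correct for the number of sites tested; neither is covered here.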

“Personalization in progress… Watch this space” PERSONAL GENOMICS IN CLINICAL RESEARCH

Personal genomics in Clinic
Some areas of interest to clinicians that genomics can potentially improve:
• Disease prediction
• Pharmacogenetics/genomics
• Response to therapy
• Patient care (personalized environmental and epigenetic information, patient data privacy etc.)
• Personalized medicine and healthcare
Examples of some genomic technologies in clinical research
1) Genome-Wide Association Studies
2) Exome sequencing
3) Pharmacogenetics/genomics
4) Gene expression profiles via RNA-seq
McCarthy et al. 2008

Genome-Wide Association Studies (GWAS)
• The first successful GWAS was done at Yale in 2005, for age-related macular degeneration (AMD) (Klein R. et al. 2005, Science)
  - 96 cases, 50 controls, 116K SNPs

GWAS
• Propelled by HapMap and microarray technology
• Hypothesis-free
• Main aims:
  1) to find the molecular pathways/mechanisms of complex diseases/traits
  2) to find genetic markers that these phenotypes are associated with
• Common-disease-common-variant hypothesis
  - phenotypes result from the cumulative effects of a number of common variants, each with at best a modest effect size
McCarthy et al. 2008, Nature Reviews

GWAS
• Usually SNP-based
• Conduct an association test for each SNP between cases and controls to see if there is a significant difference between the 2 cohorts.

Allele    # Cases      # Controls
A         n_A,case     n_A,ctrl
B         n_B,case     n_B,ctrl

where [formula missing in source] and n is the minor allele frequency.
McCarthy et al. 2008, Nature Reviews
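
The slide's own formula did not survive, so the following is only an assumed stand-in: a standard allelic chi-square test (1 degree of freedom) and odds ratio on the 2x2 allele-count table above. The function name and all counts are hypothetical.

```python
from math import erfc, sqrt


def allelic_association(a_case, a_ctrl, b_case, b_ctrl):
    """Chi-square test (1 df) and odds ratio for a 2x2 allele-count table.

    Arguments are allele counts: allele A in cases and controls, then
    allele B in cases and controls. Returns (chi2, p_value, odds_ratio).
    """
    n = a_case + a_ctrl + b_case + b_ctrl
    row_a, row_b = a_case + a_ctrl, b_case + b_ctrl
    col_case, col_ctrl = a_case + b_case, a_ctrl + b_ctrl
    if 0 in (row_a, row_b, col_case, col_ctrl):
        raise ValueError("every row and column needs a non-zero total")
    cross = a_case * b_ctrl - a_ctrl * b_case
    chi2 = n * cross**2 / (row_a * row_b * col_case * col_ctrl)
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = erfc(sqrt(chi2 / 2))
    if a_ctrl * b_case == 0:
        odds_ratio = float("inf")
    else:
        odds_ratio = (a_case * b_ctrl) / (a_ctrl * b_case)
    return chi2, p_value, odds_ratio


# Hypothetical counts: 100 cases and 100 controls, i.e. 200 alleles each.
chi2, p_value, odds_ratio = allelic_association(120, 80, 80, 120)
```

In a genome-wide scan this test is repeated for every SNP on the array, which is why the resulting p-values need a far stricter threshold than in a single test.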

GWAS Limitations
• Although termed “whole genome”, GWAS to date work mostly with microarray technology
  - arrays use ‘tag SNPs’, which are in LD with many other SNPs
  - GWAS may not (and typically do not) find the causative variant
• High number of false positives with array-based GWAS
  - currently, GWAS variants explain only a small fraction of the genetic risk of common diseases
• Heading towards sequencing-based GWAS, especially for uncommon or rare variants

GWAS Limitations (cont’d)
• Results can be population-specific, e.g. Type 2 diabetes risk allele frequencies decrease from Sub-Saharan Africa through Europe to East Asia
However, GWAS did provide new insights into novel disease-associated pathways and mechanisms, for instance in AMD.
Catalog of GWAS: http://www.genome.gov/26525384
Chen R et al. (2012), PLoS Genetics

Pharmacogenetics/genomics
• Pharmacogenetics
  - the study of genetic variation underlying individual patient responses to drugs, conventionally in a single gene or a small set of genes
• Pharmacogenomics
  - the large-scale/genome-wide study of genetic variation underlying individual patient responses to drugs

Interethnic variations in drug responses
• Warfarin is a classic example.
  - a very widely used anticoagulant and one of the most well-studied drugs
  - extremely difficult to dose because of a narrow therapeutic window
  - genes with haplotypes that affect dosage: VKORC1 and CYP2C9
  - warfarin sensitivity (on average): Asians > Caucasians > African Americans
Rettie A & Tai G (2006), Molecular Interventions Review

Quantifying interethnic variation in the genome: an application
• A popular measure in population genetics is the fixation index, F_ST, which essentially measures population differentiation.
Chen J et al. (2010), Pharmacogenomics
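
As a rough illustration of what F_ST measures, the sketch below computes it for one biallelic SNP from per-population allele frequencies, using the heterozygosity form F_ST = (H_T - H_S) / H_T with equal population weights. This is one common textbook estimator, assumed here for illustration; it is not necessarily the estimator used in the cited paper, and the frequencies are hypothetical.

```python
def fst_biallelic(freqs):
    """F_ST for one biallelic SNP from per-population allele frequencies.

    H_S is the mean expected heterozygosity within populations and H_T
    the expected heterozygosity of the pooled population (equal weights).
    """
    if not freqs:
        raise ValueError("need at least one population")
    p_bar = sum(freqs) / len(freqs)
    h_t = 2 * p_bar * (1 - p_bar)
    if h_t == 0:
        return 0.0  # SNP is monomorphic everywhere: no differentiation
    h_s = sum(2 * p * (1 - p) for p in freqs) / len(freqs)
    return (h_t - h_s) / h_t


# Hypothetical allele frequencies in two populations.
identical = fst_biallelic([0.5, 0.5])
fixed_difference = fst_biallelic([0.0, 1.0])
intermediate = fst_biallelic([0.2, 0.8])
```

Values run from 0 (identical allele frequencies) to 1 (each population fixed for a different allele), so SNPs with high F_ST in drug-response genes are natural candidates for interethnic differences.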

A peek into a potential future
1. Charcot-Marie-Tooth neuropathy (Lupski et al., 2010, NEJM)
  - Whole genome sequencing of the lead author himself, who has the disease, and his family found 2 causative mutations associated with the disease, in a region on chromosome 5 affecting SH3TC2 (SH3 and tetratricopeptide repeats 2 gene)
2. The Snyder Experiment (Chen R et al., 2012, Cell)
  - Integration of genomic, transcriptomic, proteomic, metabolomic, and autoantibody profiles of a single healthy individual over a 14-month period revealed a predisposition to Type 2 diabetes despite no family history

“knowledge is mightier, IF you wield it right” EMPOWERING THE MASSES

Personal genomics for the Masses
What can you mine from your own genome?
How can you mine your own genome?
What can you tell from your own genome?
• Disease susceptibility
• Ancestry
• Pharmacogenetics
• Traits
• ETC.

The Bottom-up Pyramid
[Figure: pyramid of information flow with levels Public, Clinical research, Basic Research]

Personal genomics for the Masses
• Unprecedented accessibility to the public
• Brought about by direct-to-consumer genomic companies
  - Big 3: deCODE, Navigenics, 23andMe

23andMe
• Genotypes ~1 million SNPs per genome
• Illumina OmniExpress customized microarray
• Ancestry, traits, drug response, disease risks
• Provides your raw data, which you can download
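
The downloadable raw data is the starting point for every do-it-yourself analysis. The sketch below assumes the commonly seen layout of such an export: `#` comment lines followed by tab-separated rsid, chromosome, position and genotype columns, with `--` marking a no-call. The function name, rsIDs and positions are hypothetical placeholders, not real variants.

```python
import csv
import io


def read_genotypes(handle):
    """Parse a raw-data export into {rsid: (chromosome, position, genotype)}.

    Comment lines starting with '#' and no-call genotypes ('--') are skipped.
    """
    genotypes = {}
    data_lines = (line for line in handle if not line.startswith("#"))
    for row in csv.reader(data_lines, delimiter="\t"):
        if len(row) != 4:
            continue  # blank or malformed line
        rsid, chromosome, position, genotype = row
        if genotype == "--":
            continue  # no-call
        genotypes[rsid] = (chromosome, int(position), genotype)
    return genotypes


# Hypothetical three-SNP file standing in for a real download.
sample = io.StringIO(
    "# rsid\tchromosome\tposition\tgenotype\n"
    "rs0000001\t1\t1000\tAA\n"
    "rs0000002\t1\t2000\tAG\n"
    "rs0000003\t2\t3000\t--\n"
)
calls = read_genotypes(sample)
```

For a real file, pass an open file handle instead of the `io.StringIO` stand-in.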

Beyond 23andMe – Ancestry
Population panels:
• HapMap (Intl HapMap Consortium 2003, Nature)
• HGDP (Li JZ et al. 2008, Science)
• PASNP (HUGO PASNP Consortium 2009, Science)
• SGVP (Teo YY et al. 2009, Genome Res.)
[Map figure; labeled region: Middle East. Inset: http://www.clker.com/clipart-9213.html]

Beyond 23andMe – Ancestry
Chen J et al. (2009), AJHG

PCA in genomic data
• SNPs in LD can skew PCA
• Modified PCA (Price et al. (2006), Nat Genet)
  - 0, 1, 2 represent the genotypes of SNPs (0=AA, 1=AB, 2=BB, assuming biallelic SNPs)
  - instead of normalizing by column, normalize by row
  - variables = individuals
  - observations = SNPs
  - correlation matrix of individuals
  - plot PC1 vs PC2 by loadings (variables) instead of by PC scores (observations)

Example genotype matrix (rows are SNPs, columns are samples):
        YOU   CEU1   CEU2   CEU3
SNP1    1     0      …      …
SNP2    2     1      2      0
SNP3    1     2      2      0
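
The row-normalization idea can be sketched in plain Python. This is a simplified illustration in the spirit of the modified PCA described on the slide, not a reimplementation of the published method: each SNP row is centered and scaled by sqrt(p(1-p)), an individual-by-individual covariance matrix is formed, and PC1 loadings come from power iteration. The genotype matrix, the group labels A and B, and all function names are hypothetical; in practice a linear-algebra library would replace the hand-written eigenvector step.

```python
from math import sqrt


def normalize_rows(genotypes):
    """Center and scale each SNP (row) of a SNPs x individuals 0/1/2 matrix."""
    rows = []
    for row in genotypes:
        mean = sum(row) / len(row)
        p = mean / 2                      # simple allele-frequency estimate
        if p <= 0 or p >= 1:
            continue                      # skip monomorphic SNPs
        scale = sqrt(p * (1 - p))
        rows.append([(g - mean) / scale for g in row])
    return rows


def individual_covariance(rows):
    """Individuals x individuals covariance of the normalized rows."""
    n, m = len(rows[0]), len(rows)
    return [[sum(r[i] * r[j] for r in rows) / m for j in range(n)]
            for i in range(n)]


def leading_eigenvector(matrix, iterations=200):
    """PC1 loadings by power iteration (assumes a dominant eigenvalue)."""
    n = len(matrix)
    vec = [1.0] + [0.0] * (n - 1)         # arbitrary starting vector
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = sqrt(sum(x * x for x in nxt))
        if norm == 0:
            raise ValueError("matrix has no variance to decompose")
        vec = [x / norm for x in nxt]
    return vec


# Hypothetical genotypes: columns are individuals A1, A2, B1, B2.
toy_genotypes = [
    [0, 0, 2, 2],
    [0, 1, 2, 2],
    [2, 2, 0, 0],
    [1, 0, 2, 1],
    [2, 2, 1, 0],
    [0, 0, 1, 2],
]
pc1 = leading_eigenvector(individual_covariance(normalize_rows(toy_genotypes)))
```

Because individuals are the variables, each entry of `pc1` is the loading of one person, so people from the same toy group end up on the same side of zero. The overall sign of a principal component is arbitrary.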

PCA interpretation
• Genetic differentiation by geography
• Studies have shown cultural, linguistic and historical associations with such patterns
Novembre J et al. 2008, Nature

International Stem Cell Consortium (2011), Nat Biotech

Disease status
• Considerations:
  - the population panel on which your results are based
  - how well-studied the disease is

Mendelian diseases
• High penetrance
• Highly likely to be detected, hence the results are more likely to be true
• Some populations might have a higher rate
Rosner et al. (2009), Annu. Rev. Genomics Hum. Genet.

Drug Response

Ancestry: Neanderthal

Tools to mine your own genome
Projects/software from the public
• Dienekes Pontikos - EURO-DNA-CALC, Dienekes Anthropology Blog http://dienekes.blogspot.com/2008/06/euro-dna-calc-11-released.html
• Dodecad Project http://dodecad.blogspot.com
• Eurogenes http://eurogenes.blogspot.com/
Other resources
• Galaxy (http://galaxy.psu.edu/)
• Interpretome (http://esquilax.stanford.edu/)
• SNPTips Firefox browser extension (http://snptips.5amsolutions.com/)
• SNPedia/Promethease (http://www.snpedia.com/index.php/Promethease)
• A comprehensive list of tools to probe 23andMe data: http://www.23andyou.com/3rdparty

GALAXY
• http://galaxy.psu.edu/
• Web-based platform
• Designed for anybody to use
• Workflow concept
• GALAXY demo

Genomic elements discovery and annotation
Academia:
• Human Genome Project
• HapMap
• 1000 Genomes Project
Clinic:
• Disease association
• Pharmacogenetics
• Biomarkers
Industry (expedites the democratization process):
• Navigenics
• 23andMe
• deCODEme
• Illumina
• Affymetrix
[Figure labels: Academia + Clinic, Industry, Everybody else]

Some Privacy and ethical issues
• Privacy
  - can your identity really be kept anonymous in a research project?
  - Lin et al. 2004, Science: “Our calculations show that measuring as few as 75 statistically independent SNPs would define a small group that contained the real owner of the DNA.”
• Ethics
  - how much, if at all, of your genomic information do you own?
  - where do biological relatives stand in all this?
  - genetic discrimination, especially by insurance companies
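
A back-of-the-envelope calculation shows why so few SNPs suffice. The sketch below is not the calculation from the quoted paper: it assumes Hardy-Weinberg proportions, fully independent SNPs, a minor allele frequency of 0.5 at every SNP and a round world population of 7 billion, all of which are illustrative assumptions, as are the function names.

```python
def genotype_match_probability(maf):
    """P(two unrelated people share a genotype at one SNP), Hardy-Weinberg."""
    q = 1 - maf
    genotype_freqs = (q * q, 2 * maf * q, maf * maf)
    return sum(f * f for f in genotype_freqs)


def expected_matches(mafs, population_size):
    """Expected number of other people matching at every listed SNP."""
    probability = 1.0
    for maf in mafs:
        probability *= genotype_match_probability(maf)
    return probability * population_size


# Assumed inputs: 75 independent SNPs, each with MAF 0.5, 7 billion people.
per_snp = genotype_match_probability(0.5)
matches = expected_matches([0.5] * 75, 7_000_000_000)
```

Each SNP alone matches more than a third of the time, but the product over 75 independent SNPs is so small that the expected number of chance matches worldwide is far below one.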