The International HapMap Project: a Rich Resource of Genetic Information
Julia Krushkal
Department of Preventive Medicine
The University of Tennessee Health Science Center
jkrushka{at}utmem.edu

HapMap Population Samples
Project launched in 2002 to provide a public resource for accelerating medical genetic research
270 individuals from 4 geographically diverse populations:
• YRI: 90 Yoruba from Ibadan, Nigeria (30 parent-offspring trios)
• CEU: 90 individuals of northern and western European descent living in Utah, USA, from the Centre d'Etude du Polymorphisme Humain (CEPH) collection (30 parent-offspring trios)
• CHB: 45 unrelated Han Chinese from Beijing, China
• JPT: 45 unrelated Japanese from Tokyo, Japan
HapMap: http://www.hapmap.org/
NHGRI: http://www.genome.gov/page.cfm?pageID=10001688

The International HapMap Project
“…Determine the common patterns of DNA sequence variation in the human genome, by characterizing sequence variants, their frequencies, and correlations between them, in DNA samples from populations with ancestry from parts of Africa, Asia and Europe.” Nature (2003)
• Population-specific sequence variation
• Allele frequencies
• Linkage disequilibrium patterns
• Haplotype information
• Tag SNPs
• Structural genome variation
• Better understanding of human population dynamics and of the history of human populations
• Cell lines available from the Coriell Institute for Medical Research
• A rich resource for biomedical genetic analysis

International HapMap Project Papers
• The Int. HapMap Consortium. A second generation human haplotype map of over 3.1 million SNPs. Nature 449: 851-861. 2007.
• The Int. HapMap Consortium. A Haplotype Map of the Human Genome. Nature 437: 1299-1320. 2005.
• The Int. HapMap Consortium. The International HapMap Project. Nature 426: 789-796. 2003.
• The Int. HapMap Consortium. Integrating Ethics and Science in the International HapMap Project. Nature Reviews Genet 5: 467-475. 2004.
• Thorisson et al. The International HapMap Project Web site. Genome Res 15: 1591-1593. 2005.
HapMap-related papers
• Sabeti et al. Genome-wide detection and characterization of positive selection in human populations. Nature 449: 913-918. 2007.
• Clark et al. Ascertainment bias in studies of human genome-wide polymorphism. Genome Res 15: 1496-1502. 2005.
• Clayton et al. Population structure, differential bias and genomic control in a large-scale, case-control association study. Nature Genet 37(11): 1243-1246. 2005.
• de Bakker et al. Efficiency and power in genetic association studies. Nature Genet 37(11): 1217-1223. 2005.
• Goldstein, Cavalleri. Genomics: Understanding human diversity. Nature 437: 1241-1242. 2005.
• Hinds et al. Whole genome patterns of common DNA variation in three human populations. Science 307: 1072-1079. 2005.
• Myers et al. A fine-scale map of recombination rates and hotspots across the human genome. Science 310: 321-324. 2005.
• Nielsen et al. Genomic scans for selective sweeps using SNP data. Genome Res 15: 1566-1575. 2005.
• Smith et al. Sequence features in regions of weak and strong linkage disequilibrium. Genome Res 15: 1519-1534. 2005.
• Weir et al. Measures of human population structure show heterogeneity among genomic regions. Genome Res 15: 1468-1476. 2005.

Nature (2003)

Human Chromosomes
• Contain DNA
• 22 pairs of autosomes + sex chromosomes (X and Y) + mitochondrial genome
• Contain functional units (genes) and other DNA
The human genome sequence is available as a reference, as a result of the Human Genome Project.
A significant amount of inter-individual variation exists.

Some Basic Definitions
Locus: a site in the genome
The DNA in the human genome is not a static entity. There are differences between different copies:
Allele: a genetic variant, i.e., a form (state) of a locus
Mutation: a genetic change
An individual carries two copies of each locus on autosomes.
Alleles are passed from parents to offspring (one from each parent).
Genotype: the set of alleles an individual carries at a given locus

Chromosomes are sets of continuously linked genetic loci
Example: Integrated map of chromosome 5 from the International HapMap Project, http://www.hapmap.org

Genetic Variation
• Some DNA loci vary among individuals
• Linked genetic loci are inherited non-independently
• Loci may change with time (mutation, selection, genetic drift)
• Some DNA changes lead to quantitative changes in RNA expression and to quantitative or qualitative changes in protein production
• Some genetic changes, even small ones, may lead to disease
• A large amount of natural variation occurs in healthy individuals, i.e., many changes are neutral
• Loci genetically linked to the disease-causing locus can be used as genetic markers to search for the disease locus
There are many types of DNA variation, e.g.:
Sequence variation: AAA[C/T]GGCTA
Microsatellite repeats: …AATG AATG…

Polymorphic Site
• A locus with common DNA variation
• ≥ 2 alleles in a population
• Shows difference in DNA sequence among individuals
• In most definitions: the most common allele has frequency < 99%, or minor allele frequency (MAF) ≥ 1%, or MAF ≥ 2%, or at least two alleles have frequencies ≥ 1%
• A locus whose rare allele occurs in < 1% of the population is usually not considered a polymorphic site
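
The MAF-based definition can be checked directly from genotype data. The following is a minimal sketch, assuming genotypes are supplied as two-character strings such as "AC"; the function names and the sample data are hypothetical and chosen only for illustration.

```python
from collections import Counter


def allele_frequencies(genotypes):
    """Return {allele: frequency} from diploid genotypes such as 'AC'."""
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}


def minor_allele_frequency(genotypes):
    """Frequency of the second most common allele (0.0 if only one allele is seen)."""
    freqs = sorted(allele_frequencies(genotypes).values(), reverse=True)
    return freqs[1] if len(freqs) > 1 else 0.0


def is_polymorphic(genotypes, threshold=0.01):
    """Apply the MAF >= threshold definition (1% by default)."""
    return minor_allele_frequency(genotypes) >= threshold


# Hypothetical sample of 100 individuals: 178 A alleles and 22 C alleles.
sample = ["AA"] * 80 + ["AC"] * 18 + ["CC"] * 2
maf = minor_allele_frequency(sample)
```

Here `maf` is 22/200 = 0.11, so the locus counts as polymorphic under either the 1% or the 2% threshold.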

SNP = Single Nucleotide Polymorphism
A SNP locus on the distal end of the long arm of human chromosome 5 (data from Ensembl, http://www.ensembl.org)
SNP locus rs6870660: CAAATTCCATG[A or C]AGAAGGAAATACAT
A and C are alleles at SNP locus rs6870660

A SNP locus on the distal end of the long arm of chromosome 5
SNP locus rs6870660, http://www.hapmap.org

Regulatory Interactions: The ENCODE Project
• 2003: Pilot project launched (1% of the genome)
• 2007: Pilot project completed; production phase launched on the entire genome
• Components: Production Scale Effort, Pilot Scale Effort, Data Coordination Center, Technology Development Effort
• High-throughput experimental and computational approaches to studies of DNA regulatory sites, regulatory interactions, and DNA modification

Genome SNP Variation
• Size of the human genome is 3.2 × 10^9 bp
• 99.9% identical
• 9-10 million SNPs may have MAF ≥ 5%
• 30,000 genes
HapMap SNP Density Coverage
• Phase I (published in 2005): 1,007,329 SNPs that passed quality control; 1 SNP / 3000 bp; 11,500 nsSNPs
• 10 ENCODE regions, 500 kb each: 17,944 SNPs; 1 SNP / 279 bp
• Phase II (published in 2007): >3,806,000 SNPs; 1 SNP / 875 bp; 25-30% of all SNPs with MAF ≥ 5%
Figure: The cumulative number of non-redundant SNPs (each mapped to a single location in the genome) is shown as a solid line, as well as the number of SNPs validated by genotyping (dotted line) and double-hit status (dashed line). Years are divided into quarters (Q1–Q4).

http://www.hapmap.org/

SNP Differences among Individuals Far Exceed Differences among Populations
Phase I:
• Autosomes: Across the 1 million SNPs genotyped, only 11 have fixed differences between CEU and YRI, 21 between CEU and CHB/JPT, and 5 between YRI and CHB/JPT.
• X chromosome: 123 SNPs were completely differentiated between YRI and CHB/JPT, but only 2 between CEU and YRI and 1 between CEU and CHB/JPT.

Haplotypes
A haplotype is a set of alleles at multiple loci located on the same copy of the chromosome.
Genotype calls obtained from sequencing or DNA chip genotyping do not provide information about which of the two chromosomal copies a particular allele belongs to.
E.g., genotypes and haplotypes for individual X:
SNP      Genotype        Haplotype 1   Haplotype 2
SNP A    A1 A2 = A/T     A1 = A        A2 = T
SNP B    B1 B2 = T/C     B2 = C        B1 = T
SNP C    C1 C2 = G/C     C2 = C        C1 = G
Haplotype 1: A1 B2 C2 = A C C
Haplotype 2: A2 B1 C1 = T T G
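
The phase ambiguity described above can be made concrete by listing every pair of haplotypes that is consistent with a set of unphased genotypes. This is a minimal sketch, assuming each genotype is given as a pair of single-letter alleles; the function name is hypothetical, and real phasing software goes further by ranking the candidates statistically rather than only enumerating them.

```python
from itertools import product


def compatible_haplotype_pairs(genotypes):
    """List all unordered haplotype pairs consistent with unphased genotypes.

    genotypes: one (allele, allele) tuple per SNP, in chromosome order.
    """
    pairs = set()
    # At every SNP, choose which of the two alleles sits on the first copy.
    for choice in product((0, 1), repeat=len(genotypes)):
        hap1 = "".join(g[c] for g, c in zip(genotypes, choice))
        hap2 = "".join(g[1 - c] for g, c in zip(genotypes, choice))
        # Sorting makes (hap1, hap2) and (hap2, hap1) the same pair.
        pairs.add(tuple(sorted((hap1, hap2))))
    return pairs


# Genotypes of individual X from the slide: A/T, T/C, G/C.
individual_x = [("A", "T"), ("T", "C"), ("G", "C")]
pairs = compatible_haplotype_pairs(individual_x)
```

For individual X, who is heterozygous at all three SNPs, four haplotype pairs are possible, and the pair ACC / TTG shown on the slide is only one of them. Genotype data alone cannot tell them apart.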

Recombination
• “Random” event
• Occurs during meiosis
• The larger the distance between loci, or the more generations pass, the more likely it is that recombination(s) will occur
Diagram: parental haplotypes A1B1 and A2B2 undergo recombination (crossing-over)
Nonrecombinant haplotypes: A1B1, A2B2
Recombinant haplotypes: A1B2, A2B1

Figure: Two ancestral chromosomes being scrambled through recombination over many generations to yield different descendant chromosomes. If an A allele on the ancestral chromosome increases the risk of a disease, the two individuals in the current generation who inherit that part of the ancestral chromosome will be at increased risk. Source: the International HapMap Project

Linkage Disequilibrium
Associations among alleles at different loci (locus A with alleles A1 and A2, locus B with alleles B1 and B2)
Linkage disequilibrium coefficient (coefficient of association): $D = p_{A_1B_1} - p_{A_1} p_{B_1}$
Normalized disequilibrium coefficient: $D' = D / |D|_{\max}$, with $-1 \le D' \le 1$, where $|D|_{\max} = \min(p_{A_1} p_{B_2},\ p_{A_2} p_{B_1})$ for $D > 0$ and $|D|_{\max} = \min(p_{A_1} p_{B_1},\ p_{A_2} p_{B_2})$ for $D < 0$
Correlation coefficient: $r = D / \sqrt{p_{A_1} p_{A_2} p_{B_1} p_{B_2}}$
In case of no association, $D = 0$ (linkage equilibrium)
Practical implications in fine gene mapping: search for locus B using the association of marker loci with disease
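
The three LD measures can be computed from one haplotype frequency and two allele frequencies. The sketch below is illustrative only: the function name and the example frequencies are hypothetical, and the bound used for negative D, min(pA1·pB1, pA2·pB2), is the standard one and is stated here as an assumption, since the slide spells out only one form of the bound.

```python
from math import sqrt


def ld_statistics(p_a1b1, p_a1, p_b1):
    """Return (D, D', r, r^2) for two biallelic loci.

    p_a1b1: frequency of the A1-B1 haplotype
    p_a1, p_b1: frequencies of alleles A1 and B1
    """
    p_a2, p_b2 = 1.0 - p_a1, 1.0 - p_b1
    d = p_a1b1 - p_a1 * p_b1
    # The largest |D| allowed by the allele frequencies depends on the sign of D.
    if d >= 0:
        d_max = min(p_a1 * p_b2, p_a2 * p_b1)
    else:
        d_max = min(p_a1 * p_b1, p_a2 * p_b2)
    d_prime = d / d_max if d_max > 0 else 0.0
    denom = p_a1 * p_a2 * p_b1 * p_b2
    r = d / sqrt(denom) if denom > 0 else 0.0
    return d, d_prime, r, r * r


# Hypothetical frequencies: p(A1B1) = 0.4, p(A1) = 0.6, p(B1) = 0.5.
d, d_prime, r, r2 = ld_statistics(0.4, 0.6, 0.5)
```

With these inputs D = 0.1 and D' = 0.5, while r² is only about 0.17. D' and r² can differ substantially for the same pair of loci, which is why tagging thresholds are usually stated in terms of r².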

The value of D decreases geometrically with each generation
$D(t) = (1 - \theta)\, D(t-1)$
$D(t) = (1 - \theta)^{t}\, D(0)$
where $\theta$ is the recombination fraction between loci A and B and $t$ is the number of generations.
Unless the two loci are closely linked, the value of D should rapidly decrease to 0. The occurrence of association between two loci implies that they are closely linked.
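
The geometric decay can be tabulated with a few lines of code. This is a sketch under the assumption that the recombination fraction is written `theta` (the symbol is lost in the slide text, so the name is a choice made here); both function names are hypothetical.

```python
from math import log


def ld_after_generations(d0, theta, generations):
    """D(t) = (1 - theta)**t * D(0)."""
    return d0 * (1.0 - theta) ** generations


def ld_half_life(theta):
    """Number of generations for D to halve; requires 0 < theta < 1."""
    return log(0.5) / log(1.0 - theta)


# Unlinked loci (theta = 0.5): D halves every generation.
unlinked = [ld_after_generations(0.25, 0.5, t) for t in range(4)]

# Closely linked loci (theta = 0.001): D persists far longer.
linked_half_life = ld_half_life(0.001)
```

For unlinked loci the list is 0.25, 0.125, 0.0625, 0.03125, while for theta = 0.001 the half-life is close to 700 generations. Long-lived association of this kind is what makes nearby markers informative about a disease locus.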

Haplotype Maps Generated by the International HapMap Project
Three steps in constructing the HapMap:
(a) SNPs are identified in DNA samples from multiple individuals.
(b) Adjacent SNPs that are inherited together are compiled into haplotypes.
(c) "Tag" SNPs are identified within haplotypes that uniquely describe those haplotypes.
Source: The International HapMap Project

Haplotype Maps of the Human Genome
Helmuth 2001, Science 293: 583-585
• Find correlations among groups of SNPs
• Haplotypes were inferred for the HapMap project from trio data and from unrelated individuals using PHASE (Stephens 01; Stephens and Donnelly 03)

Haplotype Maps of the Human Genome
Regions decomposed into discrete haplotype blocks, which capture similarity in haplotype organization
Patil et al. 2001. Blocks of Limited Haplotype Diversity Revealed by High-Resolution Scanning of Human Chromosome 21. Science 294(5547): 1719-23

Haplotype Block Partition Results for Three Populations
1,586,383 SNPs genotyped in 71 Americans of European, African, and Asian ancestry
Population          Blocks    Average size, kb*   Required SNPs†
African-American    235,663   8.8                 570,886
European-American   109,913   20.7                275,960
Han Chinese         89,994    25.2                220,809
* Average distance spanned by segregating sites in each block.
† Minimum number of SNPs required to distinguish common haplotype patterns with frequencies of 5% or higher.
Hinds et al. 2005 Science

Hinds et al. 2005
Figure: Extended LD bin and haplotype block structure around the CFTR gene. LD bins, where each bin has at least one SNP with r² > 0.8 with every other SNP, are depicted as light horizontal bars, with the positions of constituent SNPs indicated by vertical tick marks as well as the extreme ends of the bars. Isolated SNPs are indicated by plain tick marks. Haplotype blocks, within which at least 80% of observed haplotypes could be grouped into common patterns with frequencies of at least 5%, are depicted as dark horizontal bars. Unlike haplotype blocks, which are by design sequential and nonoverlapping, SNPs in one LD bin can be interdigitated with SNPs in multiple other overlapping bins.
• Population differences in local bin structure
• Differences in allele and haplotype frequencies
“Although analysis panels are characterized both by different haplotype frequencies and, to some extent, different combinations of alleles, both common and rare haplotypes are often shared across populations” (The Int. HapMap Project, Nature, 2005)

Tag SNP (htSNP) selection
• Pairwise LD-based and haploblock-based tagging methods
• Partition haplotypes into blocks
• Can use haplotype-based (haploblocks) or genotype-based (LD-blocks) partitioning
• Select representative htSNPs from each block
• Latest DNA microarrays aim to capture SNPs with r² ≥ 0.8
“Tags are the subset of variants genotyped in a disease study. SNPs that are not typed in the study but whose effect can be studied through LD with a tag are termed proxies. A tag with perfect correlation (r² = 1) to an untyped putative causal allele is termed a perfect proxy.” de Bakker et al., 2005
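
Pairwise LD-based tagging can be sketched as a greedy procedure: repeatedly pick the SNP that captures the most not-yet-captured SNPs at the chosen r² threshold. The code below is an illustration under stated assumptions, not the method of any particular tagging program. The function names, the nested-dictionary input format, and the toy r² values are all hypothetical.

```python
def captured_by(snp, r2, candidates, threshold):
    """SNPs among `candidates` captured by `snp` (itself, or r^2 >= threshold)."""
    return {s for s in candidates if s == snp or r2[snp].get(s, 0.0) >= threshold}


def greedy_tag_snps(r2, threshold=0.8):
    """Greedy pairwise tagging.

    r2: {snp: {other_snp: r^2}}, assumed symmetric.
    Returns {tag_snp: sorted list of SNPs it captures, including itself}.
    """
    untagged = set(r2)
    tags = {}
    while untagged:
        # Iterating in sorted order makes ties resolve alphabetically.
        best = max(sorted(untagged),
                   key=lambda s: len(captured_by(s, r2, untagged, threshold)))
        group = captured_by(best, r2, untagged, threshold)
        tags[best] = sorted(group)
        untagged -= group
    return tags


# Hypothetical pairwise r^2 values for four SNPs.
toy_r2 = {
    "snp1": {"snp2": 0.95, "snp3": 0.85, "snp4": 0.10},
    "snp2": {"snp1": 0.95, "snp3": 0.60, "snp4": 0.20},
    "snp3": {"snp1": 0.85, "snp2": 0.60, "snp4": 0.30},
    "snp4": {"snp1": 0.10, "snp2": 0.20, "snp3": 0.30},
}
tags = greedy_tag_snps(toy_r2)
```

In this toy example two tags suffice: `snp1` captures `snp2` and `snp3` as proxies, and `snp4` must be typed directly because nothing is in strong LD with it.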

Tag SNP, Haplotypes, and LD
The Int. HapMap Consortium, Nature, 2005

Use of Haplotypes in Association Analysis
• Testing one marker at a time for associations is very time-consuming
• Problem of multiple testing
• When testing individual SNPs, we are not utilizing information from other markers
Benefits of Using Haplotypes
• Haplotypes allow us to use information from multiple loci simultaneously
• LD information between loci is captured

Benefits of Haplotype Analysis
• Construct a single highly informative mega-locus from a number of less informative but closely linked loci
• Identify genotyping or data entry errors. Likelihood ratio tests indicate which typings are more likely to be errors
• Find boundaries of conserved haplotypes associated with a trait
• Employs recombinations from the entire history of a population

Amount of Captured Sequence Variation in HapMap Phase II
• For common variants (MAF ≥ 0.05) the mean maximum r² of any SNP to a typed one is 0.90 in YRI, 0.96 in CEU and 0.95 in CHB/JPT.
• 1.09 million SNPs capture all common Phase II SNPs with r² ≥ 0.8 in YRI.
• Very common SNPs with MAF ≥ 0.25 are captured extremely well (mean maximum r² of 0.93 in YRI to 0.97 in CEU).
• Rarer SNPs with MAF < 0.05 are less well covered (mean maximum r² of 0.74 in CHB/JPT to 0.76 in YRI).
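
The "mean maximum r²" coverage statistic can be illustrated on toy data: for every SNP that is not typed, take its highest r² with any typed SNP, then average those values. The sketch below assumes a nested dictionary of pairwise r² values and at least one typed and one untyped SNP. The function names and the numbers are hypothetical and unrelated to the HapMap figures quoted above.

```python
def max_r2_to_typed(r2, typed):
    """For each untyped SNP, its highest r^2 with any typed SNP."""
    return {
        snp: max(r2[snp].get(t, 0.0) for t in typed)
        for snp in r2
        if snp not in typed
    }


def mean_max_r2(r2, typed):
    """Average of the per-SNP maximum r^2 over all untyped SNPs."""
    best = max_r2_to_typed(r2, typed)
    return sum(best.values()) / len(best)


def fraction_captured(r2, typed, threshold=0.8):
    """Share of untyped SNPs with maximum r^2 at or above the threshold."""
    best = max_r2_to_typed(r2, typed)
    return sum(value >= threshold for value in best.values()) / len(best)


# Hypothetical pairwise r^2 values; snp1 and snp4 are on the array.
toy_r2 = {
    "snp1": {"snp2": 0.95, "snp3": 0.85, "snp4": 0.10},
    "snp2": {"snp1": 0.95, "snp3": 0.60, "snp4": 0.20},
    "snp3": {"snp1": 0.85, "snp2": 0.60, "snp4": 0.30},
    "snp4": {"snp1": 0.10, "snp2": 0.20, "snp3": 0.30},
}
coverage = mean_max_r2(toy_r2, typed={"snp1", "snp4"})
captured = fraction_captured(toy_r2, typed={"snp1", "snp4"})
```

Here the two untyped SNPs have maximum r² of 0.95 and 0.85, giving a mean of about 0.90, and both clear the 0.8 threshold.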

Recombination Hot Spots

Structural Genome Variation
HapMap samples are also used as a resource for CNV analysis
• Large numbers of copy number variants (CNVs) and other genome rearrangements are found among individuals
• Some variation is assumed to be normal; other variation may cause disease
• Genome databases, e.g. the Database of Genomic Variants at the TCAG of the Toronto Hospital for Sick Children, and the Copy Number Variation Project Map at the Sanger Center

 • Segmental duplications are recombination hotspots, causing global genome rearrangements

HapMap Genome Browser

Perlegen Genotype Browser

UCSC Genome Browser http://genome.ucsc.edu/

DNA Chips and Resequencing: High-throughput Analysis of Sequence Variation
An easy way to access genome-wide variation
Both Affymetrix and Illumina DNA chips contain representative SNP and CNV probes
• Affymetrix GeneChip 6.0: 1.8 million markers for genetic variation, including 906,000 SNPs and 946,000 copy number probes
• Illumina 1M BeadChip and 1M-Duo BeadChip: ~950,000 genome-spanning tag SNPs; ~100,000 additional non-HapMap SNPs; >565,000 SNPs in and near coding regions, such as nsSNPs, promoter regions, 3' and 5' UTRs; dense coverage in ADME and MHC regions; ~260,000 markers located in novel and reported copy number polymorphic regions
• Sequenom mass arrays (based on MALDI-TOF)

Genome-Wide Association
• Select representative htSNPs from low-diversity haplotype blocks
• Adjustment for multiple comparisons
• LD values are highly variable: a smoothing function is needed
• Haplotypes in a sliding window, OR screen for top SNPs
• Likely functional SNPs in genes involved in pathways of interest
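
The slide does not say which multiple-comparison adjustment is meant. As one common and deliberately conservative choice, the sketch below applies a Bonferroni correction, which divides the significance level by the number of tests. The function names, SNP labels, and p-values are hypothetical.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests


def significant_snps(p_values, n_tests, alpha=0.05):
    """Return the SNPs whose p-value falls below the corrected threshold.

    p_values: {snp: p-value}; n_tests: total number of SNPs tested genome-wide.
    """
    threshold = bonferroni_threshold(alpha, n_tests)
    return {snp for snp, p in p_values.items() if p < threshold}


# Hypothetical top hits from a scan of one million SNPs.
top_hits = {"snp_a": 1e-9, "snp_b": 3e-8, "snp_c": 1e-5}
hits = significant_snps(top_hits, n_tests=1_000_000)
```

With one million tests and alpha = 0.05 the per-SNP threshold is about 5 × 10^-8, so `snp_a` and `snp_b` pass while `snp_c` does not. Because SNPs in LD are not independent tests, Bonferroni is stricter than necessary, which is one motivation for the haplotype-based and tag-based strategies listed above.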

Use of Phase-Resolved Data in Association Analysis
• Find association with haplotypes, similar to analyses of individual SNP alleles; need to consider multiple testing
• Test for tendency of cases to ‘cluster’ around groups of ‘similar’ haplotypes
• Extend log-linear approach to take haplotype structure into account
Modifications are also used for ambiguous phase

http://www.genome.gov/26525384
As of 04/14/2008, GWAS of 150 traits posted

Special Thanks to
• Ken Manly, whose presentation ideas for the HapMap module 2006 inspired and helped organize this presentation