Statistical Methods for Quantitative Trait Loci (QTL) Mapping
Lecture 4 – Oct 10, 2011
CSE 527 Computational Biology, Fall 2011
Instructor: Su-In Lee
TA: Christopher Miles
Monday & Wednesday 12:00-1:20, Johnson Hall (JHN) 022

Outline
• Learning from data
  • Maximum likelihood estimation (MLE)
  • Maximum a posteriori (MAP)
  • Expectation-maximization (EM) algorithm
• Basic concepts
  • Allele, allele frequencies, genotype frequencies
  • Hardy-Weinberg equilibrium
• Statistical methods for mapping QTL
  • What is QTL?
  • Experimental animals
  • Analysis of variance (marker regression)
  • Interval mapping (EM)

Continuous Space Revisited...
• Assuming the sample x_1, x_2, …, x_n is from a mixture of parametric distributions,
[Figure: points x_1, x_2, …, x_m and x_{m+1}, …, x_n along the X axis]

A Real Example
• CpG content of human gene promoters
[Figure; axis label: GC frequency]
“A genome-wide analysis of CpG dinucleotides in the human genome distinguishes two distinct classes of promoters”, Saxonov, Berg, and Brutlag, PNAS 2006; 103:1412-1417

Mixture of Gaussians
• Parameters θ: means, variances, mixing parameters
• P.D.F. [equation not preserved]
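The slide's formulas did not survive as text. As a sketch, a two-component Gaussian mixture is commonly written as follows; the symbol τ_j for the mixing parameters is an assumption made here, not necessarily the lecture's notation.

```latex
f(x \mid \theta) = \tau_1\, \mathcal{N}(x \mid \mu_1, \sigma_1^2) + \tau_2\, \mathcal{N}(x \mid \mu_2, \sigma_2^2),
\qquad \tau_1 + \tau_2 = 1,\quad \tau_j \ge 0

\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right),
\qquad \theta = (\mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \tau_1, \tau_2)
```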

A What-If Puzzle
• Likelihood L [equation not preserved]
• No closed-form solution is known for finding the θ that maximizes L.
• However, what if we knew the hidden data?
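To make the puzzle concrete, here is a sketch for the two-component Gaussian mixture, assuming mixing parameters τ_j, the Gaussian density N(x | μ, σ²), and hidden indicators z_ij = 1 when x_i was drawn from component j (this notation is assumed, not taken from the slide). The observed-data likelihood has a sum inside the product, so its logarithm does not separate by component; once the z_ij are known, it does.

```latex
L(x_1,\dots,x_n \mid \theta) = \prod_{i=1}^{n} \sum_{j=1}^{2} \tau_j\, \mathcal{N}(x_i \mid \mu_j, \sigma_j^2)

\log L(x, z \mid \theta) = \sum_{i=1}^{n} \sum_{j=1}^{2} z_{ij} \left[ \log \tau_j + \log \mathcal{N}(x_i \mid \mu_j, \sigma_j^2) \right]
```

With known z_ij, maximizing the second expression reduces to the usual per-cluster sample mean, variance, and proportion.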

EM as Chicken vs Egg
• IF z_ij known, could estimate parameters θ
  • e.g., only points in cluster 2 influence μ_2, σ_2.
• IF parameters θ known, could estimate z_ij
  • e.g., if |x_i - μ_1|/σ_1 << |x_i - μ_2|/σ_2, then z_i1 >> z_i2
• BUT we know neither; (optimistically) iterate:
  • E-step: calculate expected z_ij, given parameters
  • M-step: do “MLE” for parameters (μ, σ), given E(z_ij)
  • Overall, a clever “hill-climbing” strategy
• Convergence provable? YES
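The E-step/M-step loop can be sketched in a few lines of Python. This is a minimal illustration for two Gaussian components, not the lecture's own code: the function names, the mixing weights `taus`, the starting values, and the simulated data are all assumptions chosen for the example.

```python
import math
import random


def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def log_likelihood(xs, mus, sigmas, taus):
    """Log-likelihood of the sample under the two-component mixture."""
    return sum(
        math.log(sum(t * normal_pdf(x, m, s) for m, s, t in zip(mus, sigmas, taus)))
        for x in xs
    )


def em_two_gaussians(xs, mus, sigmas, taus, n_iter=50):
    """Run EM from the given starting values; also returns the log-likelihood trace."""
    mus, sigmas, taus = list(mus), list(sigmas), list(taus)
    trace = [log_likelihood(xs, mus, sigmas, taus)]
    for _ in range(n_iter):
        # E-step: expected z_ij = P(point i came from component j | current parameters)
        z = []
        for x in xs:
            w = [t * normal_pdf(x, m, s) for m, s, t in zip(mus, sigmas, taus)]
            total = sum(w)
            z.append([wj / total for wj in w])
        # M-step: weighted MLE for each component, using E(z_ij) as weights
        for j in range(2):
            nj = sum(zi[j] for zi in z)
            mus[j] = sum(zi[j] * x for zi, x in zip(z, xs)) / nj
            var = sum(zi[j] * (x - mus[j]) ** 2 for zi, x in zip(z, xs)) / nj
            sigmas[j] = math.sqrt(var)
            taus[j] = nj / len(xs)
        trace.append(log_likelihood(xs, mus, sigmas, taus))
    return mus, sigmas, taus, trace


# Simulated sample (illustrative): two clusters centred at 0 and 5
random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(200)] + [random.gauss(5.0, 1.0) for _ in range(200)]
mus, sigmas, taus, trace = em_two_gaussians(data, mus=[1.0, 4.0], sigmas=[1.0, 1.0], taus=[0.5, 0.5])
```

The `trace` list lets one check the hill-climbing behaviour directly: the log-likelihood should never go down from one iteration to the next.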

Simple Version: “Classification EM”
• If z_ij < 0.5, pretend it’s 0; if z_ij > 0.5, pretend it’s 1, i.e., classify points as component 0 or 1
• Now recalculate θ, assuming that partition
• Then recalculate z_ij, assuming that θ
• Then recalculate θ, assuming new z_ij, etc.
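A hard-assignment variant might look like the sketch below. It assumes two Gaussian components with mixing weights `taus`; rounding z_ij at 0.5 is implemented by assigning each point to the component with the larger weighted density, which is equivalent when there are two components. Names, starting values, and data are illustrative.

```python
import math
import random


def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def classification_em(xs, mus, sigmas, n_iter=20):
    """Hard-assignment EM for two Gaussian components labelled 0 and 1."""
    mus, sigmas = list(mus), list(sigmas)
    taus = [0.5, 0.5]
    labels = []
    for _ in range(n_iter):
        # Rounded E-step: classify each point as component 0 or 1
        labels = [
            0
            if taus[0] * normal_pdf(x, mus[0], sigmas[0]) >= taus[1] * normal_pdf(x, mus[1], sigmas[1])
            else 1
            for x in xs
        ]
        # M-step: plain MLE within each part of the partition
        for j in range(2):
            members = [x for x, lab in zip(xs, labels) if lab == j]
            if len(members) < 2:
                continue  # keep the old parameters if a component is (nearly) empty
            mus[j] = sum(members) / len(members)
            var = sum((x - mus[j]) ** 2 for x in members) / len(members)
            sigmas[j] = max(math.sqrt(var), 1e-6)
            taus[j] = len(members) / len(xs)
    return mus, sigmas, taus, labels


random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(200)] + [random.gauss(5.0, 1.0) for _ in range(200)]
mus, sigmas, taus, labels = classification_em(data, mus=[1.0, 4.0], sigmas=[1.0, 1.0])
```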

EM summary
• Fundamentally an MLE problem
• EM steps
  • E-step: calculate expected z_ij, given parameters
  • M-step: do “MLE” for parameters (μ, σ), given E(z_ij)
• EM is guaranteed not to decrease the likelihood at any E-M iteration, hence the likelihood will converge.
• But it may converge to a local, not global, max.
• Nevertheless, widely used, often effective

Outline
• Basic concepts
  • Allele, allele frequencies, genotype frequencies
  • Hardy-Weinberg equilibrium
• Statistical methods for mapping QTL
  • What is QTL?
  • Experimental animals
  • Analysis of variance (marker regression)
  • Interval mapping (Expectation Maximization)

Alleles
• Alternative forms of a particular sequence
• Each allele has a frequency, which is the proportion of chromosomes of that type in the population

…ACTCGGTTGGCCTTAATTCGGCCCGG…   …ACTCGGTTGGCCTAAATTCGGCCCGG…
…ACCCGGTAGGCCTTAATTCGGCCCGG…
…ACCCGGTAGGCCTTAATTCGGCC--GG…  …ACCCGGTAGGCCTTAATTCGGCCCGG…
…ACCCGGTTGGCCTTAATTCGGCCGGG…   …ACCCGGTTGGCCTTAATTCGGCCGGG…

• C, G and -- are alleles
• Single nucleotide polymorphism (SNP)
• Allele frequencies for C, G, --

Allele frequency notations
• For two alleles
  • Usually labeled p and q = 1 - p
  • e.g. p = frequency of C, q = frequency of G
• For more than 2 alleles
  • Usually labeled p_A, p_B, p_C, ...
  • … subscripts A, B and C indicate allele names

Genotype
• The pair of alleles carried by an individual
  • If there are n alternative alleles …
  • … there will be n(n+1)/2 possible genotypes
  • In most cases, there are 3 possible genotypes
• Homozygotes
  • The two alleles are in the same state (e.g. CC, GG, AA)
• Heterozygotes
  • The two alleles are different (e.g. CG, AC)
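The n(n+1)/2 count comes from choosing unordered pairs of alleles with repetition allowed. A small sketch (the function name is hypothetical) enumerates them:

```python
from itertools import combinations_with_replacement


def possible_genotypes(alleles):
    """All unordered allele pairs: homozygotes and heterozygotes."""
    return ["".join(pair) for pair in combinations_with_replacement(alleles, 2)]


two_allele = possible_genotypes(["C", "G"])         # 3 genotypes: CC, CG, GG
three_allele = possible_genotypes(["A", "B", "C"])  # 3 * 4 / 2 = 6 genotypes
```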

Genotype frequencies
• Since alleles occur in pairs, these are a useful descriptor of genetic data.
• However, in any non-trivial study we might have a lot of frequencies to estimate.
  • p_AA, p_AB, p_AC, … p_BB, p_BC, … p_CC …

The simple part
• Genotype frequencies lead to allele frequencies.
• For example, for two alleles:
  • p_A = p_AA + ½ p_AB
  • p_B = p_BB + ½ p_AB
• However, the reverse is also possible!
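The same bookkeeping works for any number of alleles: each genotype contributes half of its frequency to each of the two allele copies it carries. A minimal sketch, with a hypothetical function name and made-up frequencies:

```python
def allele_frequencies(genotype_freqs):
    """genotype_freqs maps unordered allele pairs, e.g. ("A", "B"), to frequencies summing to 1."""
    freqs = {}
    for (a1, a2), f in genotype_freqs.items():
        # Each of the two allele copies in a genotype carries half of its frequency;
        # a homozygote therefore contributes its full frequency to one allele.
        freqs[a1] = freqs.get(a1, 0.0) + 0.5 * f
        freqs[a2] = freqs.get(a2, 0.0) + 0.5 * f
    return freqs


p = allele_frequencies({("A", "A"): 0.5, ("A", "B"): 0.3, ("B", "B"): 0.2})
# p["A"] = 0.5 + 0.5 * 0.3 and p["B"] = 0.2 + 0.5 * 0.3
```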

Hardy-Weinberg Equilibrium
• Relationship described in 1908
  • Hardy, British mathematician
  • Weinberg, German physician
• Shows n allele frequencies determine n(n+1)/2 genotype frequencies
  • Large populations
  • Random union of the two gametes produced by two individuals

Random Mating: Mating Type Frequencies
• Denoting the genotype frequency of A_iA_j by p_ij,

Mating          Frequency
A1A1 × A1A1     p_11^2
A1A1 × A1A2     2 p_11 p_12
A1A1 × A2A2     2 p_11 p_22
A1A2 × A1A2     p_12^2
A1A2 × A2A2     2 p_12 p_22
A2A2 × A2A2     p_22^2

Mendelian Segregation: Offspring Genotype Frequencies

                                Offspring
Mating          Frequency       A1A1    A1A2    A2A2
A1A1 × A1A1     p_11^2          1       0       0
A1A1 × A1A2     2 p_11 p_12     0.5     0.5     0
A1A1 × A2A2     2 p_11 p_22     0       1       0
A1A2 × A1A2     p_12^2          0.25    0.5     0.25
A1A2 × A2A2     2 p_12 p_22     0       0.5     0.5
A2A2 × A2A2     p_22^2          0       0       1
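Summing mating frequency times segregation probability over all mating types gives the offspring genotype frequencies. The sketch below does this for a two-allele locus under random mating and standard Mendelian segregation; the function name and starting frequencies are illustrative.

```python
def offspring_genotype_freqs(p11, p12, p22):
    """One generation of random mating at a two-allele locus (A1, A2)."""
    # (mating frequency under random mating, offspring probabilities for A1A1, A1A2, A2A2)
    matings = [
        (p11 * p11,     (1.0, 0.0, 0.0)),    # A1A1 x A1A1
        (2 * p11 * p12, (0.5, 0.5, 0.0)),    # A1A1 x A1A2
        (2 * p11 * p22, (0.0, 1.0, 0.0)),    # A1A1 x A2A2
        (p12 * p12,     (0.25, 0.5, 0.25)),  # A1A2 x A1A2
        (2 * p12 * p22, (0.0, 0.5, 0.5)),    # A1A2 x A2A2
        (p22 * p22,     (0.0, 0.0, 1.0)),    # A2A2 x A2A2
    ]
    return tuple(sum(freq * seg[k] for freq, seg in matings) for k in range(3))


# Parents far from equilibrium: allele frequency of A1 is 0.5 + 0.2 / 2 = 0.6
gen1 = offspring_genotype_freqs(0.5, 0.2, 0.3)
gen2 = offspring_genotype_freqs(*gen1)
```

Here `gen1` works out to p_1^2, 2 p_1 p_2, p_2^2 with p_1 = 0.6, and `gen2` repeats it, which is the one-generation equilibrium property.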

Required Assumptions
• Diploid (2 sets of DNA sequences), sexual organism
• Autosomal locus
• Large population
• Random mating
• Equal genotype frequencies among sexes
• Absence of natural selection

Conclusion: Hardy-Weinberg Equilibrium
• Allele frequencies and genotype ratios in a randomly-breeding population remain constant from generation to generation.
• Genotype frequencies are a function of allele frequencies.
  • Equilibrium reached in one generation
  • Independent of initial genotype frequencies
  • Random mating, etc. required
• Conform to binomial expansion.
  • (p_1 + p_2)^2 = p_1^2 + 2 p_1 p_2 + p_2^2
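For more than two alleles the same expansion applies term by term: homozygotes get p_a^2 and heterozygotes get 2 p_a p_b. A short sketch, assuming Hardy-Weinberg equilibrium holds (the function name and frequencies are illustrative):

```python
def hardy_weinberg_genotype_freqs(allele_freqs):
    """Map each unordered genotype (a, b) to its expected frequency under equilibrium."""
    alleles = sorted(allele_freqs)
    out = {}
    for i, a in enumerate(alleles):
        for b in alleles[i:]:
            if a == b:
                out[(a, b)] = allele_freqs[a] ** 2                   # homozygote
            else:
                out[(a, b)] = 2 * allele_freqs[a] * allele_freqs[b]  # heterozygote
    return out


two = hardy_weinberg_genotype_freqs({"C": 0.7, "G": 0.3})
three = hardy_weinberg_genotype_freqs({"A": 0.5, "B": 0.3, "C": 0.2})
```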

Outline
• Basic concepts
  • Allele, allele frequencies, genotype frequencies
  • Hardy-Weinberg Equilibrium
• Statistical methods for mapping QTL
  • What is QTL?
  • Experimental animals
  • Analysis of variance (marker regression)
  • Interval mapping

Quantitative Trait Locus (QTL)
• Definition of QTLs
  • The genomic regions that contribute to variation in a quantitative phenotype (e.g. blood pressure)
• Mapping QTLs
  • Finding QTLs from data
• Experimental animals
  • Backcross experiment (only 2 genotypes for all genes)
  • F2 intercross experiment

Backcross experiment
• Inbred strains
  • Homozygous genomes
• Advantage
  • Only two genotypes
• Disadvantage
  • Relatively less genetic diversity
[Figure: parental generation, first filial (F1) generation, backcross (X), gamete; genotypes AB, AA, AB]
Source: Karl Broman, Review of statistical methods for QTL mapping in experimental crosses

F2 intercross experiment
[Figure: parental generation, F1 generation, intercross (X), F2 generation, gametes; genotypes AA, BB, AB]
Source: Karl Broman, Review of statistical methods for QTL mapping in experimental crosses

Trait distributions: a classical view
[Figure: trait distributions]

QTL mapping
• Data
  • Phenotypes: y_i = trait value for mouse i
  • Genotypes: x_ik = 1/0 (i.e. AB/AA) of mouse i at marker k (backcross)
  • Genetic map: locations of genetic markers
• Goals
  • Identify the genomic regions (QTLs) contributing to variation in the phenotype.
  • Identify at least one QTL.
  • Form confidence interval for QTL location.
  • Estimate QTL effects.

The simplest method: ANOVA
• “Analysis of variance”: assumes the presence of a single QTL
• For each marker:
  • Split mice into groups according to their genotypes at that marker.
  • Do a t-test/F-statistic
  • Repeat for each typed marker
• The t-test/F-statistic tells us whether there is sufficient evidence to believe that measurements from one condition (i.e. genotype) are significantly different from another.
• LOD score (“Logarithm of the odds favoring linkage”) = log10 likelihood ratio, comparing the single-QTL model to the “no QTL anywhere” model.
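A marker-by-marker scan for a backcross can be sketched as below. It assumes complete genotype data, normally distributed residuals with a common variance, and uses the equal-variance two-sample t statistic; under those assumptions the LOD score reduces to (n/2) log10(RSS0/RSS1). The function name and the simulated data (unlinked markers, a QTL placed at marker 2) are illustrative only.

```python
import math
import random


def marker_scan(y, genotypes):
    """y[i]: phenotype of mouse i; genotypes[i][k]: 0/1 (AA/AB) at marker k.
    Returns one (t_statistic, lod) pair per marker."""
    n = len(y)
    mean_all = sum(y) / n
    rss0 = sum((v - mean_all) ** 2 for v in y)  # "no QTL" model: one common mean
    results = []
    for k in range(len(genotypes[0])):
        # Split mice into two groups by genotype at marker k
        groups = {0: [], 1: []}
        for yi, gi in zip(y, genotypes):
            groups[gi[k]].append(yi)
        means = {g: sum(vals) / len(vals) for g, vals in groups.items()}
        rss1 = sum((v - means[g]) ** 2 for g, vals in groups.items() for v in vals)
        # Equal-variance two-sample t statistic
        pooled_var = rss1 / (n - 2)
        se = math.sqrt(pooled_var * (1.0 / len(groups[0]) + 1.0 / len(groups[1])))
        t = (means[1] - means[0]) / se
        # log10 likelihood ratio of the single-QTL model vs the no-QTL model
        lod = (n / 2.0) * math.log10(rss0 / rss1)
        results.append((t, lod))
    return results


random.seed(1)
genotypes = [[random.randint(0, 1) for _ in range(5)] for _ in range(100)]
y = [2.0 * g[2] + random.gauss(0.0, 1.0) for g in genotypes]  # marker 2 carries the effect
scan = marker_scan(y, genotypes)
best_marker = max(range(5), key=lambda k: scan[k][1])
```

With two genotype groups the t statistic and the LOD score carry the same information, since RSS0/RSS1 = 1 + t^2/(n - 2).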

ANOVA at marker loci
• Advantages
  • Simple.
  • Easily incorporate covariates (e.g. environmental factors, sex, etc).
  • Easily extended to more complex models.
• Disadvantages
  • Must exclude individuals with missing genotype data.
  • Imperfect information about QTL location.
  • Suffers in low density scans.
  • Only considers one QTL at a time (assumes the presence of a single QTL).

Interval mapping [Lander and Botstein, 1989]
• Consider any one position in the genome as the location for a putative QTL.
• For a particular mouse, let z = 1/0 if (unobserved) genotype at QTL is AB/AA.
• Calculate P(z = 1 | marker data).
  • Need only consider nearby genotyped markers.
  • May allow for the presence of genotypic errors.
• Given genotype at the QTL, phenotype is distributed as N(µ + ∆z, σ^2).
• Given marker data, phenotype follows a mixture of normal distributions.
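At a single putative position, the mixture can be fitted with EM much like the Gaussian mixture earlier in the lecture, except that the mixing weight for each mouse is fixed at P(z = 1 | marker data). The sketch below is one way to do that under the slide's model N(µ + ∆z, σ^2), with µ_A = µ and µ_B = µ + ∆; the function name, starting values, and simulated data are assumptions made for illustration.

```python
import math
import random


def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def interval_mapping_em(y, p, n_iter=100):
    """y[i]: phenotype; p[i]: P(z_i = 1 | marker data) at the putative QTL position.
    Returns (mu_A, mu_B, sigma, lod)."""
    n = len(y)
    # Null model ("no QTL"): a single normal distribution
    mu0 = sum(y) / n
    sigma0 = math.sqrt(sum((v - mu0) ** 2 for v in y) / n)
    # Start from group means weighted by the marker-based probabilities
    mu_a = sum((1 - pi) * yi for yi, pi in zip(y, p)) / sum(1 - pi for pi in p)
    mu_b = sum(pi * yi for yi, pi in zip(y, p)) / sum(p)
    sigma = sigma0
    for _ in range(n_iter):
        # E-step: posterior P(z_i = 1 | y_i, marker data)
        w = []
        for yi, pi in zip(y, p):
            fb = pi * normal_pdf(yi, mu_b, sigma)
            fa = (1 - pi) * normal_pdf(yi, mu_a, sigma)
            w.append(fb / (fa + fb))
        # M-step: weighted group means and a common variance
        sw = sum(w)
        mu_b = sum(wi * yi for wi, yi in zip(w, y)) / sw
        mu_a = sum((1 - wi) * yi for wi, yi in zip(w, y)) / (n - sw)
        sigma = math.sqrt(
            sum(wi * (yi - mu_b) ** 2 + (1 - wi) * (yi - mu_a) ** 2 for wi, yi in zip(w, y)) / n
        )
    loglik1 = sum(
        math.log(pi * normal_pdf(yi, mu_b, sigma) + (1 - pi) * normal_pdf(yi, mu_a, sigma))
        for yi, pi in zip(y, p)
    )
    loglik0 = sum(math.log(normal_pdf(yi, mu0, sigma0)) for yi in y)
    return mu_a, mu_b, sigma, (loglik1 - loglik0) / math.log(10.0)


# Simulated backcross (illustrative): true effect 2.0, residual SD 1.0
random.seed(2)
p = [random.choice([0.99, 0.65, 0.35, 0.01]) for _ in range(300)]
z = [1 if random.random() < pi else 0 for pi in p]
y = [10.0 + 2.0 * zi + random.gauss(0.0, 1.0) for zi in z]
mu_a, mu_b, sigma, lod = interval_mapping_em(y, p)
delta = mu_b - mu_a
```

Repeating this fit at each position along the genome and plotting the LOD score gives the usual interval-mapping profile.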

IM: the mixture model
[Figure: nearest flanking markers M1 and M2 at positions 0 and 20, with the putative QTL at position 7. QTL genotype probabilities by M1/M2 genotype: 99% AB; 65% AB, 35% AA; 35% AB, 65% AA; 99% AA]
• Let’s say that the mice with QTL genotype AA have average phenotype µ_A while the mice with QTL genotype AB have average phenotype µ_B.
• The QTL has effect ∆ = µ_B - µ_A.
• What are unknowns?
  • µ_A and µ_B
  • Genotype of QTL
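The percentages on this slide can be reproduced with a short calculation, assuming the positions 0, 7, and 20 are in centimorgans, recombination follows the Haldane map function (no crossover interference), and there are no genotyping errors. These assumptions are not stated on the slide, but they give the same rounded values. Function names are hypothetical.

```python
import math


def haldane_recomb_fraction(d_cm):
    """Recombination fraction for a map distance in centimorgans (Haldane map function)."""
    return 0.5 * (1.0 - math.exp(-2.0 * d_cm / 100.0))


def prob_qtl_ab(m1, m2, pos_m1, pos_qtl, pos_m2):
    """P(QTL genotype is AB | flanking marker genotypes) in a backcross.
    m1 and m2 are "AA" or "AB"; positions are in cM with pos_m1 <= pos_qtl <= pos_m2."""
    r1 = haldane_recomb_fraction(pos_qtl - pos_m1)
    r2 = haldane_recomb_fraction(pos_m2 - pos_qtl)
    # Probability of the observed M1-QTL and QTL-M2 transitions if the QTL is AB, or AA
    p_ab = ((1 - r1) if m1 == "AB" else r1) * ((1 - r2) if m2 == "AB" else r2)
    p_aa = ((1 - r1) if m1 == "AA" else r1) * ((1 - r2) if m2 == "AA" else r2)
    return p_ab / (p_ab + p_aa)


table = {
    (m1, m2): prob_qtl_ab(m1, m2, 0.0, 7.0, 20.0)
    for m1 in ("AB", "AA")
    for m2 in ("AB", "AA")
}
```

Under these assumptions the four marker combinations AB/AB, AB/AA, AA/AB, and AA/AA give P(QTL = AB) of roughly 0.99, 0.65, 0.35, and 0.01, matching the slide's four rows in that order.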

References
• Prof Goncalo Abecasis (Univ of Michigan)’s lecture note
• Broman, K. W. Review of statistical methods for QTL mapping in experimental crosses.
• Doerge, R. W., et al. Statistical issues in the search for genes affecting quantitative traits in experimental populations. Stat. Sci. 12:195-219, 1997.
• Lynch, M. and Walsh, B. Genetics and analysis of quantitative traits. Sinauer Associates, Sunderland, MA, pp. 431-89, 1998.
• Broman, K. W., Speed, T. P. A review of methods for identifying QTLs in experimental crosses, 1999.