Models of Sequence Evolution

JC69: Jukes-Cantor (1969) model
Assumes all bases interchange with equal probabilities and that base frequencies are equal.

F81: Felsenstein (1981) model
Assumes all bases interchange with equal probabilities; base frequencies may vary.

HKY85: Hasegawa, Kishino, and Yano (1985) model
Assumes unequal transition and transversion probabilities; base frequencies may vary.

REV: General Time Reversible model
Assumes unequal probabilities for all substitution types; base frequencies may vary.
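The four models above can be viewed as one family of rate matrices with progressively fewer constraints. The following is a minimal sketch, assuming the usual parameterization q_ij = s_ij * pi_j (symmetric exchangeability s_ij times the frequency of the target base). The function names and the `kappa` parameter are illustrative choices, not taken from the slides.

```python
BASES = "ACGT"
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def rate_matrix(exch, freqs):
    """REV-style matrix: q_ij = s_ij * pi_j for i != j, rows sum to zero.

    exch maps frozenset({i, j}) to a symmetric exchangeability s_ij;
    freqs maps each base to its equilibrium frequency pi.
    """
    q = {}
    for i in BASES:
        row = {j: exch[frozenset((i, j))] * freqs[j] for j in BASES if j != i}
        row[i] = -sum(row.values())  # diagonal makes the row sum to zero
        q[i] = row
    return q

def hky85(kappa, freqs):
    """HKY85: transitions get exchangeability kappa, transversions get 1."""
    exch = {frozenset((i, j)): (kappa if (i, j) in TRANSITIONS else 1.0)
            for i in BASES for j in BASES if i != j}
    return rate_matrix(exch, freqs)

def f81(freqs):
    """F81: HKY85 with no transition/transversion difference."""
    return hky85(1.0, freqs)

def jc69():
    """JC69: F81 with equal base frequencies."""
    return f81({b: 0.25 for b in BASES})

# Hypothetical frequencies, chosen only for illustration.
freqs = {"A": 0.4, "C": 0.2, "G": 0.2, "T": 0.2}
Q = hky85(2.0, freqs)
for i in BASES:
    print(i, [round(Q[i][j], 3) for j in BASES])
```

Passing six arbitrary exchangeabilities to `rate_matrix` gives the REV case; each simpler model is obtained by constraining that input.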

Models applied to a dataset of 13 HIV pol sequences (273 nt)

Model    # par.   ln Lik
JC69     1        -1227.45
F81      4        -1187.54
K2P      2        -1210.49
HKY85    5        -1165.70
REV      9        -1151.63
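One way to read these log-likelihoods is through likelihood ratio tests between nested pairs of models. The sketch below is an illustration rather than part of the original analysis: it takes the parameter counts and ln Lik values from the slide, assumes the usual chi-squared approximation with degrees of freedom equal to the difference in parameter counts, and uses a small stdlib-only helper (`chi2_sf`, a name introduced here) for the tail probability.

```python
import math

# (number of parameters, ln likelihood) as reported on the slide
FITS = {
    "JC69": (1, -1227.45),
    "F81": (4, -1187.54),
    "K2P": (2, -1210.49),
    "HKY85": (5, -1165.70),
    "REV": (9, -1151.63),
}

def chi2_sf(x, df):
    """Upper-tail probability of a chi-squared variable with integer df."""
    if x <= 0.0:
        return 1.0
    half = x / 2.0
    # Start from df = 1 (odd) or df = 2 (even), then step df up by 2 using
    # Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1).
    if df % 2:
        q, a = math.erfc(math.sqrt(half)), 0.5
    else:
        q, a = math.exp(-half), 1.0
    while 2 * a < df:
        q += math.exp(a * math.log(half) - half - math.lgamma(a + 1.0))
        a += 1.0
    return q

def lrt(null, alt):
    """Return (statistic, df, p-value) for a nested null vs. alternative."""
    k0, lnl0 = FITS[null]
    k1, lnl1 = FITS[alt]
    stat = 2.0 * (lnl1 - lnl0)
    df = k1 - k0
    return stat, df, chi2_sf(stat, df)

NESTED_PAIRS = [("JC69", "F81"), ("JC69", "K2P"), ("K2P", "HKY85"),
                ("F81", "HKY85"), ("HKY85", "REV")]
for null, alt in NESTED_PAIRS:
    stat, df, p = lrt(null, alt)
    print(f"{null} vs {alt}: 2*dlnL = {stat:.2f}, df = {df}, p = {p:.3g}")
```

F81 and K2P are not nested in each other, so that pair is left out of the comparison.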

Other Forms of Rate Heterogeneity
• Variation from gene to gene
• Variation from site to site within a gene
  – Synonymous vs nonsynonymous rates
  – Spatial rate heterogeneity
• Variation from lineage to lineage
• Correlations among sets of sites

Site-to-site rate heterogeneity

Gamma models of site-to-site heterogeneity
Hierarchical model:
• Evolutionary rates at individual sites are drawn from a gamma distribution
• Given the rate at a particular site, sequence evolution follows one of the previously discussed models
Original idea: Uzzell and Corbin (1971)
First likelihood treatment was by Yang (1993)

REV+G: General Time Reversible Model with Gamma Rate Heterogeneity (Yang 1993)

“All” we are doing here is integrating the likelihood function over all possible rate values, with those values being weighted according to probabilities assigned by the gamma distribution.
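The slide's own equation did not survive in this transcript. A hedged reconstruction of the integral it describes, assuming independent sites s = 1, ..., S, a site likelihood L_s conditional on the rate r, and a gamma density g with shape α scaled to have mean 1 (all symbol names are assumptions made here), is:

```latex
L(\theta, \alpha) \;=\; \prod_{s=1}^{S} \int_{0}^{\infty} L_s(\theta \mid r)\, g(r; \alpha)\, dr,
\qquad
g(r; \alpha) \;=\; \frac{\alpha^{\alpha}}{\Gamma(\alpha)}\, r^{\alpha - 1} e^{-\alpha r}.
```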

• Calculating the integral in the continuous case is very expensive.
• Yang (1994) suggested “discretizing” the gamma distribution and using the discrete form of the likelihood function.
• The cost of calculation increases only linearly with the number of rate categories, N.
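The discretization can be sketched as follows. This is a minimal stdlib-only illustration, assuming N equal-probability categories, each represented by its mean rate, and a gamma distribution with shape alpha and mean 1. The helper names are introduced here, and the incomplete gamma function is computed with a plain power series that is adequate for moderate arguments but is not a production-quality routine.

```python
import math

def reg_lower_gamma(a, x, max_terms=1000):
    """Regularized lower incomplete gamma P(a, x) via its power series."""
    if x <= 0.0:
        return 0.0
    term = 1.0 / a
    total = term
    for n in range(1, max_terms):
        term *= x / (a + n)
        total += term
        if term < total * 1e-16:
            break
    return total * math.exp(a * math.log(x) - x - math.lgamma(a))

def gamma_cdf(r, alpha):
    """CDF of a gamma distribution with shape alpha and rate alpha (mean 1)."""
    return reg_lower_gamma(alpha, alpha * r)

def gamma_quantile(p, alpha):
    """Invert the CDF by bisection."""
    lo, hi = 0.0, 1.0
    while gamma_cdf(hi, alpha) < p:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if gamma_cdf(mid, alpha) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def discrete_gamma_rates(alpha, n_cat):
    """Mean rate within each of n_cat equal-probability categories."""
    cuts = [0.0] + [gamma_quantile(i / n_cat, alpha) for i in range(1, n_cat)]
    cuts.append(math.inf)
    rates = []
    for a, b in zip(cuts[:-1], cuts[1:]):
        # The integral of r * g(r) over (a, b) is a difference of
        # incomplete gamma functions with shape alpha + 1.
        upper = 1.0 if math.isinf(b) else reg_lower_gamma(alpha + 1.0, alpha * b)
        lower = reg_lower_gamma(alpha + 1.0, alpha * a)
        rates.append(n_cat * (upper - lower))
    return rates

def site_likelihood(rates, cond_lik):
    """Average a per-rate site likelihood over equal-weight categories."""
    return sum(cond_lik(r) for r in rates) / len(rates)

rates = discrete_gamma_rates(alpha=0.5, n_cat=4)
print([round(r, 4) for r in rates])
```

Each site likelihood then costs N evaluations of the underlying model instead of an integral, which is where the linear scaling in N comes from.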

13 HIV pol sequences

4 alpha-spectrin sequences

Correlations among sites
• Codons
• Dinucleotides (secondary structure)
General idea: move from 4-state nucleotide models to 16- or 64-state models.

MG94: Muse and Gaut (1994)

GY94: Goldman and Yang (1994)

• This approach can be combined with any nucleotide model
• Rate heterogeneity can be added in the same way as with nucleotide models
• Accounts for correlations among nucleotide sites within codons
• Avoids the problematic notion of “degeneracy classes”
• Necessary for rigorous estimates of synonymous and nonsynonymous substitution rates

Muse and Gaut (1994) model
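The rate matrix shown on this slide is not preserved in the transcript. As a sketch, the commonly cited form of the MG94 model sets the rate from codon i to codon j to alpha * pi_n for a synonymous single-nucleotide change, beta * pi_n for a nonsynonymous one (pi_n being the frequency of the target nucleotide), and 0 when more than one position differs. The code below assumes that form and the standard genetic code; the function and variable names are illustrative.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
SENSE_CODONS = [c for c in CODE if CODE[c] != "*"]  # stop codons excluded

def mg94_rate(codon_i, codon_j, alpha, beta, pi):
    """Off-diagonal MG94-style rate from codon_i to codon_j.

    alpha: synonymous rate, beta: nonsynonymous rate,
    pi: dict of nucleotide frequencies (frequency of the target base is used).
    """
    diffs = [k for k in range(3) if codon_i[k] != codon_j[k]]
    if len(diffs) != 1:
        return 0.0  # identical codons or multi-nucleotide changes
    if CODE[codon_i] == "*" or CODE[codon_j] == "*":
        return 0.0  # changes to or from stop codons are not allowed
    target = codon_j[diffs[0]]
    rate = alpha if CODE[codon_i] == CODE[codon_j] else beta
    return rate * pi[target]

# Hypothetical parameter values, for illustration only.
pi = {"T": 0.25, "C": 0.25, "A": 0.25, "G": 0.25}
print(len(SENSE_CODONS))
print(mg94_rate("TTT", "TTC", alpha=1.0, beta=0.3, pi=pi))  # synonymous
print(mg94_rate("TTT", "TTA", alpha=1.0, beta=0.3, pi=pi))  # nonsynonymous
```

Diagonal entries would be set so that each row of the full matrix over sense codons sums to zero.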

Consider the result of allowing gamma variation in rates over codons:
• While they are allowed to have different magnitudes, the two classes of rates (synonymous and nonsynonymous) have the same distribution.
• Synonymous rates are likely to be less variable over sites than are nonsynonymous rates.

Site-to-site rate variation, transitions and transversions with independent distributions

Muse and Gaut (1994) modification

Each site in the sequence has a (random) synonymous rate and a (random) nonsynonymous rate, drawn from some bivariate distribution f. The likelihood is again integrated with respect to f.

As before, the likelihood function is discretized for computational feasibility. The discretization process can be tricky in general. In all that follows, we assume that the synonymous and nonsynonymous rates are independent gamma random variables, which allows discretization of each axis separately.
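Under the independence assumption, the discretized site likelihood becomes a double sum over synonymous and nonsynonymous rate categories. The sketch below assumes equal-probability categories on each axis and takes the conditional likelihood as a caller-supplied function; `cond_lik` and the example rate values are placeholders, not part of the original model description.

```python
import math

def site_likelihood(syn_rates, nonsyn_rates, cond_lik):
    """Average cond_lik(a, b) over all pairs of rate categories.

    With independent, equal-probability categories, every (synonymous,
    nonsynonymous) pair has weight 1 / (Ns * Nn).
    """
    n_pairs = len(syn_rates) * len(nonsyn_rates)
    total = sum(cond_lik(a, b) for a in syn_rates for b in nonsyn_rates)
    return total / n_pairs

def log_likelihood(sites, syn_rates, nonsyn_rates, cond_lik_for_site):
    """Sum of log site likelihoods, treating sites as independent."""
    return sum(
        math.log(site_likelihood(syn_rates, nonsyn_rates, cond_lik_for_site(s)))
        for s in sites
    )

# Toy stand-in for a real codon-model likelihood, for illustration only.
toy = lambda a, b: math.exp(-(a + b))
print(site_likelihood([0.5, 1.5], [0.2, 1.8], toy))
```

The cost per site is the product of the two category counts, so the number of categories on each axis is usually kept small.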

Goals of molecular evolutionary analyses
• Understand the structure of the evolutionary model's parameters
  – Are parameter values (i.e., rates) equal among different branches?
  – Is the structure of the substitution process (e.g., TS/TV ratio) the same for different branches?
  – Are the values of such parameters “related” among different genes?

Likelihood function for homologous DNA sequences
The likelihood is a function of the collection of all parameters affecting the evolution of sequences A, B, and G, given the collection of all data (sequences A, B, and G).

Relative Rate Tests
Are evolutionary rates the same along two lineages?
[Figure: three-taxon tree with tips A, B, and G]

Versions of the relative rate test
• Distance based (Wu and Li 1985)
  – For 2 clades (Li and Bousquet 1992)
• Likelihood ratio (Muse and Weir 1992)
• Nonparametric (Tajima 1993)

Likelihood-based RR Test
• Maximize L assuming equal rates along the lineages leading to A and B
• Maximize L without constraints
• The LRT statistic has a chi-squared distribution if rates are equal
Note: Use of the outgroup (G) ensures that the unknown divergence time is irrelevant.

Distance-based RR Tests
[Figure: three-taxon tree; branches a, b, and g lead to tips A, B, and G]
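With three taxa, the branch lengths follow directly from the pairwise distances, and the quantity of interest is the difference a - b = d(A,G) - d(B,G). The sketch below shows only that arithmetic, assuming additive distances on the tree in the figure; the function name is illustrative, and the variance of the difference, which an actual test needs, is not computed.

```python
def three_taxon_branches(d_ab, d_ag, d_bg):
    """Solve d_ab = a + b, d_ag = a + g, d_bg = b + g for (a, b, g)."""
    a = (d_ab + d_ag - d_bg) / 2.0
    b = (d_ab + d_bg - d_ag) / 2.0
    g = (d_ag + d_bg - d_ab) / 2.0
    return a, b, g

# Hypothetical pairwise distances, for illustration only.
a, b, g = three_taxon_branches(d_ab=0.4, d_ag=0.6, d_bg=0.8)
print(a, b, g, a - b)
```

Equal rates along the two lineages correspond to a - b = 0, so a test compares the observed difference with its standard error.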

Nonparametric RR Test
[Figure: three-taxon tree with tips A, B, and G]
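A minimal sketch of a Tajima (1993)-style test, assuming three aligned, gap-free sequences of equal length with G as the outgroup: count the sites where only A differs (m1) and the sites where only B differs (m2), then compare (m1 - m2)^2 / (m1 + m2) with a chi-squared distribution with 1 degree of freedom. The function name and example sequences are made up for illustration.

```python
import math

def tajima_relative_rate(seq_a, seq_b, seq_g):
    """Return (m1, m2, statistic, p-value) for a three-sequence rate test.

    m1 counts sites where A differs while B matches the outgroup G;
    m2 counts sites where B differs while A matches G.
    """
    if not (len(seq_a) == len(seq_b) == len(seq_g)):
        raise ValueError("sequences must be aligned to the same length")
    m1 = m2 = 0
    for x, y, z in zip(seq_a, seq_b, seq_g):
        if x != y:
            if y == z:
                m1 += 1  # change unique to lineage A
            elif x == z:
                m2 += 1  # change unique to lineage B
    if m1 + m2 == 0:
        return m1, m2, 0.0, 1.0
    stat = (m1 - m2) ** 2 / (m1 + m2)
    # Chi-squared upper tail with 1 df: P(X > s) = erfc(sqrt(s / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return m1, m2, stat, p_value

print(tajima_relative_rate("TTTTACGT", "AAAAACGT", "AAAAACGT"))
```

Sites where all three sequences differ, or where A and B agree, carry no information about the rate difference and are ignored.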

[Figure: ndhF and rbcL; nonsynonymous (Nonsyn) and synonymous (Syn) panels]

[Figure: Locus A and Locus B, illustrating a locus effect and a lineage × locus effect]

Relative Ratio Tests
• Are “branch lengths” proportional among loci?
• Muse and Gaut 1997; Muse et al. 1997; Huelsenbeck et al. 1997; Yang 1995

Relative Ratio Test
The null hypothesis is that the relative proportions of branch lengths are the same for all loci. The proportion need not be known a priori.