Introduction to Haplotype Estimation Stat/Biostat 550

The Haplotype Problem • Suppose we genotype individuals at a number of tightly linked SNPs. [Figure: example alleles (A, C, G, T) observed at the genotyped SNPs]

The Haplotype Problem • What do the types on the two chromosomes look like?

Haplotypes: who cares?
Many people, for many different reasons…
• LD mapping: increase power?
• LD mapping: decrease genotyping?
• Evolutionary studies: selection, recombination, gene conversion, population structure, …

The Haplotype Problem – potential solutions
• Molecular methods
• Collect family data
• Statistical methods for population data

The Simplest Case • What do the types on the two chromosomes look like?

The Next Simplest Case • What do the types on the two chromosomes look like?

The first difficult case… • What do the types on the two chromosomes look like?

Clark’s Method (1990) • Idea: use information obtained from other individuals in the population to determine the most probable haplotype pair.

[Figure: panels labelled 1, 2, 3] Is it this configuration?

[Figure: panels labelled 1, 2, 3] …or this one?

[Figure: panels labelled 1, 2, 3] This one is more probable.

Clark’s Method (Clark, 1990)
• Identify the unambiguous individuals.
• Make a list of “known” haplotypes.
• Go through the list and check whether each ambiguous individual can be made up of a “known” haplotype plus another “complementary” haplotype. If so, add the complementary haplotype to the list of “known” haplotypes.
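
The steps above can be sketched in Python. This is a minimal illustration, not Clark's original implementation. The 0/1/2 genotype coding (0 and 2 homozygous, 1 heterozygous), the function names `complement` and `clark`, and the "first match wins" scan of the list are all assumptions made for the example.

```python
def complement(genotype, hap):
    """Return the haplotype that pairs with `hap` to give `genotype`, or None.

    Assumed coding: genotype entries are 0/2 (homozygous) or 1 (heterozygous);
    haplotype entries are 0/1.
    """
    comp = tuple(g - h for g, h in zip(genotype, hap))
    return comp if all(c in (0, 1) for c in comp) else None


def clark(genotypes):
    """Resolve genotypes by exact matching against a growing list of haplotypes."""
    known = []      # ordered list of "known" haplotypes
    resolved = {}   # individual index -> (haplotype, haplotype)

    # Step 1: individuals with at most one heterozygous site are unambiguous.
    for i, g in enumerate(genotypes):
        if sum(x == 1 for x in g) <= 1:
            h1 = tuple(1 if x == 2 else 0 for x in g)
            h2 = complement(g, h1)
            resolved[i] = (h1, h2)
            for h in (h1, h2):
                if h not in known:
                    known.append(h)

    # Steps 2-3: keep sweeping until no further individual can be resolved.
    progress = True
    while progress:
        progress = False
        for i, g in enumerate(genotypes):
            if i in resolved:
                continue
            for h in known:
                comp = complement(g, h)
                if comp is not None:        # exact match with a known haplotype
                    resolved[i] = (h, comp)
                    if comp not in known:
                        known.append(comp)
                    progress = True
                    break
    return resolved, known


resolved, known = clark([(0, 0, 0), (1, 1, 0), (2, 1, 1)])
```

Because the list is scanned in order and the first match is accepted, reordering the input can change the result, and an individual with no exact match is simply left out of `resolved`.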

Clark’s Method [Figure: panels labelled 1, 2, 3; list of known haplotypes]

Clark’s Method: Problem 1 [Figure: panels labelled 1, 2, 3; list of known haplotypes]
• The answer depends on the order in which the list is considered…
• … and frequency information is ignored.

Clark’s Method: Problem 2 [Figure: panels labelled 1, 2, 3; list of known haplotypes]
• The algorithm can fail to resolve all haplotypes…
• … because it looks only for exact matches.

Clark’s Algorithm: Summary
• Results may depend on the order in which individuals are considered.
• Frequency information is ignored.
• May fail to resolve all haplotypes.
• Fails to assess uncertainty.
• Looks only for exact matches.
• Fast and intuitive(?).

Maximum Likelihood (EM Algorithm) • Idea: find haplotype frequencies $(f_1, \dots, f_N)$ to maximise the probability of the observed genotype data $(g_1, \dots, g_n)$.
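
A minimal sketch of the idea, assuming random mating (Hardy-Weinberg proportions), the 0/1/2 genotype coding (1 = heterozygous), and a uniform starting point. The function names `compatible_pairs` and `em_haplotype_freqs` and the fixed iteration count are choices made for this example only.

```python
from collections import defaultdict
from itertools import product


def compatible_pairs(genotype):
    """All unordered haplotype pairs consistent with one 0/1/2-coded genotype."""
    het = [i for i, x in enumerate(genotype) if x == 1]
    base = [1 if x == 2 else 0 for x in genotype]
    pairs = set()
    for bits in product((0, 1), repeat=len(het)):
        h1, h2 = list(base), list(base)
        for site, b in zip(het, bits):
            h1[site], h2[site] = b, 1 - b
        pairs.add(tuple(sorted((tuple(h1), tuple(h2)))))
    return sorted(pairs)


def em_haplotype_freqs(genotypes, n_iter=50):
    """Estimate haplotype frequencies by EM under Hardy-Weinberg proportions."""
    pair_lists = [compatible_pairs(g) for g in genotypes]
    haps = sorted({h for pairs in pair_lists for pair in pairs for h in pair})
    freq = {h: 1.0 / len(haps) for h in haps}   # uniform start (an assumption)
    for _ in range(n_iter):
        counts = defaultdict(float)
        for pairs in pair_lists:
            # E-step: posterior weight of each compatible pair.
            w = [freq[a] * freq[b] * (1 if a == b else 2) for a, b in pairs]
            total = sum(w)
            for (a, b), wi in zip(pairs, w):
                counts[a] += wi / total
                counts[b] += wi / total
        # M-step: expected haplotype counts divided by 2n chromosomes.
        freq = {h: counts[h] / (2 * len(genotypes)) for h in haps}
    return freq


freq = em_haplotype_freqs([(0, 0), (2, 2), (1, 1)])
```

The number of candidate haplotypes grows exponentially with the number of heterozygous sites, so explicit enumeration like this is only practical for a handful of SNPs.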

Bayesian version
Modify Clark’s algorithm:
• Replace the single pass through the data with an iterative scheme.
• Allow for uncertainty in resolution.
• Use frequency information.
The resulting “naïve Gibbs sampler” produces results similar to EM (Stephens, Smith and Donnelly 2001).
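
One way to picture such an iterative scheme is the sketch below. It is an illustration under stated assumptions rather than the published sampler: each individual's pair is redrawn with weight proportional to how often its two haplotypes currently appear in everyone else, plus a small pseudo-count `pseudo` standing in for the prior. The names `naive_gibbs` and `compatible_pairs`, the 0/1/2 genotype coding, and the default settings are hypothetical.

```python
import random
from collections import Counter
from itertools import product


def compatible_pairs(genotype):
    """All unordered haplotype pairs consistent with one 0/1/2-coded genotype."""
    het = [i for i, x in enumerate(genotype) if x == 1]
    base = [1 if x == 2 else 0 for x in genotype]
    pairs = set()
    for bits in product((0, 1), repeat=len(het)):
        h1, h2 = list(base), list(base)
        for site, b in zip(het, bits):
            h1[site], h2[site] = b, 1 - b
        pairs.add(tuple(sorted((tuple(h1), tuple(h2)))))
    return sorted(pairs)


def naive_gibbs(genotypes, n_sweeps=500, pseudo=0.1, seed=0):
    """Return, per individual, the fraction of sweeps spent in each pair."""
    rng = random.Random(seed)
    pair_lists = [compatible_pairs(g) for g in genotypes]
    state = [rng.choice(pairs) for pairs in pair_lists]   # random start
    counts = Counter(h for pair in state for h in pair)
    tallies = [Counter() for _ in genotypes]
    for _ in range(n_sweeps):
        for i, pairs in enumerate(pair_lists):
            counts.subtract(state[i])        # leave individual i out
            # Frequency information: common haplotypes get more weight.
            weights = [(counts[a] + pseudo) * (counts[b] + pseudo)
                       for a, b in pairs]
            state[i] = rng.choices(pairs, weights=weights)[0]
            counts.update(state[i])
            tallies[i][state[i]] += 1
    return [{pair: n / n_sweeps for pair, n in t.items()} for t in tallies]


post = naive_gibbs([(0, 0)] * 3 + [(2, 2)] * 3 + [(1, 1)])
```

The returned fractions give a rough measure of uncertainty for each individual's resolution, which a single pass through the data cannot provide.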

Example [Figure: panels labelled 1, 2, 3; list of known haplotypes, labelled 3 and 1]
• Matches 1 known / does not match any: assigned moderate probability.
• Matches 3 known / does not match any: assigned higher probability.
• Does not match any: assigned low probability.

Problems with EM/naïve Gibbs
• Potentially (very) large number of parameters to estimate, leading to inaccurate estimates.
• Can be time-consuming for large problems.
• Can “converge” to poor local optima (alleviated by multiple runs).

Further modification • Take into account “near misses” as well as exact matches (PHASE v1.0: Stephens, Smith and Donnelly 2001).

Example [Figure: panels labelled 1, 2, 3; list of known haplotypes, labelled 3 and 1]
• Matches 1 known / differs by 2 from 3 known.
• Matches 3 known / differs by 2 from 1 known.
• Differs by 1 from 3 known / differs by 1 from 1 known.
How to balance these possibilities?

The key question • What is the conditional distribution of the next haplotype, given a set of known haplotypes?

Example [Figure: haplotypes labelled 1 and 2] • Given the haplotypes shown, what would you expect the next haplotype to look like?

Qualitative answer
• The next haplotype will likely differ by a small number of mutations (possibly 0 mutations) from a (randomly-chosen) existing haplotype.
• Use theory (Ewens sampling formula; coalescent theory) to roughly quantify the distribution of the “small number”.
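
The qualitative answer can be turned into a small sampling sketch. The assumptions here are illustrative rather than taken from the slides: haplotypes are 0/1 tuples, a mutation flips one uniformly chosen site, and the number of mutations is geometric with continuation probability theta / (n + theta), where `theta` is a hypothetical scaled mutation rate and n is the number of known haplotypes.

```python
import random


def sample_next_haplotype(known, theta=1.0, rng=random):
    """Draw one haplotype given `known`, a list (with repeats) of 0/1 tuples."""
    n = len(known)
    # Copy a randomly chosen existing haplotype; repeats in the list make
    # common haplotypes more likely to be copied.
    hap = list(rng.choice(known))
    # Geometric number of mutations: P(s) = (theta/(n+theta))**s * n/(n+theta).
    while rng.random() < theta / (n + theta):
        site = rng.randrange(len(hap))
        hap[site] = 1 - hap[site]      # biallelic SNP: a mutation flips the allele
    return tuple(hap)


rng = random.Random(1)
known = [(0, 0, 0, 0)] * 9 + [(1, 1, 1, 1)]
draws = [sample_next_haplotype(known, theta=0.5, rng=rng) for _ in range(2000)]
```

With many known haplotypes and a small `theta`, most draws are exact copies and the rest are near misses, which is the behaviour the slide describes.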

Comparisons on simulated data

Problems
• Time-consuming for large problems.
• Can “converge” to poor local optima.
• Ignores recombination (decay of LD with distance).
• How should uncertainty in haplotype estimates be treated?

… to be continued.