Genome Evolution Amos Tanay 2009 Genome evolution Lecture

Genome Evolution. Amos Tanay 2009 Genome evolution Lecture 6: Mutations and variational inference

Bayesian inference vs. maximum likelihood estimator
• Introducing prior beliefs on the process (alternatively: think of virtual evidence)
• Computing posterior probabilities on the parameters
• No prior beliefs: MLE
• With prior beliefs: MAP (maximum a posteriori) or PME (posterior mean estimate)
[Figure: the estimates shown in parameter space]
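The difference between the three estimators can be made concrete with a small sketch. The slide does not name a model, so this example assumes a Bernoulli process with k successes in n trials and a Beta(a, b) prior; the function name and the numbers are illustrative only.

```python
def bernoulli_estimates(k, n, a=1.0, b=1.0):
    """Return (MLE, MAP, PME) for a Bernoulli parameter.

    Assumes a Beta(a, b) prior, so the posterior is Beta(k + a, n - k + b).
    The MAP formula is the posterior mode and requires both posterior
    parameters to be greater than 1.
    """
    mle = k / n                               # no prior beliefs
    map_est = (k + a - 1) / (n + a + b - 2)   # mode of the posterior
    pme = (k + a) / (n + a + b)               # mean of the posterior
    return mle, map_est, pme


# Illustrative data: 3 successes in 10 trials, with a Beta(2, 2) prior.
mle, map_est, pme = bernoulli_estimates(k=3, n=10, a=2.0, b=2.0)
```

With a uniform Beta(1, 1) prior the MAP estimate coincides with the MLE, which matches the "no prior beliefs" case on the slide.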

KL-divergence
• Entropy (Shannon)
• Kullback-Leibler divergence: not a metric!
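A short sketch of the two quantities, using the standard definitions H(p) = -sum_x p(x) log p(x) and D(p || q) = sum_x p(x) log(p(x) / q(x)). The distributions p and q below are made-up examples; the point is that D(p || q) and D(q || p) differ, which is one reason the divergence is not a metric.

```python
import math


def entropy(p):
    """Shannon entropy in nats; terms with p(x) = 0 contribute nothing."""
    return -sum(px * math.log(px) for px in p if px > 0)


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in nats."""
    total = 0.0
    for px, qx in zip(p, q):
        if px > 0:
            if qx == 0:
                return math.inf  # p puts mass where q has none
            total += px * math.log(px / qx)
    return total


# Two illustrative distributions over three outcomes.
p = [0.5, 0.4, 0.1]
q = [0.8, 0.1, 0.1]

d_pq = kl_divergence(p, q)
d_qp = kl_divergence(q, p)  # generally different from d_pq
```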

Expectation-Maximization (EM)
• EM maximization (Dempster)
• Relative entropy ≥ 0
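The role of the non-negative relative entropy can be sketched with the standard EM argument. The notation is assumed here rather than taken from the slide: s is the observed data, h the hidden variables, θ' the current parameters, θ the new ones, and Q the expected complete-data log-likelihood.

```latex
\begin{align*}
Q(\theta \mid \theta') &= \sum_h p(h \mid s, \theta') \log p(h, s \mid \theta) \\
\log p(s \mid \theta) &= Q(\theta \mid \theta') - \sum_h p(h \mid s, \theta') \log p(h \mid s, \theta) \\
\log p(s \mid \theta) - \log p(s \mid \theta')
  &= Q(\theta \mid \theta') - Q(\theta' \mid \theta')
     + D\bigl(p(h \mid s, \theta') \,\|\, p(h \mid s, \theta)\bigr) \\
  &\ge Q(\theta \mid \theta') - Q(\theta' \mid \theta')
\end{align*}
```

Under these definitions, any θ that increases Q cannot decrease the log-likelihood, because the relative entropy term is never negative.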

Expectation-Maximization
• Decompose over alignment positions
• Group terms with the same free parameter (the weights are essentially the posteriors of the parent-child pairs: prove it!)

Terminology: do you know how to define these by now?
• Inference
• Parameter learning
• Likelihood
• Total probability / marginal probability
• Exact inference / approximate inference
Sampling is a natural way to do approximate inference:
• Marginal probability (integration over all space)
• Marginal probability (integration over a sample)
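The contrast between integrating over all space and integrating over a sample can be illustrated on a toy model. Everything below is a hypothetical example, not a model from the lecture: one hidden binary variable h with prior P_H, and an observation s with the conditional probabilities in P_S_GIVEN_H.

```python
import random

# Hypothetical toy model: P(h = True) and P(s = 1 | h).
P_H = 0.3
P_S_GIVEN_H = {True: 0.9, False: 0.2}


def exact_marginal():
    """P(s = 1), summing over every value of the hidden variable."""
    return P_H * P_S_GIVEN_H[True] + (1 - P_H) * P_S_GIVEN_H[False]


def sampled_marginal(n_samples, seed=0):
    """Estimate P(s = 1) by averaging P(s = 1 | h) over samples of h."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        h = rng.random() < P_H      # draw h from its prior
        total += P_S_GIVEN_H[h]
    return total / n_samples


exact = exact_marginal()
estimate = sampled_marginal(20000)
```

In a model this small the exact sum is trivial; sampling becomes useful when the hidden space is too large to enumerate.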

Sources of mutations
• Mistakes
  – Replication errors (point mutations)
  – Recombination errors (mainly indels)
• Endogenous DNA damage
  – Spontaneous base damage: deaminations, depurinations
  – Byproducts of metabolism: oxygen radicals that damage DNA
• Exogenous DNA damage
  – UV
  – Chemicals
All of these mechanisms cross-talk with the surrounding sequence

DNA polymerases
• Replicating DNA
• A good polymerase domain has a misincorporation rate of 10^-5 (1/100,000)
• Any misincorporations are clipped off with 99% efficiency by the “proofreading” activity of the polymerase
• Further mismatch repair, which works in 99.9% of cases, brings the error rate of the main polymerases to 10^-10
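The arithmetic behind the 10^-10 figure is a product of the fraction of errors that escape each stage. The sketch below uses the three numbers from the slide; the function name and the genome length of 3e9 bases are illustrative assumptions, not values from the lecture.

```python
import math

# Values stated on the slide.
misincorporation_rate = 1e-5
proofreading_efficiency = 0.99
mismatch_repair_efficiency = 0.999


def surviving_error_rate(base_rate, *repair_efficiencies):
    """Per-base error rate left after each repair stage misses a fraction of errors."""
    rate = base_rate
    for efficiency in repair_efficiencies:
        rate *= 1.0 - efficiency
    return rate


after_proofreading = surviving_error_rate(
    misincorporation_rate, proofreading_efficiency)
final_rate = surviving_error_rate(
    misincorporation_rate, proofreading_efficiency, mismatch_repair_efficiency)

# Assumed genome length, for illustration only.
genome_length = 3e9
expected_errors_per_replication = final_rate * genome_length
```

Proofreading alone brings the rate to about 10^-7, and mismatch repair removes a further factor of 1000.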

Recombination errors
• A consequence of partial homology between different chromosomal loci
• Can introduce translocations if the matching sequences are on different chromosomes
• Can introduce inversions or deletions if the matching sequences are on the same chromosome
• Can generate duplications or deletions if the matching sequences are in tandem

Endogenous DNA damage: deamination of cytosines
[Figure: deamination converts cytosine to uracil; thymine has a CH3 group at the position marked * on uracil]

• Deamination of cytosine creates a G-U mismatch: easy to tell that U is wrong
• Deamination of 5-methylcytosine creates a G-T mismatch: not easy to tell which base is the mutation. About 50% of the time the G is “corrected” to A, resulting in a mutation

Exogenous DNA damage
• Chemicals
  – Food
  – Benzopyrene (smoke)
• UV radiation (sunlight)
• Ionizing radiation
  – Radon
  – Cosmic rays
  – X-rays
UV irradiation generates primarily thymine dimers

Repairing DNA damage
• Direct repair

Thymine dimers can be corrected by a direct repair mechanism
[Figure: direct repair of a thymine dimer, labeled "Photon"]

BER
• Deaminated bases are repaired by a base excision mechanism

BER
• Spontaneously occurring abasic sites are repaired by the same mechanism

NER
• Dimeric bases and bulky lesions, e.g., large chemical adducts, are repaired by nucleotide excision repair

Adaptive mutations: Cairns et al. 88
• Luria-Delbrück’s observation
• Experimental system: lacZ frameshift
• The experiment suggests adaptive mutations

The “Mutator” paradigm
• The ability to switch to the mutator phenotype depends on particular DNA repair mechanisms (double strand break repair in E. coli)
• The mutator phenotype is suggested to be important in pathogenesis, antibiotic resistance, and cancer
• Species occasionally change (adaptively or even by drift) their repair policy/efficiency
• The resulting substitution landscape must be very complex

Dynamic Bayesian Networks
[Figure: a network over variables 1-4, each with its conditional probabilities, unrolled over time steps T=1 to T=5]
• Synchronous discrete time process

Context dependent Markov processes
[Figure: variables 1-4 with context dependencies, and example nucleotide trajectories]
• Context determines a Markov process rate matrix
• Any dependency structure makes sense, including loops
• When the context is changing, computing probabilities is difficult
• Think of the hidden variables as the trajectories
• Continuous time Bayesian networks (Koller-Nodelman 2002)

Modeling simple context in the tree: PhyloHMM
[Figure: hidden variables h at neighboring positions j-1, j, j+1 for node i, its parent pa(i), and node k]
• Heuristically approximating the Markov process?
• Where exactly does it fail?
(Siepel-Haussler 2003)

Log-likelihood to free energy
• We have so far worked on computing the likelihood: [equation not preserved]
• Computing the likelihood is hard. We can reformulate the problem by adding parameters and transforming it into an optimization problem
• Given a trial function q, define the free energy of the model as: [equation not preserved]
• The free energy is exactly the likelihood when q is the posterior
• Better: when q is a distribution, the free energy bounds the likelihood
[Figure: the likelihood, the free energy, and the gap D(q || p(h|s)) between them]
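The two claims about the free energy can be checked numerically. The sketch below assumes the common form F(q) = sum_h q(h) log(p(h, s) / q(h)), which gives F(q) = log p(s) - D(q || p(h|s)); the slide's own formula is not preserved, so this form and the toy joint table are assumptions.

```python
import math

# Hypothetical joint p(h, s) for one fixed observation s and three hidden states h.
joint = [0.10, 0.25, 0.05]

likelihood = sum(joint)                       # p(s) = sum_h p(h, s)
posterior = [p / likelihood for p in joint]   # p(h | s)


def free_energy(q):
    """F(q) = sum_h q(h) log(p(h, s) / q(h)), assumed sign convention."""
    return sum(qh * math.log(ph / qh) for qh, ph in zip(q, joint) if qh > 0)


def kl(q, p):
    """Relative entropy D(q || p)."""
    return sum(qh * math.log(qh / ph) for qh, ph in zip(q, p) if qh > 0)


q_trial = [0.6, 0.3, 0.1]                     # an arbitrary trial distribution
gap = math.log(likelihood) - free_energy(q_trial)
```

With q set to the posterior the gap vanishes; for any other distribution the gap equals D(q || p(h|s)) and is positive, so F(q) stays below the log-likelihood.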

Energy?? What energy?
• In statistical mechanics, a system at temperature T with states x and an energy function E(x) is characterized by Boltzmann’s law: [equation not preserved]
• Z is the partition function: [equation not preserved]
• Given a model p(h, s|T) (a BN), we can define the energy using Boltzmann’s law
• If we think of P(h|s, q): [equation not preserved]
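For reference, the standard form of Boltzmann's law and one common way to map a probabilistic model onto it. The choice of unit temperature and the definition E(h) = -log p(h, s) are assumptions made for this sketch; the slide's own equations are not preserved.

```latex
\begin{align*}
p(x) &= \frac{1}{Z} e^{-E(x)/T}, & Z &= \sum_x e^{-E(x)/T} \\
E(h) &= -\log p(h, s) \quad (T = 1), & Z &= \sum_h p(h, s) = p(s) \\
p(h \mid s) &= \frac{p(h, s)}{p(s)} = \frac{1}{Z} e^{-E(h)}
\end{align*}
```

Under this mapping the partition function is the likelihood p(s), so computing Z is as hard as computing the likelihood.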

Free energy and variational free energy
• The Helmholtz free energy is defined in physics as: [equation not preserved]
• This free energy is important in statistical mechanics, but it is difficult to compute, as is our probabilistic Z (= p(s))
• The variational transformation introduces trial functions q(h) and sets the variational free energy (or Gibbs free energy) to: [equation not preserved]
• The average energy is: [equation not preserved]
• The variational entropy is: [equation not preserved]
• And as before: [equation not preserved]
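One standard way to write these quantities is sketched here, assuming unit temperature and the energy E(h) = -log p(h, s). The sign convention is an assumption: with it, the variational free energy is minimized and equals the negative of the lower bound on log p(s).

```latex
\begin{align*}
F_{\mathrm{Helmholtz}} &= -\log Z = -\log p(s) \\
F(q) &= U(q) - H(q) \\
U(q) &= \sum_h q(h)\, E(h) = -\sum_h q(h) \log p(h, s) \\
H(q) &= -\sum_h q(h) \log q(h) \\
F(q) &= -\log Z + D\bigl(q(h) \,\|\, p(h \mid s)\bigr) \;\ge\; F_{\mathrm{Helmholtz}}
\end{align*}
```

Equality holds exactly when q is the posterior p(h|s), since that is the only case in which the relative entropy is zero.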

Solving the variational optimization problem
• Maximizing U? Focus on max configurations
• Maximizing H? Spread out the distribution
• So instead of computing p(s), we can search for the q that optimizes the free energy
• This is still as hard as before, but we can simplify the problem by restricting q (this is where the additional degrees of freedom become important)

Simplest variational approximation: mean field
• Maximizing U? Focus on max configurations
• Maximizing H? Spread out the distribution
• Let’s assume complete independence among the random variables’ posteriors
• Under this assumption we can try optimizing the q_i (looking for minimal energy!)

Mean field inference
• We optimize iteratively:
  – Select i (sequentially, or using any method)
  – Optimize q_i to minimize F_MF(q_1, ..., q_i, ..., q_n) while fixing all other q’s
  – Terminate when F_MF cannot be improved further
• Remember: F_MF always bounds the likelihood
• The q_i optimization can usually be done efficiently
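The iterative scheme can be sketched on a toy problem with two hidden variables. The table P, holding hypothetical values of p(h1, h2, s) for a fixed observation s, and all function names are assumptions for illustration. The sketch uses the sign convention in which the free energy is a lower bound on log p(s), so each update raises it; with the opposite sign the same updates lower it.

```python
import math


def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]


def bound(P, q1, q2):
    """Lower bound on log p(s): sum_h q(h) log(p(h, s) / q(h)), with q = q1 * q2."""
    value = 0.0
    for a, qa in enumerate(q1):
        for b, qb in enumerate(q2):
            if qa > 0 and qb > 0:
                value += qa * qb * (math.log(P[a][b]) - math.log(qa) - math.log(qb))
    return value


def update_q1(P, q2):
    """Optimal q1 with q2 fixed: log q1(a) = sum_b q2(b) log P[a][b] + const."""
    return normalize([math.exp(sum(qb * math.log(P[a][b]) for b, qb in enumerate(q2)))
                      for a in range(len(P))])


def update_q2(P, q1):
    """Optimal q2 with q1 fixed: log q2(b) = sum_a q1(a) log P[a][b] + const."""
    return normalize([math.exp(sum(qa * math.log(P[a][b]) for a, qa in enumerate(q1)))
                      for b in range(len(P[0]))])


def mean_field(P, sweeps=50, tol=1e-12):
    q1 = [1.0 / len(P)] * len(P)
    q2 = [1.0 / len(P[0])] * len(P[0])
    history = [bound(P, q1, q2)]
    for _ in range(sweeps):
        q1 = update_q1(P, q2)
        q2 = update_q2(P, q1)
        history.append(bound(P, q1, q2))
        if history[-1] - history[-2] < tol:   # no further improvement
            break
    return q1, q2, history


# Hypothetical joint p(h1, h2, s) for one fixed s; all entries are positive.
P = [[0.30, 0.02],
     [0.03, 0.15]]
q1, q2, history = mean_field(P)
log_likelihood = math.log(sum(sum(row) for row in P))
```

The recorded history never decreases and never exceeds log_likelihood, which illustrates both the termination rule and the bound.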

Mean field for a simple-tree model
• Just for illustration, since we know how to solve this one exactly
• We select a node and optimize its q_i while making sure it is a distribution
• The energy decomposes, and only a few terms are affected
• To ease notation, assume the left (l) and right (r) children are hidden
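The resulting update can be sketched in its usual mean field form. The notation is assumed: node i has parent pa_i and hidden children l and r, and λ is the Lagrange multiplier that keeps q_i normalized. The slide's own derivation is not preserved.

```latex
\begin{align*}
\log q_i(h_i) ={}& \sum_{h_{pa_i}} q_{pa_i}(h_{pa_i}) \log p(h_i \mid h_{pa_i})
  + \sum_{h_l} q_l(h_l) \log p(h_l \mid h_i) \\
 &+ \sum_{h_r} q_r(h_r) \log p(h_r \mid h_i) - \lambda,
 \qquad \sum_{h_i} q_i(h_i) = 1
\end{align*}
```

Only the three conditional probabilities that mention h_i appear, which is why each node update is cheap.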

Mean field for a phylo-HMM model
• Now we don’t know how to solve this exactly, but MF is still simple
[Figure: node i at position j with its neighbors at positions j-1 and j+1, its parent pa(i), and its children l and r]
• As before, the optimal solution is derived by making log q_i equal to the sum of the affected terms

Simple mean field is usually not a good idea
• Why? Because the MF trial function is very crude
• For example, we said before that the joint posteriors cannot be approximated by an independent product of the hidden variables’ posteriors
[Figure: example tree with ambiguous A/C hidden states]

Exploiting additional structure
• We can greatly improve accuracy by generalizing the mean field algorithm using larger building blocks
• The approximation specifies independent distributions for each locus, but maintains the tree dependencies
• We now optimize each tree q separately, given the current potentials of the other trees
• The key point is that optimizing for any given tree is efficient: we just use a modified up-down algorithm

Tree based variational inference
• Each tree is only affected by the tree before and the tree after

Tree based variational inference
• We got the same functional form as we had for the simple tree, so we can use the up-down algorithm to optimize q_j

Chain cluster variational inference
• We can use any partition of a BN into trees and derive a similar MF algorithm
• For example, instead of trees we can use the Markov chains in each species
• What will work better for us? It depends on the strength of the dependencies along each dimension: we should try to capture as much “dependency” as possible