Modeling Gene Trees with Context-Free Grammar Heng Li

Modeling Gene Trees with Context-Free Grammar Heng Li The Wellcome Trust Sanger Institute

Overview • Motivation: build accurate gene trees ✤ Conventional: purely from sequence alignment ✤ This talk: combine with the species phylogeny • Method: model trees with CFG (Context-Free Grammar) ✤ TM-WCFG (Weighted CFG) for tree merge ✤ Taxa-SCFG (Stochastic CFG) for gene evolution

Introduction to Context-Free Grammar

Introduction to CFG • Context-free grammar (CFG) is a transformational grammar • Transformational grammar generates strings (or languages) from rules: ✤ Rules: S → SS | aSa | bSb | aa | bb ✤ Generate baab (derivation): ✤ S ⇒ bSb ⇒ baab
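
As a minimal sketch of the grammar on this slide, the Python below checks whether a string can be derived from S → SS | aSa | bSb | aa | bb. The function name `derivable` is hypothetical, and the memoized recursion is only one simple way to test membership for this particular grammar.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def derivable(s: str) -> bool:
    """Return True if s can be derived from S -> SS | aSa | bSb | aa | bb."""
    # Terminal rules S -> aa and S -> bb.
    if s in ("aa", "bb"):
        return True
    # Every other rule yields at least four symbols.
    if len(s) < 4:
        return False
    # S -> aSa or S -> bSb: strip the matching outer symbols.
    if s[0] == s[-1] and s[0] in "ab" and derivable(s[1:-1]):
        return True
    # S -> SS: try every split that leaves at least two symbols per side.
    return any(derivable(s[:i]) and derivable(s[i:])
               for i in range(2, len(s) - 1))
```

For instance, `derivable("baab")` succeeds through S ⇒ bSb ⇒ baab, while `derivable("abab")` fails because no rule sequence produces that string.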

WCFG and SCFG • WCFG = Weighted Context-Free Grammar ✤ Weights assigned to each rule: v→yz • SCFG = Stochastic Context-Free Grammar ✤ Probability assigned to each rule: v→yz ✤ ∀v: ∑_{y,z} p(y, z|v) = 1 • Rules are applied independently

HMM vs. SCFG • A path of hidden states corresponds to a derivation (parse tree). • SCFG is a superset of HMM (CFG is a superset of FSA): ✤ Viterbi vs. CYK: the optimal derivation ✤ Forward vs. inside: probability of the observation ✤ Baum-Welch vs. Inside-Outside: EM parameter estimation

Tree Merge WCFG

Tree merge problem • Merge several trees with an identical leaf set into an optimal tree. • Optimum means: ✤ fewer duplications and losses ✤ higher overall bootstrap support

Overview: tree merge • Input: several gene trees with identical leaves • Output: a gene tree ✤ with each branch coming from one of the input trees ✤ that optimizes a certain objective function • Procedure: ✤ Calculate bootstrap & infer gene duplications and losses ✤ Generate WCFG rules and their weights from the input ✤ Top-down CYK to find the optimum

[Figure: trees annotated with W=2 and W=5]

Rules (W=6)
  Rule                        (dup, loss)        weight t
  A⇒‹rat, mou1›               (0, 0)             0
  B⇒‹A, mou2› | ‹rat, E›      (1, 1) | (0, 0)    2 | 0
  C⇒‹B, hum›                  (0, 0)             0
  D⇒‹C, chick› | ‹B, F›       (0, 0) | (1, 2)    0 | 3
  E⇒‹mou1, mou2›              (1, 0)             1
  F⇒‹hum, chick›              (0, 1)             1
Derivation: D⇒‹B, F›⇒‹‹A, mou2›, ‹hum, chick››⇒‹‹‹rat, mou1›, mou2›, ‹hum, chick››

Rules (W=1)
  Rule                        (dup, loss)        weight t
  A⇒‹rat, mou1›               (0, 0)             0
  B⇒‹A, mou2› | ‹rat, E›      (1, 1) | (0, 0)    2 | 0
  C⇒‹B, hum›                  (0, 0)             0
  D⇒‹C, chick› | ‹B, F›       (0, 0) | (1, 2)    0 | 3
  E⇒‹mou1, mou2›              (1, 0)             1
  F⇒‹hum, chick›              (0, 1)             1
Derivation: D⇒‹C, chick›⇒‹‹B, hum›, chick›⇒‹‹‹rat, E›, hum›, chick›⇒‹‹‹rat, ‹mou1, mou2››, hum›, chick›

[Figure: trees annotated with W=2, W=1 and W=5]

Top-down CYK • Cocke-Younger-Kasami (CYK) algorithm • Find the optimum derivation • γ(v): optimum sum of weights up to state v • t(y, z|v): weight of rule v→‹y, z› • γ(v) = max_{y,z} {γ(y) + γ(z) + t(y, z|v)} • Calculated recursively
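
The recursion for γ(v) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the NJTREE implementation: the rule table is copied from the example rules earlier in the deck, with C→‹B, hum› as used in the W=1 derivation; each rule's cost is its listed weight (duplications plus losses), and the weight passed to the recursion is the negated cost, so that maximizing γ favors fewer events. The names `RULES` and `gamma` are hypothetical.

```python
from functools import lru_cache

# Hypothetical rule table: state -> list of (left child, right child, cost).
# Costs follow the example slide (duplications + losses per rule).
RULES = {
    "A": [("rat", "mou1", 0)],
    "B": [("A", "mou2", 2), ("rat", "E", 0)],
    "C": [("B", "hum", 0)],
    "D": [("C", "chick", 0), ("B", "F", 3)],
    "E": [("mou1", "mou2", 1)],
    "F": [("hum", "chick", 1)],
}


@lru_cache(maxsize=None)
def gamma(v):
    """Return (optimum weight sum below v, best subtree as nested tuples)."""
    if v not in RULES:          # a leaf (terminal) contributes no weight
        return 0, v
    best = None
    for y, z, cost in RULES[v]:
        gy, tree_y = gamma(y)
        gz, tree_z = gamma(z)
        score = gy + gz - cost  # t(y, z | v) = -cost (assumption)
        if best is None or score > best[0]:
            best = (score, (tree_y, tree_z))
    return best
```

Calling `gamma("D")` returns the weight -1 together with the tree ‹‹‹rat, ‹mou1, mou2››, hum›, chick›, which matches the W=1 derivation. Memoization means each state is evaluated once, which is what makes the top-down form practical.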

Weight independence • The weight for v→‹y, z› must be a function of (v, y, z) only • Weights should be additive • Possible weights: ✤ bootstrap ✤ number of duplications and losses

Taxonomy SCFG

Overview: taxa-SCFG • Taxa-SCFG rules: from a species tree • Taxa-SCFG generates a gene tree ✤ alternative way to infer gene duplications and losses ✤ calculate the probability of a gene tree • A derivation corresponds to a reconciliation ✤ A reconciliation: a consistent assignment of speciations and gene duplications on the gene tree.

Taxa-tree ⇒ rules
  Species tree: A has children E and C; E has children H and M
  duplication: A⇒‹A, A›   E⇒‹E, E›   H⇒‹H, H›   M⇒‹M, M›   C⇒‹C, C›
  speciation:  A⇒‹E, C›   E⇒‹H, M›   H⇒g_H   M⇒g_M   C⇒g_C
  loss:        A⇒ε   E⇒ε   H⇒ε   M⇒ε   C⇒ε

Rules ⇒ gene trees

Taxa-SCFG rules
  Type                     Rule          Probability      Meaning
  W is an internal node    W→‹W, W›      pd(1-pl)         duplication
                           W→‹W1, W2›    (1-pd)(1-pl)     speciation
                           W→ε           pl               loss
  W is an external node    W→‹W, W›      pd(1-pl)         duplication
                           W→g_W         (1-pd)(1-pl)     speciation
                           W→ε           pl               loss
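
Two properties of this rule table are easy to check numerically: the three probabilities for a node sum to one, and, because rules are applied independently, the probability of a derivation is the product of the probabilities of the rules it uses. The sketch below assumes hypothetical counts `n_dup`, `n_spec` and `n_loss` of rule applications (with `n_spec` including every W→g_W step); these names and the function names are not from the slides.

```python
import math


def rule_probs(pd, pl):
    """Rule probabilities for one species node, given pd and pl."""
    return {
        "duplication": pd * (1 - pl),
        "speciation": (1 - pd) * (1 - pl),
        "loss": pl,
    }


def derivation_log_prob(n_dup, n_spec, n_loss, pd, pl):
    """Log-probability of a derivation from counts of each rule type."""
    p = rule_probs(pd, pl)
    return (n_dup * math.log(p["duplication"])
            + n_spec * math.log(p["speciation"])
            + n_loss * math.log(p["loss"]))
```

Working in log space avoids underflow when a gene tree has several hundred leaves and the product involves many small factors.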

Prob. of a reconciliation ✤ nd: # duplications ✤ nl: # losses ✤ m: # leaves in the species tree ✤ n: # leaves in the gene tree • The most probable π* corresponds to the most parsimonious reconciliation

Prob. of a gene tree • P(G|X, S)=P(G|X)∙P(G|S) ✤ P(G|X): prob. from seq. alignment X ✤ P(G|S): prob. from species tree S • P(G|S) can be approximated as P(G, π*|S) where π* is the most probable reconciliation • In theory, P(G|S) can be calculated with an alternative SCFG model

Alternative Model
  Type                     Rule          Probability      Meaning
  W is an internal node    W→‹W, W›      pw               duplication
                           W→‹W1, W2›    1-pw-2qw         speciation
                           W→W1          qw               loss
                           W→W2          qw               loss
  W is an external node    W→‹W, W›      pw               duplication
                           W→g_W         1-pw             speciation

Implementation & Evaluation

Reconstruct gene trees • Species-aware ML (find a gene tree G that maximizes P(G|X, S)): ✤ Calculate P(G|X) with PhyML ✤ Infer π* with parsimonious reconciliation ✤ Maximize P(G|X)∙P(G, π*|S) with a PhyML-type search • Tree merge (five trees): ✤ NJ-dS, NJ-dN, NJ-p, PhyML-HKY, PhyML-WAG *

Implementation • Implemented in NJTREE (or whatever) • Used in TreeFam and Ensembl-Compara ✤ about 20,000 gene trees ✤ gene trees with several hundred leaves • Website: ✤ http://treesoft.sourceforge.net ✤ http://www.treefam.org

Evaluation • Data set: 197 curated gene trees from TreeFam ✤ 8182 internal nodes ✤ Genes mainly from vertebrates, together with worms, flies and a few plants and yeasts • Criterion: number of duplications and losses; the fewer the better

Type: NJ-WAG, NJ-HKY, NJ-dS, NJ-dN, NJ-dM, ProtPars, ML-WAG*, ML-HKY*, Merged*
#dup: 2709 2520 3143 2307 1899 2143 2270 1675 2117 1749 1729 1575
#loss: 11148 8698 14882 8199 5037 6806 7696 3973 5974 3761 3895 3083
comment: NJ-dN+dS, tree merge, species-aware

Acknowledgement • Sanger Institute ✤ Richard Durbin ✤ Avril Coghlan, Alan Moses, Jean-Karim ✤ Durbin research group • Others ✤ Lachlan Coin ✤ Jue Ruan

Thank You!

Advanced Taxa-SCFG • CYK for parsimonious reconciliation • Inside algorithm for the probability of a gene tree P(G|S) • Inside-Outside for parameter estimation

Seq-SCFG • Rules from a known phylogenetic tree • Generate the sequences at the leaf nodes • A derivation corresponds to an assignment of ancestral sequences

Seq-SCFG
  Type             Rule                        Probability
  start symbol     S0→(1, a)                   q(a)
  i is internal    (i, a)→‹(j, b), (k, c)›     pij(b|a) pik(c|a) δi(j, k)
  i is external    (i, a)→a                    1
✤ δi(j, k)=1 iff edges (i, j) and (i, k) are present in the known tree. ✤ a is a terminal symbol. For example: a∈{A, C, G, T} ✤ i is a node, internal or external

Seq-SCFG algorithms • CYK ⇒ Maximum parsimony • Inside = Felsenstein’s pruning algorithm • Inside-Outside for the posterior distribution of ancestral states and parameter est. • Extension to the previous: ✤ Felsenstein’s algorithm on multiple trees ✤ TM-WCFG is Seq-SCFG
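
The correspondence between the inside algorithm and Felsenstein's pruning algorithm can be sketched for a single alignment column. Everything specific here is an assumption for illustration: the Jukes-Cantor substitution model standing in for pij(b|a), the uniform root distribution for q(a), the nested-tuple tree encoding, and all function names.

```python
import math

BASES = "ACGT"


def jc69(b, a, t):
    """Jukes-Cantor probability of base b given parent base a, branch length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e


def inside(node):
    """Return {a: P(observed leaves below node | node carries base a)}.

    A leaf is a one-letter string; an internal node is
    ((left_child, left_branch_length), (right_child, right_branch_length)).
    """
    if isinstance(node, str):
        # Rule (i, a) -> a has probability 1 only for the observed base.
        return {a: 1.0 if a == node else 0.0 for a in BASES}
    (left, t_left), (right, t_right) = node
    in_left, in_right = inside(left), inside(right)
    # Rule (i, a) -> <(j, b), (k, c)>: sum over the child bases b and c.
    return {
        a: sum(jc69(b, a, t_left) * in_left[b] for b in BASES)
        * sum(jc69(c, a, t_right) * in_right[c] for c in BASES)
        for a in BASES
    }


def site_likelihood(root, q=None):
    """Sum over the root base a of q(a) times the inside value at the root."""
    q = q or {a: 0.25 for a in BASES}
    alpha = inside(root)
    return sum(q[a] * alpha[a] for a in BASES)
```

For example, `site_likelihood((("A", 0.1), ("C", 0.2)))` gives the probability of observing A and C at the two leaves of a cherry. Replacing each sum in `inside` with a max turns the same recursion into a CYK-style search for the best assignment of ancestral bases.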

Why gene trees? • Key to gene annotations across species • Understand gene functions • Help other evolutionary studies ✤ selection of genes ✤ intron evolution ✤ whole genome duplication • Infer species trees