Alignment scoring matrices and evolution

ORF Graphs: types of edges
• ATG → TAG (single exon gene)
• ATG → GT (initial coding exon)
• GT → AG (intron)
• AG → GT (internal exon)
• AG → TAG (terminal coding exon)
• TAG → ATG (intergenic region)

Conceptual framework for gene finding with ORF graphs
1. Given an input sequence S, compute the ORF graph G for S.
2. Score the vertices and edges in G using some scoring strategy or function, f.
3. Extract the highest-scoring valid parse from G according to f.
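The three steps can be illustrated with a deliberately simplified Python sketch. It is not how any particular gene finder works: the names `find_signals`, `toy_score`, and `best_parse` are hypothetical, reading frame and intron phase are ignored, and the scoring function is a toy stand-in for f (coding edges score their length, introns cost 1, intergenic regions score 0). A parse must start at an ATG and end at a stop codon.

```python
import re

# Allowed edge types: (left signal, right signal) -> edge label.
EDGE_TYPES = {
    ("ATG", "STOP"): "single-exon gene",
    ("ATG", "GT"): "initial exon",
    ("GT", "AG"): "intron",
    ("AG", "GT"): "internal exon",
    ("AG", "STOP"): "terminal exon",
    ("STOP", "ATG"): "intergenic",
}
PATTERNS = {"ATG": "ATG", "STOP": "TAG|TGA|TAA", "GT": "GT", "AG": "AG"}
SIGNAL_LEN = {"ATG": 3, "STOP": 3, "GT": 2, "AG": 2}

def find_signals(seq):
    """Step 1 (vertices): sorted (position, kind) for every candidate signal."""
    signals = []
    for kind, pat in PATTERNS.items():
        # A lookahead reports overlapping occurrences as well.
        for m in re.finditer(f"(?={pat})", seq):
            signals.append((m.start(), kind))
    return sorted(signals)

def toy_score(label, length):
    """Step 2: toy scoring function f (an assumption, not a real model)."""
    if label == "intron":
        return -1.0
    if label == "intergenic":
        return 0.0
    return float(length)  # coding edges: longer is better

def best_parse(seq, score=toy_score):
    """Step 3: highest-scoring path from an ATG to a stop codon."""
    signals = find_signals(seq)
    minus_inf = float("-inf")
    best = [0.0 if kind == "ATG" else minus_inf for _, kind in signals]
    back = [None] * len(signals)
    # Edges only run left to right, so one pass in position order suffices.
    for j, (pos_j, kind_j) in enumerate(signals):
        for i in range(j):
            pos_i, kind_i = signals[i]
            label = EDGE_TYPES.get((kind_i, kind_j))
            if label is None or best[i] == minus_inf:
                continue
            if pos_j < pos_i + SIGNAL_LEN[kind_i]:
                continue  # the two signals would overlap
            cand = best[i] + score(label, pos_j - pos_i)
            if cand > best[j]:
                best[j], back[j] = cand, (i, label)
    ends = [j for j, (_, kind) in enumerate(signals)
            if kind == "STOP" and back[j] is not None]
    if not ends:
        return None, []
    j = max(ends, key=lambda k: best[k])
    total, path = best[j], []
    while back[j] is not None:
        i, label = back[j]
        path.append((label, signals[i][0], signals[j][0]))
        j = i
    return total, path[::-1]

print(best_parse("ATGCCCTAG"))
```

A real gene finder would also track reading frame so that only in-frame, stop-free exons are joined, and would replace `toy_score` with a trained statistical model.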

Common assumptions
• No overlapping genes
  – reasonably safe
• No nested genes
  – reasonably safe
• No partial genes
  – limiting; needs to be relaxed for real gene finders

Common assumptions (continued)
• No non-canonical splice sites or start codons
  – GTG and TTG are common start sites in bacteria
  – GC-AG and AT-AC introns are present in plants and animals
• No frame shifts or sequencing errors
  – OK for finished sequence

Common assumptions (continued)
• Optimal parse only
  – useful, but if we allow the gene finder more guesses it may do better
• Constraints on exon/intron lengths
  – max sizes needed for practical reasons
  – but human introns can be >100,000 bp
  – for now we just miss those

Common assumptions (continued)
• No split start/stop codons
  – introns can occur right in the middle of an ATG or TAG/TGA/TAA!
  – (most of) today’s gene finders will miss these
• No alternative splicing
  – some species have an average of 2 alternative splicing products per gene
  – researchers are actively working on this

Common assumptions (continued)
• No selenocysteine codons
  – encoded by TGA
• No ambiguity codes
  – R = purine = A or G
  – Y = pyrimidine = C or T
  – other codes for all possible ambiguous base calls
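To show what relaxing the "no ambiguity codes" assumption would involve, here is a small Python sketch that expands the standard IUPAC nucleotide codes and tests whether an ambiguous triplet could be a given codon. The helper name `could_match` is hypothetical, and this is only one simple way to handle ambiguous base calls.

```python
# Standard IUPAC nucleotide ambiguity codes: code -> bases it may stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def could_match(ambiguous, concrete):
    """True if every ambiguous symbol is compatible with the concrete base."""
    if len(ambiguous) != len(concrete):
        return False
    return all(base in IUPAC[code] for code, base in zip(ambiguous, concrete))

print(could_match("RTG", "ATG"))  # R may be A, so this could be a start codon
print(could_match("YTG", "ATG"))  # Y is C or T, so it cannot be ATG
```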

The (Eukaryotic) Gene Prediction Problem
[Figure: a gene drawn as exons separated by introns and flanked by UTRs, with the signals ATG … GT … AG … GT … AG … TAG marked along the sequence.]
Gene prediction is the problem of parsing a sequence into nonoverlapping coding segments (CDSs) consisting of exons separated by introns. Untranslated regions (UTRs) are rarely predicted. This parsing problem can be visualized as one of choosing the best path through the graph of all open reading frames (ORFs).
slide courtesy of Bill Majoros

Some Eukaryotic Gene-finding Programs
GenScan, TwinScan, Genie, SGP2, TWAIN, Augustus, JIGSAW, VEIL, GrailEXP, SNAP, GenomeScan, FGenesH, SLAM, GlimmerHMM, EuGène, Unveil, Morgan, Exonomy, GeneZilla, DoubleScan, SGP1, GlimmerM, GeneMark, HMMgene, Grail, Phat, GAZE
(In the original slide, gray boxes mark the programs developed in the Salzberg lab.)

Similarities Between Gene-finding Programs
[Figure: diagram grouping gene-finding programs by approach (ad hoc, HMMs, GHMMs, homology-based, combiners). Programs shown: GlimmerM, GrailEXP, MORGAN, GeneMark, FGenesH, TWAIN, SLAM, TwinScan, GenomeScan, DoubleScan, SGP-1, SGP-2, Jigsaw, Ensembl, GAZE, GeneZilla, GenScan, Phat, SNAP, GlimmerHMM, Augustus, Genie, Exonomy, Unveil, HMMgene, VEIL.]

Gene Finding Strategies
Gene finding programs can be classified into several types:
(1) ad hoc. These apply an ad hoc scoring function to the set of all ORFs and then predict only those ORFs scoring above a predefined threshold. Examples are GRAIL and GlimmerM.
(2) probabilistic. These adopt a rigorous probabilistic model of sequence structure and choose the most probable parse according to that probabilistic model. Examples are GenScan and GeneZilla.
(3) homology-based. These utilize evidence in the form of homology. These can be either ad hoc (e.g., GrailEXP) or probabilistic (e.g., TwinScan, SLAM, TWAIN).
(4) combiners. These combine multiple forms of evidence, such as the predictions of other gene finders, and use ad hoc methods to arrive at a consensus prediction. Examples include Ewan Birney’s Ensembl and Jonathan Allen’s JIGSAW.

Probabilistic Gene Finders

Review of Probability Theory
$P(x)$ denotes the probability of event $x$ occurring. $P(x) = 0.25$ means, for example, that $x$ occurs 25% of the time. $P(x) = 1$ implies that the event $x$ is certain to occur. $P(x) = 0$ implies that $x$ cannot occur.
The probability of events $x$ and $y$ both occurring is denoted by $P(x, y)$ or $P(x \wedge y)$ (joint probability).
$P(x \mid y)$ denotes the probability of event $x$ occurring, given that $y$ has occurred, or given that $y$ is a true statement. If $P(y) \neq 0$ then:
$$P(x \mid y) = \frac{P(x, y)}{P(y)}$$ (conditional probability)
so that:
$$P(x, y) = P(x \mid y) \, P(y)$$ (multiplication rule)
If $x$ and $y$ are independent then:
$$P(x, y) = P(x) \, P(y), \qquad P(x \mid y) = P(x)$$ (independence)
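As a worked illustration of the conditional probability and multiplication rules, the following Python sketch estimates probabilities from adjacent base pairs in a made-up sequence. Here y is the event "the first base of a pair is C" and x is "the second base is G"; the function name `cond_prob_next` and the example sequence are assumptions for illustration only.

```python
from fractions import Fraction

def cond_prob_next(seq, first, second):
    """Return (P(x | y), P(x, y), P(y)) over adjacent base pairs of seq,
    where y = 'first base is `first`' and x = 'second base is `second`'."""
    pairs = list(zip(seq, seq[1:]))
    n = len(pairs)
    if n == 0:
        raise ValueError("need at least two bases")
    p_y = Fraction(sum(1 for a, _ in pairs if a == first), n)
    p_xy = Fraction(sum(1 for a, b in pairs if a == first and b == second), n)
    if p_y == 0:
        raise ValueError("P(y) = 0, so P(x | y) is undefined")
    return p_xy / p_y, p_xy, p_y

cond, joint, p_y = cond_prob_next("ACGCGTCA", "C", "G")
print(cond, joint, p_y)        # conditional, joint, and marginal estimates
print(cond * p_y == joint)     # multiplication rule holds exactly
```

Exact fractions are used so that the multiplication rule can be checked without floating-point rounding.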

Review of Probability Theory
If $x$ and $y$ are mutually exclusive (i.e., can’t both happen), then:
$$P(x \vee y) = P(x) + P(y)$$ (addition rule)
If an experiment is guaranteed to yield one of a set of mutually exclusive events $\{x_1, x_2, \ldots, x_n\}$ then:
$$\sum_{i=1}^{n} P(x_i) = 1$$ (partitioning rule)
If a set of events $\{x_1, x_2, \ldots, x_n\}$ are mutually independent (pairwise independence alone is not sufficient) then:
$$P(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} P(x_i)$$ (independence)
$P_M(x) = P(x \mid M)$ is an estimate of $P(x)$ according to model $M$. (prob. model)
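The independence rule and the notion of a model probability can be made concrete with a short Python sketch. It assumes a hypothetical model M in which every base is drawn independently with fixed frequencies (the numbers in `MODEL` are invented), and it works in log space, a common way to avoid numerical underflow on long sequences.

```python
import math

# Hypothetical model M: independent bases with these (made-up) frequencies.
MODEL = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}

# Partitioning rule: the four mutually exclusive outcomes sum to 1.
assert math.isclose(sum(MODEL.values()), 1.0)

def log_prob(seq, model=MODEL):
    """Return log P_M(seq), assuming every base is drawn independently,
    so the joint probability is the product of per-base probabilities."""
    return sum(math.log(model[base]) for base in seq)

print(math.exp(log_prob("ACGT")))  # equals 0.3 * 0.2 * 0.2 * 0.3 up to rounding
```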

Ab initio Gene Finding = Modeling
Computational gene prediction is generally carried out as follows:
1. We formulate a mathematical model M which describes some method for generating DNA sequences and their gene structures. That is, M generates pairs of the form (S, φ) for sequence S and gene structure φ.
2. Using a set T of known genes, we customize or “train” the model so that the sequences and gene structures which M generates have the same statistical properties as the real genes in T.
3. Given an un-annotated sequence S, we pretend that M generated S, even though we know that it did not. (Evolution and the cellular machinery generated it, not our model!)
4. Since we assume that M generated S, we can determine the most likely way for M to have generated it, i.e., the most likely intron/exon boundaries that M would have imposed while it was generating S.

A Model Generates Sequences and Gene Structures
The underlying model of a gene finder generates both sequences and their gene structures. These can be denoted as pairs of the form (S, φ) for sequence S and gene structure φ.
[Figure: model M emitting a DNA sequence together with its gene structure (exon 1, exon 2, exon 3).]

The Model Parameters Affect the Model’s Output
The parameters to M determine the statistical properties of the sequences and gene models which it generates (both the structural properties of the gene models and the frequencies of individual bases).
[Figure: three copies of M with different parameter sets (6, 23, 843, 924, …; 45, 2, 8214, 32, …; 11, 413, 7, 235, 8, …), each generating a different sequence.]

Tuning the Model to Obtain Optimal Behavior
The parameters to M can be tuned to make its outputs most similar to the set of known genes.
[Figure: the sequence predicted by M is compared with a known sequence; the parameters are modified repeatedly while the prediction is still dissimilar, until the predicted and known sequences are similar.]
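The slide does not say how the parameters are tuned. One common approach for simple models is to estimate them directly from counts in the training data. The Python sketch below does this for base frequencies only; the function name `estimate_base_freqs`, the pseudocount, and the training sequences are all assumptions for illustration.

```python
def estimate_base_freqs(training_seqs, pseudocount=1.0):
    """Estimate base frequencies from training sequences by counting.

    The pseudocount (an assumption of this sketch) keeps unseen bases
    from receiving probability zero."""
    counts = {base: pseudocount for base in "ACGT"}
    for seq in training_seqs:
        for base in seq:
            if base in counts:  # skip ambiguity codes and other symbols
                counts[base] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no training data and no pseudocount")
    return {base: c / total for base, c in counts.items()}

# Made-up training data standing in for the set of known genes.
print(estimate_base_freqs(["ATGGCC", "ATGAAATAG"]))
```

With the parameters set this way, sequences generated by the model have the same base composition, on average, as the training set.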

Using the Model to Predict Gene Structure
Given an input sequence, we can ask what gene structure the model is most likely to have generated at the same time that it generated that sequence.
[Figure: Gene Finder XYZ (model M) reads a parameter file (11, 413, 7, 235, 8, …) and a sequence S (AGCTAGTACG…) and outputs a gene structure φ.]
Suppose M was invoked 1,000,000 times, producing pairs (S_i, φ_i). Collect those pairs (S_i, φ_i) where S_i = S. Among those pairs, which gene structure φ is most common? Emit that φ as the predicted gene structure (maximum a posteriori, MAP).
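The thought experiment can be run literally on a toy model. The Python sketch below assumes a hypothetical two-state generator (E for exon, I for intron) with invented parameters; a gene structure φ is just the string of state labels. It invokes the model many times, keeps the pairs whose sequence equals the target, and returns the most common structure. Practical gene finders typically obtain the same answer with dynamic programming (for example the Viterbi algorithm) rather than by sampling.

```python
import random
from collections import Counter

# Hypothetical toy model M (all numbers are made up for illustration).
STAY = 0.8  # probability of keeping the same state at the next position
EMIT = {
    "E": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},  # exon: GC-rich
    "I": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},  # intron: AT-rich
}

def generate(length, rng):
    """Invoke M once: return a pair (S, phi) of equal-length strings."""
    state = rng.choice("EI")
    seq, phi = [], []
    for k in range(length):
        if k > 0 and rng.random() >= STAY:
            state = "I" if state == "E" else "E"
        bases = list(EMIT[state])
        weights = list(EMIT[state].values())
        seq.append(rng.choices(bases, weights=weights)[0])
        phi.append(state)
    return "".join(seq), "".join(phi)

def map_by_sampling(target, n_samples=100_000, seed=0):
    """Keep the samples whose sequence equals target and return the most
    common gene structure among them (None if target was never generated)."""
    rng = random.Random(seed)
    tally = Counter()
    for _ in range(n_samples):
        s, phi = generate(len(target), rng)
        if s == target:
            tally[phi] += 1
    return tally.most_common(1)[0][0] if tally else None

print(map_by_sampling("GGAA"))
```

Sampling only works here because the target is four bases long; the chance of generating any specific sequence shrinks exponentially with its length, which is why this remains a thought experiment for real genomes.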