Modelling genomes Gil Mc Vean Department of Statistics

Why would we want to model a genome? • To identify genes – Protein-coding

EMISSIONS STATES A rather simple model for a protein-coding gene Non-coding DNA T: C:

Define model Explore properties Estimate parameters from data Test goodness-of-fit Refine model A ‘genome’

Hidden Markov Models in bioinformatics • The model of a gene just described can

Making useful HMMs in bioinformatics • To be useful, HMMs for genes have to

Searching for homology • If we compare human and chimpanzee sequences they are approximately

Alignment of PFEMP 1 proteins from P. falciparum

The simplest problem: aligning two sequences • Suppose we have just two protein sequences

How can we construct an alignment algorithm? • What we want to do is

How does is work? • Suppose residues Xi and Yj are aligned to each

The full algorithm • • We need to consider similar transitions for the cases

The forward algorithm Xi+1 Emission probabilities = ek(Xi+1 ) H H D Transition probabilities

The Viterbi algorithm Xi-1 Xi Xi+1 H H H D D D A traceback

An example • Suppose the gap opening and extension parameters are 0. 2 and

Posterior decoding • Using the forward-backward algorithm we can calculate the posterior probability that

Extending the method • Originally, alignment algorithms (Needleman and Wunsch, 1970; Smith and Waterman,

Doing away with alignment • For most problems, the alignment is not of primary

Gene conversion and var gene diversity in P. falciparum • Multiple alignment methods typically

Mosaic alignment • We could model the n+1 th sequence as a mosaic of

Extensive mosaicism within the var family

Slides: 21

Download presentation

Modelling genomes Gil Mc. Vean Department of Statistics, Oxford

Why would we want to model a genome? • To identify genes – Protein-coding – RNA – Small RNAs • To identify regulatory elements – Transcription factor binding sites – Enhancers • To classify genome content – Repeat DNA – Unique sequence • To understand the processes that shape genomes – – – Mutation Recombination Duplication Rearrangement Natural selection

EMISSIONS STATES A rather simple model for a protein-coding gene Non-coding DNA T: C: A: G: 30% 20% s Start codon ATG: 100% AAA: AAC: AAG: … TTT: Codon 1/61% t TERM TAA: 30% TAG: 40% TGA: 30% Non-coding DNA T: C: A: G: 30% 20%

Define model Explore properties Estimate parameters from data Test goodness-of-fit Refine model A ‘genome’ model is like any other statistical model

Hidden Markov Models in bioinformatics • The model of a gene just described can be thought of as a hidden Markov model (HMM) – The underlying states evolve in a Markov fashion, but we observe features (the DNA sequence) emitted by those states • You will remember that there are lots of nice computational properties of hidden Markov models that we can use for inference – Finding a most likely sequence of states – Calculating posterior probabilities of a given state at a given position • There also various algorithms we can use to estimate parameters of HMMs (e. g. ML estimation by EM) • How would you use the model of a gene to find new genes? – How well do you think it would do?

Making useful HMMs in bioinformatics • To be useful, HMMs for genes have to incorporate many features – Regulatory sequences – Intron-splicing features – Correlations and biases in amino acid and base composition • A REALLY important feature to capture is their evolution – Important parts of genes and genomes evolve slower due to constraint

Searching for homology • If we compare human and chimpanzee sequences they are approximately 98. 8% identical at the DNA level. It is ‘easy’ to identify which parts of the genome in humans correspond to which parts in chimps • If we compare human with, say mouse, we can see some parts that are similar, and other parts where there is only vague or even no obvious similarity. • When measuring evolution, we need to identify regions that are homologous – Homology means similarity by descent • Traditionally, the problem of identifying homology has been intrinsically linked to the problem of alignment

Alignment of PFEMP 1 proteins from P. falciparum

The simplest problem: aligning two sequences • Suppose we have just two protein sequences that we want to align WAKIS WEEKS • In evolution, three types of event can happen – Mutation to new amino acids – Insertion of new amino acids – Deletion of amino acids • We want to work out which amino acids in the two sequences are homologous – i. e. related to each other through shared ancestry W—AKIS WEEK-S What do the ‘-’s really mean?

How can we construct an alignment algorithm? • What we want to do is to look at every possible alignment and choose the one that is ‘best’ • What we have to do is to find an efficient algorithm that can search every possible alignment and that has an objective measure as to what ‘best’ means • A natural approach is to make a model of alignments, parameterise it and find the alignment that maximises the likelihood • Although the problem sounds hard we can solve it using a hidden Markov model structure

How does is work? • Suppose residues Xi and Yj are aligned to each other Xi Yj • Three things could happen next – The next two residues in each sequence could also align – A gap could be introduced in sequence X – A gap could be introduced in sequence Y Xi. Xi+1 (A) • Yj. Yj+1 (B) X i. Yj+1 We can parameterise the probabilities of each event (A) (B) (C) Xi. Xi+1 Y j-

The full algorithm • • We need to consider similar transitions for the cases when residue Xi is aligned to a gap after residue Yj, and when Yj is aligned to a gap after Xi Xi …- Yj Yj-a…Yj …- We need to specify various probabilities – – – • Xi-a…Xi The probability of inserting a gap The probability of extending a gap The probability of finishing the alignment The probability of observing an aligned pair of residues (20 x 20) The probability of observing a residue aligned to a gap (20) Once specified we can use the Viterbi and Forward/Backward algorithms to identify ML alignments, sample from the posterior or calculate posterior probabilities

The forward algorithm Xi+1 Emission probabilities = ek(Xi+1 ) H H D Transition probabilities = qij In alignment the state space is two-dimensional (residue i aligned to residue j)

The Viterbi algorithm Xi-1 Xi Xi+1 H H H D D D A traceback matrix is used to keep track of the best partial alignments

An example • Suppose the gap opening and extension parameters are 0. 2 and 0. 5 respectively. There is a 80% chance of observing a match, a 20/19% chance of observing any given mismatch and a 5% chance of observing each unaligned amino acid (We can ignore termination for the moment) • The BEST alignments are given below, each of which has log likelihood of 16. 84, or 31% of the total likelihood (lnlk = -15. 67). • W—AKIS WA-KIS WEEK-S In many real situations, the best alignment represents only a fraction of the total likelihood

Posterior decoding • Using the forward-backward algorithm we can calculate the posterior probability that any residue is aligned to any other, or that a given residue is in a gap state Y 1 X 2 X 3 X 4 X 5 Y 1 Y 2 Y 3 Y 4 Y 5 X 1 X 2 X 3 X 4 X 5 Conditional on X 2 -Y 3

Extending the method • Originally, alignment algorithms (Needleman and Wunsch, 1970; Smith and Waterman, 1981; Gotoh 1982) were not explicitly defined as hidden Markov models – Finite-state automata (FSA) • There have been many extensions to the original idea – – – • Local alignment Repeat alignment Protein family identification Gene finding Multiple alignment The alignment algorithm is very much a workhorse of bioinformatics, as an alignment is needed or almost all subsequent analyses (e. g. phylogenetic tree reconstruction, population genetic inference) – However, relying on a single alignment is not always a great idea

Doing away with alignment • For most problems, the alignment is not of primary interest • The natural thing to do is to integrate over alignments (as in the FB algorithm) to estimate parameters of interest • The key problem is that there is no computationally efficient algorithm for statistical multiple alignment. All widely-used methods use heuristic approaches

Gene conversion and var gene diversity in P. falciparum • Multiple alignment methods typically assume the sequences are related to each through an evolutionary tree • For the case of multi-gene families, this may not be the case, because gene conversion between copies can lead to mosaic structures • If we wish to learn about the processes of conversion, a natural approach is to model the mosaicism – In the case of var genes, the sequences are so diverged that we also need to consider the problem of alignment

Mosaic alignment • We could model the n+1 th sequence as a mosaic of the previous n • We can calculate the likelihood of observing a given sequence by summing over all possible mosaic structures and their alignment • We can also identify the most likely mosaic structure and calculate the expected number of recombination events – Repeating the procedure for all sequences provides a way of assessing the importance of mosaicism within the family

Extensive mosaicism within the var family