Computational Physics applied to Macromolecules: Probabilistic Models for Biological Sequences. Pier Luigi Martelli, Università di Bologna. gigi@biocomp.unibo.it, 051 2094005, 338 3991609
PROLOGUE: Pitfalls of standard alignments
Scoring a pairwise alignment. A: ALAEVLIRLITKLYP, B: ASAKHLNRLITELYP, scored with BLOSUM62.
Alignment of a family (globins): different positions are not equivalent.
Sequence logos (http://weblogo.berkeley.edu/cache/file5h2DWc.png). The substitution score IN A FAMILY should depend on the position (and the same holds for gaps). For modelling families we need more flexible tools.
Probabilistic Models for Biological Sequences • What are they?
Probabilistic models for sequences. Generative definition: objects producing different outcomes (sequences) with different probabilities. The probability distribution over the sequence space determines the model specificity: M generates si with probability P(si | M). E.g.: M is the representation of the family of globins.
Probabilistic models for sequences. We don't need a generator of new biological sequences: the generative definition is useful as an operative definition. Associative definition: objects that, given an outcome (sequence), compute a probability value. M associates the probability P(si | M) to si. E.g.: M is the representation of the family of globins.
Probabilistic models for sequences. The most useful probabilistic models are trainable systems: the probability density function over the sequence space is estimated from known examples by means of a learning algorithm (generalization). E.g.: writing a generic representation of the sequences of globins starting from a set of known globins.
Probabilistic Models for Biological Sequences • What are they? • Why to use them?
Modelling a protein family. Given a protein class (e.g. globins), a probabilistic model trained on this family can compute a probability value for each new sequence (e.g. Seq1: 0.98, Seq2: 0.21, Seq3: 0.12, Seq4: 0.89, Seq5: 0.47, Seq6: 0.78). This value measures the similarity between the new sequence and the family described by the model.
Probabilistic Models for Biological Sequences • What are they? • Why to use them? • Which probabilities do they compute?
P(s | M) or P(M | s)? A model M associates to a sequence s the probability P(s | M). This probability answers the question: what is the probability that a model describing the globins generates the sequence s? The question we want to answer is instead: given a sequence s, is it a globin? We need to compute P(M | s)!
Bayes' Theorem. Joint probability: P(X, Y) = P(X | Y) P(Y) = P(Y | X) P(X). So: P(M | s) = P(s | M) P(M) / P(s). Here M is the conclusion and s is the evidence; P(M) and P(s) are the a priori probabilities, P(s | M) is the probability of the evidence given the conclusion.
Bayes' rule: Example. A rare disease affects 1 out of 100,000 people. A test shows positive with probability 0.99 when applied to an ill person, and with probability 0.01 when applied to a healthy person. What is the probability that you have the disease, given that you test positive?
Bayes' rule: Example. P(+ | ill) = 0.99, P(+ | healthy) = 0.01, P(ill) = 10^-5. Then P(ill | +) = P(+ | ill) P(ill) / [P(+ | ill) P(ill) + P(+ | healthy) P(healthy)] ≈ 10^-3. Happy end: it is far more likely that the test is incorrect!
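A quick numerical check of this posterior (a minimal Python sketch, using only the numbers given on the slide):

```python
# Bayes' rule: posterior probability of being ill given a positive test
p_ill = 1e-5                 # prior: 1 person out of 100,000
p_pos_given_ill = 0.99       # sensitivity of the test
p_pos_given_healthy = 0.01   # false-positive rate

p_pos = p_pos_given_ill * p_ill + p_pos_given_healthy * (1.0 - p_ill)
p_ill_given_pos = p_pos_given_ill * p_ill / p_pos
print(p_ill_given_pos)       # ~0.00099: a positive result is most likely a false positive
```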
Is the Pope an alien? Since the probability P(Pope | Human) = 1/6,000,000,000, does this imply that the Pope is not a human being? Beck-Bornholdt HP, Dubben HH, Nature 381, 730 (1996)
P(Pope | Human) is not the same as P(Human | Pope); moreover P(Alien) ≈ 0, so P(Human | Pope) ≈ 1.0. The Pope is (probably) not an alien. (S. Eddy and D. MacKay's answer)
The a priori probabilities. P(M | s) = P(s | M) P(M) / P(s). P(M) is the probability of the model (i.e. of the class described by the model) BEFORE we know the sequence: it can be estimated as the abundance of the class. P(s) is the probability of the sequence in the sequence space: it cannot be reliably estimated!
Comparison between models. We can overcome the problem by comparing the probability of generating s from different models: P(M1 | s) / P(M2 | s) = [P(s | M1) P(M1)] / [P(s | M2) P(M2)] = [P(s | M1) / P(s | M2)] · [P(M1) / P(M2)], where P(M1)/P(M2) is the ratio between the abundances of the two classes.
Null model. Otherwise we can score a sequence for a model M by comparing it to a Null model N: a model that generates ALL the possible sequences with probabilities depending ONLY on the overall statistical amino acid abundance. S(M, s) = log [P(s | M) / P(s | N)]. The score separates sequences belonging to model M from sequences NOT belonging to it; in this case we need a threshold and a statistic for evaluating the significance (E-value, P-value).
The simplest probabilistic models: Markov Models • Definition
Markov Models. Example: weather. Register the weather conditions day by day: as a first hypothesis, the weather condition on a day depends ONLY on the weather condition of the day before. States: C (Clouds), R (Rain), F (Fog), S (Sun). Define the conditional probabilities P(C|C), P(C|R), ..., P(R|C), ... The probability of the 5-day record CRRCS is P(CRRCS) = P(C) · P(R|C) · P(R|R) · P(C|R) · P(S|C).
Markov Model: a stochastic generator of sequences in which the probability of the state at position i depends ONLY on the state at position i-1. Given an alphabet C = {c1, c2, c3, ..., cN}, a Markov model is described by N×(N+2) parameters {a_rt, a_BEGIN,t, a_r,END ; r, t ∈ C}, with a_rq = P(si = q | si-1 = r), a_BEGIN,q = P(s1 = q), a_r,END = P(sT+1 = END | sT = r). Constraints: Σt a_rt + a_r,END = 1 for every r, and Σt a_BEGIN,t = 1. (Diagram: states c1 ... cN fully connected, plus BEGIN and END states.)
Markov Models. Given the sequence s = s1 s2 s3 ... sT, with si ∈ C = {c1, c2, ..., cN}: P(s | M) = P(s1) · Π(i=2..T) P(si | si-1) · P(END | sT) = a_BEGIN,s1 · Π(i=2..T) a_s(i-1),s(i) · a_sT,END. Example: P("ALKALI") = a_BEGIN,A · a_AL · a_LK · a_KA · a_AL · a_LI · a_I,END.
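As a minimal sketch (not from the slides; a_begin, a and a_end are hypothetical dictionaries holding the parameters defined above), the probability of a sequence under a first-order Markov model can be computed directly from this factorisation:

```python
import math

def markov_log_prob(seq, a_begin, a, a_end):
    """log P(seq | M) for a first-order Markov model with BEGIN/END states.

    a_begin[q] = P(s1 = q | BEGIN)
    a[r][q]    = P(si = q | si-1 = r)
    a_end[r]   = P(END | last symbol = r)
    """
    logp = math.log(a_begin[seq[0]])
    for prev, cur in zip(seq, seq[1:]):
        logp += math.log(a[prev][cur])
    return logp + math.log(a_end[seq[-1]])
```

For instance, markov_log_prob("ALKALI", a_begin, a, a_end) accumulates exactly the factors a_BEGIN,A · a_AL · a_LK · a_KA · a_AL · a_LI · a_I,END shown above (in log space, to avoid numerical underflow).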
Markov Models: Exercise. 1) Fill in the missing (?) transition probabilities of the two weather models over the states C, R, F, S, using the constraint that the probabilities leaving each state must sum to 1. (Figure: two transition diagrams, with given values such as 0.3, 0.4, 0.5, 0.2, 0.1, 0.7, 0.8, 1.0 and some entries marked "?".)
Markov Models: Exercise. 2) Which model better describes the weather in summer? Which one describes the weather in winter? (Figure: the two completed transition diagrams of the previous slide.)
Markov Models: Exercise. 3) Given the sequence CSSSCFS, which model gives the higher probability? [Consider the starting probabilities: P(X | BEGIN) = 0.25.] (Figure: the Winter and Summer transition diagrams.)
Markov Models: Exercise. P(CSSSCFS | Winter) = 0.25 × ... = 1.2 × 10^-5; P(CSSSCFS | Summer) = 0.25 × ... = 6.4 × 10^-4. 4) Can we conclude that the observed sequence refers to a summer week?
Markov Models: Exercise. P(Seq | Winter) = 1.2 × 10^-5 and P(Seq | Summer) = 6.4 × 10^-4, but P(Winter | Seq) / P(Summer | Seq) = [P(Seq | Winter) P(Winter)] / [P(Seq | Summer) P(Summer)]: the answer also depends on the a priori probabilities of the two seasons.
Simple Markov Model for DNA sequences. DNA alphabet: C = {Adenine, Cytosine, Guanine, Thymine}. 16 transition probabilities (12 of which are independent) + 4 BEGIN probabilities + 4 END probabilities. The parameters of the model are different in different zones of DNA: they describe the overall composition and the dinucleotide (pair) frequencies.
Example of Markov Models: CpG islands. (Two Markov models on {A, C, G, T}: one for CpG islands, one for non-CpG islands; example sequence GATGCGTCGC / CTACGCAGCG.) In the Markov model of CpG islands, a_CG is higher than in the Markov model of non-CpG islands. Given a sequence s we can evaluate P(CpG | s) = P(s | CpG) · P(CpG) / [P(s | CpG) · P(CpG) + P(s | nonCpG) · P(nonCpG)].
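A common way to compare the two models on a sequence is a log-odds score. The sketch below is an illustration, not part of the slides; it assumes both models are stored as hypothetical dictionaries of log transition probabilities:

```python
def log_odds(seq, log_a_plus, log_a_minus):
    """Log-odds score of seq under the CpG-island (+) vs non-island (-) Markov models.

    log_a_plus[(r, q)]  = log P(q | r) inside CpG islands
    log_a_minus[(r, q)] = log P(q | r) outside CpG islands
    Positive scores favour the CpG-island model.
    """
    return sum(log_a_plus[(r, q)] - log_a_minus[(r, q)]
               for r, q in zip(seq, seq[1:]))
```

Combining this likelihood ratio with the prior abundances P(CpG) and P(nonCpG) gives the posterior P(CpG | s) written above.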
The simplest probabilistic models: Markov Models • Definition • Training
Training of Markov Models. Let θ_M be the set of parameters of model M. During the training phase, the parameters are estimated from the set of known data D. Maximum Likelihood estimation (ML): θ_ML = argmax_θ P(D | M, θ). It can be proved that a_ik = n_ik / Σj n_ij, where n_ik is the frequency of occurrence of the transition i→k counted in the data set D. Maximum A Posteriori estimation (MAP): θ_MAP = argmax_θ P(θ | M, D) = argmax_θ [P(D | M, θ) P(θ)].
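A minimal sketch of the counting estimate (not from the slides; BEGIN/END transitions are omitted here, but they would be counted in the same way from the first and last symbol of each training sequence):

```python
from collections import defaultdict

def estimate_transitions(sequences, alphabet):
    """Maximum-likelihood transition probabilities: a[r][q] = n_rq / sum_t n_rt."""
    counts = {r: defaultdict(float) for r in alphabet}
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            counts[prev][cur] += 1.0
    a = {}
    for r in alphabet:
        total = sum(counts[r].values())
        a[r] = {q: counts[r][q] / total if total > 0 else 0.0 for q in alphabet}
    return a
```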
Maximum Likelihood training: proof sketch. Given a sequence s contained in D, s = s1 s2 s3 ... sT, we can count the number of transitions n_jk between any two states j and k, where states 0 and N+1 are BEGIN and END. Maximising the likelihood under the normalisation constraints, taken into account by means of Lagrange multipliers λ_k, yields the counting formula of the previous slide.
Hidden Markov Models • Preliminary examples
Loaded dice. We have 99 regular dice (R) and 1 loaded die (L). Emission probabilities: R: P(1) = P(2) = ... = P(6) = 1/6; L: P(1) = P(2) = P(3) = P(4) = P(5) = 1/10, P(6) = 1/2. Given a sequence of outcomes, 4156266656321636543662152611536264162364261664616263, we don't know the sequence of dice that generated it (e.g. RRRRRLRLRRRRRRRRRRRRLRLRRRRLRRRRLRR).
Loaded dice. Hypothesis: we choose a different die for each roll. Two stochastic processes give origin to the sequence of observations: 1) choosing the die (R or L); 2) rolling the die. The sequence of dice is hidden. The first process is assumed to be Markovian (in this case a 0th-order MM); the outcome of the second process depends only on the state reached in the first process (that is, on the chosen die).
Loaded dice: model. (Diagram: states R and L with self-transitions 0.99 and switching transitions 0.01.) Each state (R and L) generates a character of the alphabet C = {1, 2, 3, 4, 5, 6}. The emission probabilities depend only on the state. The transition probabilities describe a Markov model that generates a state path: the hidden sequence (π). The observation sequence (s) is generated by two concomitant stochastic processes.
Some semi-serious examples. 1) DEMOGRAPHY. Observable: number of births and/or deaths per year in a place. Hidden variable: economic state (to a first approximation, if we consider the economic "fortune" of a year a random process, the accumulated wealth is a Markov process) ---> can we reconstruct the economic state from the registry records? 2) WEATHER-SENSITIVE TEACHER. Observable: daily average of the marks recorded in the register of a weather-sensitive teacher. Hidden variable: weather conditions ---> can we reconstruct the weather from the teacher's register?
Some more serious examples. 1) SECONDARY STRUCTURE. Observable: protein sequence. Hidden variable: secondary structure ---> can we predict the secondary structure from the sequence? 2) ALIGNMENTS. Observable: protein sequence. Hidden variable: position of each residue in the alignment of a protein family ---> can we reconstruct the alignment of the sequence to the protein family starting from the sequence?
CpG islands. Given a long non-annotated DNA sequence, we want to localise the CpG islands (if they exist). We build a model that unifies the two Markov models for CpG islands and non-CpG islands: states G+, C+, A+, T+ (island) and G-, C-, A-, T- (non-island), with transitions added between any state of the first set and any state of the second one. Now there is no one-to-one correspondence between states and symbols.
CpG islands: conditioning events. Instead of that model, can we use a model similar to that of the dice example, with only two states (CpG / non-CpG) emitting characters from the alphabet C = {A, C, G, T}? Using such a model, all the characters of the generated sequence would be independent of the preceding character.
Hidden Markov Models • Preliminary examples • Formal definition
Formal definition of Hidden Markov Models. An HMM is a stochastic generator of sequences characterised by:
· N states;
· a set of transition probabilities between states {a_kl}: a_kl = P(π(i) = l | π(i-1) = k);
· a set of starting probabilities {a_0k}: a_0k = P(π(1) = k);
· a set of ending probabilities {a_k0}: a_k0 = P(π(i) = END | π(i-1) = k);
· an alphabet C with M characters;
· a set of emission probabilities for each state {e_k(c)}: e_k(c) = P(si = c | π(i) = k).
Here s is the sequence and π is the path through the states. Constraints: Σk a_0k = 1; a_k0 + Σj a_kj = 1 for every k; Σ(c∈C) e_k(c) = 1 for every k.
Generating a sequence with an HMM. Choose the initial state π(1) following the probabilities a_0k and set i = 1. Choose the character si from the alphabet C following the probabilities e_k(c). Choose the next state following the probabilities a_kj and a_k0. If the END state is chosen, stop; otherwise set i ← i+1 and iterate.
CpG island, simple model. Two states: Y (CpG island) and N (non-CpG island), plus BEGIN and END. Transitions: a_0Y = 0.2, a_0N = 0.8; a_YY = 0.7, a_YN = 0.2, a_Y0 = 0.1; a_NN = 0.8, a_NY = 0.1, a_N0 = 0.1. Emissions: e_Y(A) = 0.1, e_Y(G) = 0.4, e_Y(C) = 0.4, e_Y(T) = 0.1; e_N(A) = e_N(G) = e_N(C) = e_N(T) = 0.25. s: AGCGCGTAATCTG, π: YYYYYYYNNNNNN. P(s, π | M) can be easily computed.
P(s, π | M) can be easily computed. (Same CpG model as the previous slide.) s: A G C G C G T A A T C T G; π: Y Y Y Y Y Y Y N N N N N N. Emission factors: 0.1 · 0.4 · 0.4 · 0.4 · 0.4 · 0.4 · 0.1 · 0.25^6. Transition factors: 0.2 (BEGIN→Y) · 0.7^6 · 0.2 (Y→N) · 0.8^5 · 0.1 (N→END). Multiplying all the probabilities gives the probability of having the sequence AND the path through the states.
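A minimal sketch of this computation (not from the slides; a0, a, a_end and e are hypothetical dictionaries holding the HMM parameters listed above):

```python
import math

def joint_log_prob(seq, path, a0, a, a_end, e):
    """log P(sequence, path | M) for an HMM.

    a0[k]    = P(first state = k)
    a[k][l]  = P(next state = l | current state = k)
    a_end[k] = P(END | last state = k)
    e[k][c]  = P(emit character c | state k)
    """
    logp = math.log(a0[path[0]]) + math.log(e[path[0]][seq[0]])
    for i in range(1, len(seq)):
        logp += math.log(a[path[i - 1]][path[i]]) + math.log(e[path[i]][seq[i]])
    return logp + math.log(a_end[path[-1]])
```

Called on s = "AGCGCGTAATCTG" and path = "YYYYYYYNNNNNN" with the toy parameters above, it accumulates exactly the emission and transition factors listed on this slide.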
Evaluation of the joint probability of the sequence and the path
Hidden Markov Models • Preliminary examples • Formal definition • Three questions
CpG island, simple model. (Same parameters as before.) s: AGCGCGTAATCTG, π: ????????????? P(s, π | M) can be easily computed. How can we evaluate P(s | M)?
How to evaluate P(s | M)? (Same CpG model.) P(s | M) = Σπ P(s, π | M). For s = AGCGCGTAATCTG each of the 13 positions can be in state Y or N, so there are 2^13 different paths (1: YYYY...; 2: YYY...N; 3: YY...NY; ...). Summing over all the paths gives the probability of having the sequence.
Resumé: CpG island, simple model. (Same parameters.) s: AGCGCGTAATCTG, π: ????????????? P(s, π | M) can be easily computed. How to evaluate P(s | M)? Can we show the hidden path?
Can we show the hidden path? (Same CpG model.) π* = argmax_π [P(π | s, M)] = argmax_π [P(π, s | M)]. Among the 2^13 different paths for s = AGCGCGTAATCTG, the Viterbi path is the path that gives the best joint probability.
Can we show the hidden path? A posteriori decoding: for each position i choose the state π(i) = argmax_k [P(π(i) = k | s, M)]. The contribution to this probability derives from all the paths that go through state k at position i. The a posteriori path can be a nonsense path (it may not be a legitimate path if some transitions are not permitted in the model).
CpG island, simple model with unknown parameters (all transition and emission probabilities set to "?"). s: AGCGCGTAATCTG, π: YYYYYYYNNNNNN. P(s, π | M) can be easily computed. How to evaluate P(s | M)? Can we show the hidden path? Can we evaluate the parameters starting from known examples?
Can we evaluate the parameters starting from known examples? For the labelled example s: AGCGCGTAATCTG, π: YYYYYYYNNNNNN, the joint probability is the product of the emission terms e_Y(A) e_Y(G) e_Y(C) ... e_N(G) and of the transition terms a_0Y a_YY ... a_NN a_N0. How do we find the parameters e and a that maximise this probability? And what if we don't know the path?
Hidden Markov Models: Algorithms • Resumé • Evaluating P(s | M): Forward Algorithm
Computing P(s, π | M) for each path is a redundant operation. (Same CpG model.) Two paths that differ only in the last state, e.g. Y...YY and Y...YN for the same sequence, share all the emission and transition factors except those of the last position.
Computing P(s, π | M) for each path is a redundant operation. (Trellis diagram: states L and R versus iterations 0, 1, 2, 3, ..., T-1, T, with BEGIN and END.) If we compute the common part only once, we save 2·(T-1) operations.
Summing over all the possible paths. (Same CpG model.) For the first two characters AG: P(AG, YY) = 0.2·0.1·0.7·0.4 = 0.0056; P(AG, NY) = 0.8·0.25·0.1·0.4 = 0.008; P(AG, YN) = 0.2·0.1·0.2·0.25 = 0.001; P(AG, NN) = 0.8·0.25·0.8·0.25 = 0.04. Summing the paths that end in Y: 0.0136; summing the paths that end in N: 0.041.
Summing over all the possible paths. Adding the third character C: F_Y(3) = [0.0136·0.7 + 0.041·0.1]·e_Y(C) = 0.01362·0.4 = 0.005448; F_N(3) = [0.0136·0.2 + 0.041·0.8]·e_N(C) = 0.03552·0.25 = 0.00888.
Summing over all the possible paths. Iterating until the last position of the sequence and adding the END transitions: P(s | M) = F_Y(T)·a_Y0 + F_N(T)·a_N0 (here a_Y0 = a_N0 = 0.1).
Summing over all the possible paths. (Trellis diagram: states L and R versus iterations 0 ... T.) If we know the probabilities of emitting the first two characters of the sequence ending the path in states R and L respectively, F_R(2) = P(s1, s2, π(2) = R | M) and F_L(2) = P(s1, s2, π(2) = L | M), then we can compute: P(s1, s2, s3, π(3) = R | M) = F_R(2)·a_RR·e_R(s3) + F_L(2)·a_LR·e_R(s3).
Forward Algorithm. On the basis of the preceding observations, the computation of P(s | M) can be decomposed into simpler problems. For each state k and each position i in the sequence, we compute F_k(i) = P(s1 s2 s3 ... si, π(i) = k | M).
Initialisation: F_BEGIN(0) = 1; F_k(0) = 0 for every k ≠ BEGIN (from now on the conditioning on M is understood).
Recurrence: F_l(i+1) = P(s1 s2 ... si si+1, π(i+1) = l) = Σk P(s1 s2 ... si, π(i) = k) a_kl e_l(si+1) = e_l(si+1) Σk F_k(i) a_kl.
Termination: P(s) = P(s1 s2 ... sT, π(T+1) = END) = Σk P(s1 s2 ... sT, π(T) = k) a_k0 = Σk F_k(T) a_k0.
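A minimal sketch of the recurrence (not from the slides; plain probabilities are used for clarity, while a real implementation would work in log space or with scaling to avoid underflow):

```python
def forward(seq, states, a0, a, a_end, e):
    """Forward algorithm: returns F (F[i][k] = P(s1..si, state at i = k), i = 1..T) and P(s | M)."""
    T = len(seq)
    F = [{k: 0.0 for k in states} for _ in range(T + 1)]
    for k in states:                       # initialisation: BEGIN -> k, emit s1
        F[1][k] = a0[k] * e[k][seq[0]]
    for i in range(2, T + 1):              # recurrence
        for l in states:
            F[i][l] = e[l][seq[i - 1]] * sum(F[i - 1][k] * a[k][l] for k in states)
    p_seq = sum(F[T][k] * a_end[k] for k in states)   # termination
    return F, p_seq
```

With the toy CpG parameters of the previous slides, forward("AGC", ["Y", "N"], ...) reproduces the hand computation: F[2] = {Y: 0.0136, N: 0.041} and F[3] = {Y: 0.005448, N: 0.00888}.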
Forward Algorithm. (Trellis diagram over states and iterations 0 ... T+1: F_B(2) = e_B(s2) · Σi F_i(1)·a_iB; the last column gives P(s | M) at the END state.)
Forward algorithm: computational complexity. Naïve method: P(s | M) = Σπ P(s, π | M). There are N^T possible paths and each path requires about 2T operations, so the time for the computation is O(T·N^T).
Forward algorithm: computational complexity. Forward algorithm: T positions, N values for each position; each element requires about 2N products and 1 sum, so the time for the computation is O(T·N²).
Forward algorithm: computational complexity. (Comparison: naïve method, O(T·N^T), versus forward algorithm, O(T·N²).)
Hidden Markov Models: Algorithms • Resumé • Evaluating P(s | M): Forward Algorithm • Evaluating P(s | M): Backward Algorithm
Backward Algorithm. Similar to the Forward algorithm: it computes P(s | M), reconstructing the sequence from the end. For each state k and each position i in the sequence, we compute B_k(i) = P(si+1 si+2 si+3 ... sT | π(i) = k).
Initialisation: B_k(T) = P(π(T+1) = END | π(T) = k) = a_k0.
Recurrence: B_l(i-1) = P(si si+1 ... sT | π(i-1) = l) = Σk P(si+1 si+2 ... sT | π(i) = k) a_lk e_k(si) = Σk B_k(i) e_k(si) a_lk.
Termination: P(s) = P(s1 s2 s3 ... sT | π(0) = BEGIN) = Σk P(s2 ... sT | π(1) = k) a_0k e_k(s1) = Σk B_k(1) a_0k e_k(s1).
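A sketch mirroring the forward function above (same hypothetical parameter dictionaries, same caveat about underflow):

```python
def backward(seq, states, a0, a, a_end, e):
    """Backward algorithm: B[i][k] = P(s_{i+1}..s_T | state at position i = k), i = 1..T."""
    T = len(seq)
    B = [{k: 0.0 for k in states} for _ in range(T + 1)]
    for k in states:                       # initialisation at the last position
        B[T][k] = a_end[k]
    for i in range(T - 1, 0, -1):          # recurrence, moving towards the beginning
        for l in states:
            B[i][l] = sum(a[l][k] * e[k][seq[i]] * B[i + 1][k] for k in states)
    p_seq = sum(a0[k] * e[k][seq[0]] * B[1][k] for k in states)   # termination
    return B, p_seq
```

Both functions return the same P(s | M); computing the two matrices is what the Baum-Welch training below relies on.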
Backward Algorithm. (Trellis diagram: B_B(T-1) = Σk a_Bk · e_k(sT) · B_k(T); the matrix is filled from iteration T back to iteration 0, where P(s | M) is obtained at the BEGIN state.)
Hidden Markov Models: Algorithms • Resumé • Evaluating P(s | M): Forward Algorithm • Evaluating P(s | M): Backward Algorithm • Showing the path: Viterbi decoding
Finding the best path. (Same CpG model.) For the first two characters AG: P(AG, YY) = 0.0056; P(AG, NY) = 0.008; P(AG, YN) = 0.001; P(AG, NN) = 0.04. Taking the maximum over the paths ending in Y: V_Y(2) = 0.008; over the paths ending in N: V_N(2) = 0.04.
Finding the best path. Adding the third character C: V_Y(3) = max(0.008·0.7, 0.04·0.1)·0.4 = 0.00224 (best path ending in Y); V_N(3) = max(0.008·0.2, 0.04·0.8)·0.25 = 0.008 (best path ending in N).
Finding the best path. Iterating until the last position of the sequence and including the END transitions (a_Y0 = a_N0 = 0.1), we choose the maximum between the best path ending in Y and the best path ending in N.
Viterbi Algorithm. π* = argmax_π [P(π, s | M)]. The computation of P(s, π* | M) can be decomposed into simpler problems. Let V_k(i) be the probability of the most probable path generating the subsequence s1 s2 s3 ... si and ending in state k at iteration i.
Initialisation: V_BEGIN(0) = 1; V_k(0) = 0 for every k ≠ BEGIN.
Recurrence: V_l(i+1) = e_l(si+1) · max_k [V_k(i) a_kl]; ptr_(i+1)(l) = argmax_k [V_k(i) a_kl].
Termination: P(s, π*) = max_k [V_k(T) a_k0].
Traceback: π*(T) = argmax_k [V_k(T) a_k0]; π*(i-1) = ptr_i(π*(i)).
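A minimal sketch of the recurrence with traceback (not from the slides; plain probabilities, same hypothetical parameter dictionaries as in the forward sketch):

```python
def viterbi(seq, states, a0, a, a_end, e):
    """Viterbi decoding: most probable state path for seq and its joint probability."""
    T = len(seq)
    V = [{} for _ in range(T)]
    ptr = [{} for _ in range(T)]
    for k in states:                                   # initialisation
        V[0][k] = a0[k] * e[k][seq[0]]
        ptr[0][k] = None
    for i in range(1, T):                              # recurrence with back-pointers
        for l in states:
            best_k = max(states, key=lambda k: V[i - 1][k] * a[k][l])
            V[i][l] = e[l][seq[i]] * V[i - 1][best_k] * a[best_k][l]
            ptr[i][l] = best_k
    last = max(states, key=lambda k: V[T - 1][k] * a_end[k])   # termination
    best_p = V[T - 1][last] * a_end[last]
    path = [last]                                      # traceback
    for i in range(T - 1, 0, -1):
        path.append(ptr[i][path[-1]])
    path.reverse()
    return path, best_p
```

The structure is identical to the forward sketch, with the sum over the previous states replaced by a maximum plus a back-pointer.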
Viterbi Algorithm. (Trellis diagram: V_B(2) = e_B(s2) · max_i [V_i(1)·a_iB], with back-pointer ptr_2(B); the last column gives P(s, π* | M) at the END state.)
Viterbi Algorithm. (Trellis diagram showing the traceback of the Viterbi path from END back to BEGIN over iterations 0 ... T+1.) Different paths can have the same probability.
Hidden Markov Models: Algorithms • Resumé • Evaluating P(s | M): Forward Algorithm • Evaluating P(s | M): Backward Algorithm • Showing the path: Viterbi decoding • Showing the path: A posteriori decoding • Training a model: EM algorithm
If we know the path generating the training sequence, just count! For s: AGCGCGTAATCTG with path YYYYYYYNNNNNN, the joint probability is a product of emission and transition terms, and the ML parameters are obtained by counting. Example: a_YY = n_YY / (n_YY + n_YN) = 6/7; e_Y(A) = n_Y(A) / [n_Y(A) + n_Y(C) + n_Y(G) + n_Y(T)] = 1/7.
Expectation-Maximisation algorithm. We need to estimate the Maximum Likelihood parameters when the paths generating the training sequences are unknown: θ_ML = argmax_θ [P(s | θ, M)]. Given a model with parameters θ0, the EM algorithm finds new parameters θ that increase the likelihood of the model: P(s | θ) > P(s | θ0).
Expectation-Maximisation algorithm. (All the possible paths 1, 2, 3, 4, ... for the training sequence are considered.) Given a path π we can count the number of transitions between states k and l, A_kl(π), and the number of emissions of character c from state k, E_k(c, π). We can compute their expected values over all the paths: A_kl = Σπ P(π | s, θ0) · A_kl(π); E_k(c) = Σπ P(π | s, θ0) · E_k(c, π). The updated parameters are: a_kl = A_kl / Σ(m=1..N) A_km; e_k(c) = E_k(c) / Σc' E_k(c').
Expectation-Maximisation algorithm. Given a model with parameters θ0, the EM algorithm finds new parameters θ that increase the likelihood of the model: P(s | θ) > P(s | θ0), or equivalently log P(s | θ) > log P(s | θ0).
Expectation-Maximisation algorithm. log P(s | θ) = log P(s, π | θ) - log P(π | s, θ). Multiplying by P(π | s, θ0) and summing over all the possible paths: log P(s | θ) = Σπ P(π | s, θ0)·log P(s, π | θ) - Σπ P(π | s, θ0)·log P(π | s, θ). The first term is Q(θ | θ0), the expectation value of log P(s, π | θ) over all the "current" paths. Then log P(s | θ) - log P(s | θ0) = Q(θ | θ0) - Q(θ0 | θ0) + Σπ P(π | s, θ0)·log [P(π | s, θ0)/P(π | s, θ)], and the last term (a relative entropy) is ≥ 0: increasing Q(θ | θ0) above Q(θ0 | θ0) increases the likelihood.
Expectation-Maximisation algorithm. The EM algorithm is an iterative process; each iteration performs two steps. E-step: evaluation of Q(θ | θ0) = Σπ P(π | s, θ0)·log P(s, π | θ). M-step: maximisation of Q(θ | θ0) over all θ. It is NOT guaranteed to converge to the GLOBAL maximum likelihood.
Baum-Welch implementation of the EM algorithm. E-step: Q(θ | θ0) = Σπ P(π | s, θ0)·log P(s, π | θ). Since P(s, π | θ) = a_0,π(1) · Π(i=1..T) a_π(i),π(i+1) e_π(i)(si) = Π(k=0..N) Π(l=1..N) a_kl^A_kl(π) · Π(k=1..N) Π(c∈C) e_k(c)^E_k(c,π), where A_kl(π) is the number of transitions between states k and l in path π and E_k(c, π) is the number of emissions of character c in path π, and defining the expected values over all the "current" paths A_kl = Σπ P(π | s, θ0)·A_kl(π) and E_k(c) = Σπ P(π | s, θ0)·E_k(c, π), we obtain: Q(θ | θ0) = Σ(k=0..N) Σ(l=1..N) A_kl·log a_kl + Σ(k=1..N) Σ(c∈C) E_k(c)·log e_k(c).
Baum-Welch implementation of the EM algorithm. M-step: maximise Q(θ | θ0) under the constraints Σl a_kl = 1 for every state k and Σc e_k(c) = 1 for every state k. By means of Lagrange multipliers, the maximisation gives: a_kl = A_kl / Σ(m=1..N) A_km; e_k(c) = E_k(c) / Σc' E_k(c').
Baum-Welch implementation of the EM algorithm. How to compute the expected numbers of transitions and emissions over all the paths, using the forward and backward matrices F_k(i) = P(s1 s2 s3 ... si, π(i) = k) and B_k(i) = P(si+1 si+2 si+3 ... sT | π(i) = k): A_kl = Σi P(π(i) = k, π(i+1) = l | s, θ) = Σi F_k(i) a_kl e_l(si+1) B_l(i+1) / P(s); E_k(c) = Σ(i: si = c) P(π(i) = k | s, θ) = Σ(i: si = c) F_k(i) B_k(i) / P(s).
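A sketch of how these expected counts could be accumulated from the forward and backward matrices of the earlier sketches (F[i][k] and B[i][k] with i = 1..T); the BEGIN and END pseudo-transitions are omitted here for brevity:

```python
def expected_counts(seq, states, a, e, F, B, p_seq):
    """Expected transition (A) and emission (E) counts for one training sequence."""
    T = len(seq)
    A = {k: {l: 0.0 for l in states} for k in states}
    E = {k: {} for k in states}
    for i in range(1, T):                  # transitions between positions i and i+1
        for k in states:
            for l in states:
                A[k][l] += F[i][k] * a[k][l] * e[l][seq[i]] * B[i + 1][l] / p_seq
    for i in range(1, T + 1):              # emissions at position i
        c = seq[i - 1]
        for k in states:
            E[k][c] = E[k].get(c, 0.0) + F[i][k] * B[i][k] / p_seq
    return A, E
```

Normalising these counts as in the M-step of the previous slide gives the updated parameters a_kl and e_k(c).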
Baum-Welch implementation of the EM algorithm. Algorithm: 1) start with random parameters; 2) compute the Forward and Backward matrices on the known sequences; 3) compute A_kl and E_k(c), the expected numbers of transitions and emissions; 4) update a_kl from A_kl and e_k(c) from E_k(c) (M-step); 5) if P(s | M) has increased, go back to step 2, otherwise stop.
Profile HMMs • HMMs for alignments
How to align? A chain of match states M0 M1 M2 M3 M4 M5: each state represents a position in the alignment. ACGGTA maps to M0 M1 M2 M3 M4 M5; ACGATC maps to M0 M1 M2 M3 M4 M5; but what about a sequence that does not fit the length of the model (A → M0, then ?)?
Silent states. Each state represents a position in the alignment. Direct (red) transitions between match states allow gaps, but fully connecting the match states requires on the order of N² transitions. We can reduce the number of parameters by using silent states that do not emit any character: then 4N-5 transitions are enough.
Profile HMMs. Delete states D1-D4 (silent), insert states I0-I4, match states M0-M5. Example paths: ACGGTA → M0 M1 M2 M3 M4 M5; a longer sequence passes through insert states (e.g. M0 I0 M1 ...); a shorter sequence passes through delete states (e.g. M0 D1 ... or ... D3 M4 ...).
Example of alignment. Sequence 1, ASTRAL: Viterbi path M0 M1 M2 M3 M4 M5. Sequence 2, ASTAIL: Viterbi path M0 M1 M2 D3 M4 I4 M5. Sequence 3, ARTI: Viterbi path M0 M1 M2 D3 D4 M5.
Example of alignment. The Viterbi paths define the multiple alignment: ASTRAL → M0 M1 M2 M3 M4 M5 (Sequence 1); ASTAIL → M0 M1 M2 D3 M4 I4 M5 (Sequence 2); ARTI → M0 M1 M2 D3 D4 M5 (Sequence 3). -log P(s | M) is an alignment score.
Searching for a structural/functional pattern in protein sequences. Zn binding loop (small alignment of example sequences over the conserved C ... C ... H positions): the Cysteines can be replaced by an Aspartic acid, but only ONCE in each sequence.
Searching for a structural/functional pattern in protein sequences. (Profile HMM with match states M0-M7, insert states I0-I6 and delete states D1-D6.) ..ALCPCHCLCRICPLIY.. obtains a higher probability than ..WERWDHCIDSICLKDE.. because M0 and M4 have low emission probability for Aspartic acid and we would multiply them twice.
Profile HMMs • HMMs for alignments • Example on globins
Structural alignment of globins. Bashford D, Chothia C & Lesk AM (1987) Determinants of a protein fold: unique features of the globin amino acid sequences. J. Mol. Biol. 196, 199-216
Alignment of globins reconstructed with profile HMMs. Krogh A, Brown M, Mian IS, Sjolander K & Haussler D (1994) Hidden Markov Models in computational biology: applications to protein modelling. J. Mol. Biol. 235, 1501-1531
Discrimination power of profile HMMs. Z-score = [log P(s | M) - <log P(s | M)>] / σ(log P(s | M)). Krogh A, Brown M, Mian IS, Sjolander K & Haussler D (1994) Hidden Markov Models in computational biology: applications to protein modelling. J. Mol. Biol. 235, 1501-1531
Profile HMMs • HMMs for alignments • Example on globins • Other applications
Finding a domain. (Diagram: a profile HMM specific for the considered domain, embedded between Begin and End states, with insert states such as I2 modelling the rest of the sequence.)
Clustering subfamilies. (Diagram: HMM1, HMM2, HMM3, ..., HMMn in parallel between BEGIN and END.) Each sequence s contributes to the update of HMMi with a weight equal to P(s | Mi).
Profile HMMs • HMMs for alignments • Example on globins • Other applications • Available codes and servers
HMMER at WUSTL: http://hmmer.wustl.edu/ Eddy SR (1998) Profile hidden Markov models. Bioinformatics 14: 755-763
HMMER applications: http://www.sanger.ac.uk/Software/Pfam/
SAM at UCSC: http://www.soe.ucsc.edu/research/compbio/sam.html Krogh A, Brown M, Mian IS, Sjolander K & Haussler D (1994) Hidden Markov Models in computational biology: applications to protein modelling. J. Mol. Biol. 235, 1501-1531
SAM applications: http://www.cse.ucsc.edu/research/compbio/HMM-apps/T02-query.html
HMMPRO: http://www.netid.com/html/hmmpro.html Pierre Baldi, Net-ID
HMMs for Mapping problems • Mapping problems in protein prediction
Secondary structure. Covalent structure: TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN. Secondary structure: EEEE..HHHHHH.EEEE... 3D structure (Nt ... Ct).
Topology of membrane proteins: position of the transmembrane segments along the sequence (e.g. ALALMLCMLTYRHKELKLKLKK); topography. (Figure: β-barrel porin from Rhodobacter capsulatus in the outer membrane and α-helical bacteriorhodopsin from Halobacterium salinarum in the inner membrane; lipid bilayer.)
HMMs for Mapping problems • Mapping problems in protein prediction • Labelled HMMs
HMM for secondary structure prediction. Simplest model: three fully connected states a, b, c. Introducing a grammar: the a label is modelled by a chain of states a1 a2 a3, the b label by b1 b2, plus a single state c.
HMM for secondary structure prediction: labels. The states a1, a2 and a3 share the same label, and so do states b1 and b2. Decoding the Viterbi path for a sequence s gives a mapping between the sequence s and a sequence of labels y. Example: s (sequence): S A L K M N Y T R E I M V A S N Q; π (path): c a1 a2 a3 a4 a4 c c b1 b2 b2 c c c ...; Y(π) (labels): c a a a a a c c b b b c c c ...
Computing P(s, y | M). Only the paths whose labelling is y have to be considered in the sum. In the Forward and Backward algorithms this means setting F_k(i) = 0 and B_k(i) = 0 if Y(k) ≠ yi. Example: s: S A L K M N Y T R E I M V A S N Q; y (labels): c a a a a a c c b b b c c c ...; the states a1 ... a4 carry label a, b1 and b2 carry label b, c carries label c.
Baum-Welch training algorithm for labelled HMMs. Given a set of known labelled sequences (e.g. amino acid sequences and their native secondary structure), we want to find the parameters of the model without knowing the generating paths: θ_ML = argmax_θ [P(s, y | θ, M)]. The algorithm is the same as in the non-labelled case if we use the Forward and Backward matrices defined in the previous slide. This is a supervised learning of the mapping.
HMMs for Mapping problems • Mapping problems in protein prediction • Labelled HMMs • Duration modelling
Self loops and geometric decay. A state with self-transition probability p (and exit probability 1-p) generates segments of length l with P(l) = p^(l-1)·(1-p). The length distribution of the generated segments is always exponential-like (curves for p = 0.9, 0.5, 0.1).
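A tiny numerical sketch (not from the slides) of this geometric length distribution and of its mean, which is 1/(1-p):

```python
# Length distribution generated by a single self-looping state (p = stay, 1-p = leave)
p = 0.9
probs = {l: p ** (l - 1) * (1 - p) for l in range(1, 101)}
mean_length = sum(l * pl for l, pl in probs.items())
print(mean_length)   # close to 1/(1-p) = 10 (the sum is truncated at l = 100)
```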
How can we model other length distributions? Limited case: a chain of states 1 → 2 → 3 → ... → N, where each state i can also exit to End with probability pi. (Plot: an arbitrary target distribution P(1), P(2), ..., P(8).) This topology can model any length distribution between 1 and N.
How can we model other length distributions? Non-limited case: the same chain of states 1, 2, ..., N, but the last state has a self loop with probability pN and exits with probability pN+1. This topology can model any length distribution between 1 and N-1 and a geometric decay from N onwards.
Secondary structure: length statistic
Secondary structure: model. (Diagram: chains of states a1 ... a14 for helix, b1 ... b5 for strand, c1 ... c4 for coil, reproducing the observed length statistics.) Do we use the same emission probabilities for states sharing the same label?
HMMs for Mapping problems • Mapping problems in protein prediction • Labelled HMMs • Duration modelling • Models for membrane proteins
(Figure: β-barrel porin from Rhodobacter capsulatus, outer membrane; α-helical bacteriorhodopsin from Halobacterium salinarum, inner membrane; lipid bilayer.)
Topology of membrane proteins: position of the transmembrane segments along the sequence (e.g. ALALMLCMLTYRHKELKLKLKK); topography. (Figure: β-barrel porin from Rhodobacter capsulatus in the outer membrane and α-helical bacteriorhodopsin from Halobacterium salinarum in the inner membrane; lipid bilayer.)
A generic model for membrane proteins (TMHMM) End Outer Side Transmembrane Begin Inner Side
Model of β-barrel membrane proteins. (Diagram: Begin/End states, outer-side states, transmembrane states, inner-side states.)
Model of β-barrel membrane proteins: labels. Transmembrane states and loop states.
Model of β-barrel membrane proteins. Length of the transmembrane β-strands: minimum 6 residues, maximum unbounded.
Model of β-barrel membrane proteins. Six different sets of emission parameters: outer loop, TM strand edges, inner loop, long globular domains, TM strand core.
Model of α-helical membrane proteins (HMM1). (Diagram: outer-side, transmembrane and inner-side state chains; segments are modelled by chains of tied states repeated ×10, ×12, ×13.)
Model of α-helical membrane proteins (HMM2). (Diagram: outer-side, transmembrane (chain of ×10 states) and inner-side state chains.)
Dynamic programming filtering procedure
Dynamic programming filtering procedure Maximum-scoring subsequences with constrained segment length and number
Dynamic programming filtering procedure Maximum-scoring subsequences with constrained segment length and number
Predictors of alpha-helical transmembrane topology: www.cbs.dtu.dk/services/TMHMM
Hybrid systems: Basics • Sequence profile based HMMs
Sequence profile. From a multiple sequence alignment (MSA), for each alignment position we compute the frequency (percentage) of each of the 20 residue types: the profile is a position × residue matrix. (Table: example profile over positions 1-10 and residues A, C, D, E, F, G, H, K, I, L, M, N, P, Q, R, S, T, V, W, Y.)
Sequence-profile-based HMM. Instead of a sequence of characters st, the model emits a sequence of M-dimensional vectors vt (one profile column per position; for proteins M = 20). Constraints: vt(n) ≥ 0 for all t, n and Σ(n=1..M) vt(n) = St.
Sequence-profile-based HMM. Probability of emission of a profile vector from state k: for a sequence of characters, P(st | k) = e_k(st); for a sequence of M-dimensional vectors, P(vt | k) = (1/Z) Π(n=1..M) e_k(n)^vt(n), where the normalisation constant Z is independent of the state. If vt reduces to a single character st, the two definitions coincide. Algorithms for training and probability computation can be derived as in the standard case.
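Since Z is the same for every state, it cancels in any comparison between states; a minimal sketch (not from the slides) of the emission score up to that constant:

```python
import math

def profile_log_emission(v, e_k):
    """log P(v | state k) up to the state-independent constant log Z.

    v[n]   = weight of character n in the profile column (v[n] >= 0)
    e_k[n] = emission probability of character n in state k
    """
    return sum(w * math.log(e_k[n]) for n, w in v.items() if w > 0)
```

For a one-hot column (a single character with weight 1) this reduces to log e_k(st), as in the standard HMM.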
Hybrid systems: Basics • Sequence profile based HMMs • Membrane protein topology
Scoring the prediction. 1) Accuracy: Q2 = P/N, where P is the total number of correctly predicted residues and N is the total number of residues. 2) Correlation coefficient: C(s) = [p(s)·n(s) - u(s)·o(s)] / [(p(s)+u(s))(p(s)+o(s))(n(s)+u(s))(n(s)+o(s))]^(1/2), where, for each class s, p(s) and n(s) are respectively the numbers of correct predictions and of correctly rejected assignments, while u(s) and o(s) are the numbers of under- and over-predictions. 3) Accuracy for each discriminated structure s: Q(s) = p(s)/[p(s)+u(s)], with p(s) and u(s) as in equation 2. 4) Probability of correct predictions: P(s) = p(s)/[p(s)+o(s)], with p(s) and o(s) as in equation 2. 5) Segment-based measure (Sov).
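A minimal sketch (not from the slides) of the per-class measures defined above, computed from the four counts of a class:

```python
def per_class_scores(p, n, u, o):
    """Per-class measures: p = correct predictions (true positives), n = correctly
    rejected (true negatives), u = under-predictions (false negatives),
    o = over-predictions (false positives)."""
    q_s = p / (p + u)                                   # coverage, Q(s)
    p_s = p / (p + o)                                   # probability of correct prediction, P(s)
    c_s = (p * n - u * o) / ((p + u) * (p + o) * (n + u) * (n + o)) ** 0.5
    return q_s, p_s, c_s
```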
Topology of β-barrel membrane proteins: performance of the sequence-profile-based HMM.
Method | Q2 | QTMS | Qloop | PTMS | Ploop | Corr | Sov
HMM based on multiple sequence alignment | 83% | 83% | 82% | 79% | 85% | 0.65 | 0.83
Standard HMM based on single sequence | 76% | 77% | 76% | 72% | 80% | 0.53 | 0.64
NN based on multiple sequence alignment | 78% | 74% | 82% | 81% | 76% | 0.56 | 0.79
Martelli PL, Fariselli P, Krogh A, Casadio R. A sequence-profile-based HMM for predicting and discriminating beta-barrel membrane proteins. Bioinformatics 18: S46-S53 (2002)
Topology of β-barrel membrane proteins. Discriminative power of the profile-based HMM: I(s | M) = -(1/L)·log P(s | M).
The Bologna predictor for the topology of all-α membrane proteins. (Pipeline: sequence profiles feed an NN, HMM1 and HMM2; their outputs are combined (Σ) and filtered with MaxSubSeq to obtain the topography; the von Heijne rule then gives the topology prediction.)
Topology of all-α membrane proteins: performance on the test set. Martelli PL, Fariselli P, Casadio R. An ENSEMBLE machine learning approach for the prediction of all-alpha membrane proteins. Bioinformatics (2003)
HMM: Application in gene finding • Basics
Eukaryotic gene structure A. Krogh
Simple model for coding regions A. Krogh
Simple model for unspliced gene A. Krogh
Simple model for spliced gene A. Krogh