Introduction to Profile Hidden Markov Models
Mark Stamp

Hidden Markov Models
- Here, we assume you know about HMMs
  - If not, see "A revealing introduction to hidden Markov models"
- Executive summary of HMMs
  - HMM is a machine learning technique
  - Also, a discrete hill climb technique
  - Train the model based on an observation sequence
  - Score a given sequence to see how closely it matches the model
  - Efficient algorithms, many useful applications

HMM Notation
- Recall, an HMM is denoted λ = (A, B, π)
  - A is the state transition probability matrix
  - B is the observation probability matrix
  - π is the initial state distribution
- The observation sequence is O
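
As a concrete reference point, here is a minimal Python sketch of scoring under this notation via the forward algorithm; it omits the scaling used in practice, and the function name and array layout are illustrative, not from the slides.

```python
import numpy as np

def hmm_score(O, A, B, pi):
    """Forward algorithm: compute P(O | lambda) for lambda = (A, B, pi).
    A[i][j] = P(state j at t+1 | state i at t), B[i][k] = P(symbol k | state i),
    pi[i] = initial state distribution; O is a list of symbol indices."""
    alpha = pi * B[:, O[0]]            # alpha_1(i) = pi_i * b_i(O_1)
    for t in range(1, len(O)):
        alpha = (alpha @ A) * B[:, O[t]]
    return alpha.sum()                 # P(O | lambda)
```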

Hidden Markov Models
- Among the many uses for HMMs...
  - Speech analysis
  - Music search engines
  - Malware detection
  - Intrusion detection systems (IDS)
  - Many more, and more all the time

Limitations of HMMs
- Positional information is not considered
  - The HMM has no "memory"
  - Higher order models have some memory
  - But no explicit use of positional information
- Does not handle insertions or deletions
- These limitations are serious problems in some applications
  - In bioinformatics string comparison, sequence alignment is critical
  - Also, insertions and deletions occur

Profile HMM
- The profile HMM (PHMM) is designed to overcome the limitations on the previous slide
  - In some ways, the PHMM is easier than the HMM
  - In some ways, the PHMM is more complex
- The basic idea of the PHMM
  - Define multiple B matrices
  - Almost like having an HMM for each position in the sequence

PHMM
- In bioinformatics, begin by aligning multiple related sequences
  - Multiple sequence alignment (MSA)
  - This is like the training phase for an HMM
- Generate the PHMM based on the given MSA
  - Easy, once the MSA is known
  - The hard part is generating the MSA
- Then we can score sequences using the PHMM
  - Forward algorithm, like the HMM

Generic View of PHMM
- Circles are delete states
- Diamonds are insert states
- Rectangles are match states
  - Match states correspond to HMM states
- Arrows are possible transitions
  - Each transition has an associated probability
- Transition probabilities form the A matrix
- Emission probabilities form the B matrices
  - In a PHMM, observations are emissions
  - Match and insert states have emissions

Generic View of PHMM
- Circles are delete states, diamonds are insert states, rectangles are match states [state diagram not reproduced]
- Also, begin and end states

PHMM Notation
- X = (x_1, x_2, ..., x_n) is the observed (emitted) sequence
- M_i, I_i, D_i are the match, insert, and delete states
- a_{Mi,Mi+1} and the like are state transition probabilities
- e_{Mi}(k) and the like are emission probabilities
- λ denotes the PHMM

PHMM
- Match state probabilities are easily determined from the MSA
  - a_{Mi,Mi+1}: transition probability between match states
  - e_{Mi}(k): emission probability of symbol k at match state i
- Note: there are other transition probabilities
  - For example, a_{Mi,Ii} and a_{Mi,Di+1}
- Emissions occur at all match and insert states
  - Remember, emission == observation

MSA
- First we show MSA construction
  - This is the difficult part
  - Lots of ways to do this
  - The "best" way depends on the specific problem
- Then construct the PHMM from the MSA
  - The easy part
  - Standard algorithm for this
- How to score a sequence?
  - Forward algorithm, similar to the HMM

MSA
- How to construct the MSA?
  - Construct pairwise alignments
  - Combine pairwise alignments to obtain the MSA
- Allow gaps to be inserted
  - Makes better matches
- But gaps tend to weaken scoring
  - So there is a tradeoff

Global vs Local Alignment
- In these pairwise alignment examples [not reproduced]
  - "-" is a gap
  - "|" marks aligned symbols
  - "*" marks omitted beginning and ending symbols

Global vs Local Alignment
- Global alignment is lossless
  - But gaps tend to proliferate
  - And gaps increase when we do the MSA
  - More gaps implies more sequences match
  - So, the result is less useful for scoring
- We usually only consider local alignment
  - That is, omit the ends for a better alignment
- For simplicity, we assume global alignment here

Pairwise Alignment
- We allow gaps when aligning
- How to score an alignment?
  - Based on an n x n substitution matrix S
  - Where n is the number of symbols
- What algorithm(s) to align sequences?
  - Usually, dynamic programming
  - Sometimes, an HMM is used
  - Other?
- Local alignment raises more issues

Pairwise Alignment
- Example [alignment not reproduced]
- Note gaps vs misaligned elements
  - Depends on S and the gap penalty

Substitution Matrix
- Masquerade detection
  - Detect an imposter using an account
- Consider 4 different operations
  - E == send email
  - G == play games
  - C == C programming
  - J == Java programming
- How similar are these to each other?

Substitution Matrix
- Consider the 4 operations: E, G, C, J
- Possible substitution matrix: [matrix not reproduced]
- Diagonal entries are matches
  - High positive scores
- Which others are most similar?
  - J and C, so substituting C for J gets a high score
- Game playing and programming are very different
  - So substituting G for C gets a negative score
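
To make this concrete, here is one way to encode such a matrix in Python; the specific scores below are assumptions, chosen only to match the qualitative description above.

```python
# Hypothetical substitution matrix for operations E, G, C, J.
# Diagonal: strong positive (exact match); C/J: similar, so positive;
# G vs programming: very different, so negative.
S = {
    ("E", "E"): 9, ("G", "G"): 9, ("C", "C"): 9, ("J", "J"): 9,
    ("C", "J"): 7, ("J", "C"): 7,    # C and Java programming are close
    ("E", "G"): -1, ("G", "E"): -1,
    ("E", "C"): -2, ("C", "E"): -2,
    ("E", "J"): -2, ("J", "E"): -2,
    ("G", "C"): -4, ("C", "G"): -4,  # gaming vs programming
    ("G", "J"): -4, ("J", "G"): -4,
}
```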

Substitution Matrix
- Depending on the problem, it might be easy or very difficult to get a useful S matrix
- Consider masquerade detection based on UNIX commands
  - Sometimes difficult to say how "close" 2 commands are
- Suppose we are aligning DNA sequences
  - Then there is a biological rationale for the closeness of symbols

Gap Penalty
- Generally, we must allow gaps to be inserted
- But gaps make an alignment more generic
  - So, less useful for scoring
  - Therefore, we penalize gaps
- How to penalize gaps?
- Linear gap penalty function
  - f(g) = dg (i.e., constant penalty d per gap)
- Affine gap penalty function
  - f(g) = a + e(g - 1)
  - Gap opening penalty a, then constant factor of e per additional gap
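
The two penalty functions as a short sketch; the default parameter values are illustrative, not from the slides.

```python
def linear_gap(g, d=3):
    """Linear penalty f(g) = d*g: constant cost d per gap symbol."""
    return d * g

def affine_gap(g, a=4, e=1):
    """Affine penalty f(g) = a + e*(g - 1): opening cost a for the
    first gap symbol, then a (usually smaller) cost e for each
    additional symbol, so long gaps cost less per symbol."""
    return a + e * (g - 1)
```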

Pairwise Alignment Algorithm
- We use dynamic programming
  - Based on the S matrix and the gap penalty function
- Notation:
  - Align sequences x = (x_1, ..., x_n) and y = (y_1, ..., y_m)
  - F(i, j) is the score of the best alignment of x_1, ..., x_i with y_1, ..., y_j

Pairwise Alignment DP
- Initialization (with linear gap penalty d): F(0, 0) = 0, F(i, 0) = -id, F(0, j) = -jd
- Recursion: F(i, j) is the maximum of
  - F(i-1, j-1) + S(x_i, y_j)  (align x_i with y_j)
  - F(i-1, j) - d  (x_i aligned to a gap)
  - F(i, j-1) - d  (y_j aligned to a gap)
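
A minimal Python sketch of this DP (essentially the Needleman-Wunsch algorithm with a linear gap penalty); it returns only the score, though the usual traceback over F recovers the alignment itself.

```python
def align_score(x, y, S, d):
    """Global pairwise alignment score by dynamic programming, using
    substitution matrix S[(a, b)] and linear gap penalty d, exactly as
    in the initialization and recursion above."""
    n, m = len(x), len(y)
    # F[i][j] = best score aligning x_1..x_i with y_1..y_j
    F = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = F[i - 1][0] - d        # x_i aligned to a gap
    for j in range(1, m + 1):
        F[0][j] = F[0][j - 1] - d        # y_j aligned to a gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(F[i - 1][j - 1] + S[(x[i - 1], y[j - 1])],
                          F[i - 1][j] - d,    # gap in y
                          F[i][j - 1] - d)    # gap in x
    return F[n][m]
```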

MSA from Pairwise Alignments
- Given pairwise alignments...
- ...how to construct the MSA?
- The generic approach is "progressive alignment"
  - Select one pairwise alignment
  - Select another and combine it with the first
  - Continue to add more until all are combined
- Relatively easy (good)
- Gaps may proliferate, and it is unstable (bad)

MSA from Pairwise Alignments
- Lots of ways to improve on generic progressive alignment
  - Here, we mention one such approach
  - Not necessarily the "best" or most popular
- Feng-Doolittle progressive alignment
  - Compute scores for all pairs of the n sequences
  - Select n-1 alignments that a) "connect" all sequences and b) maximize pairwise scores
  - Then generate a minimum spanning tree
  - For the MSA, add sequences in the order that they appear in the spanning tree

MSA Construction
- Create pairwise alignments
  - Generate a substitution matrix
  - Dynamic programming for pairwise alignments
- Use the pairwise alignments to make the MSA
  - Use pairwise alignments to construct a spanning tree, e.g., via Prim's algorithm (sketched below)
  - Add sequences to the MSA in spanning tree order (from the highest score, inserting gaps as needed)
  - Note: the gap penalty is used here
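
A sketch of the spanning-tree step, assuming sequences are numbered 1 through n and `scores` holds one score per unordered pair; the Prim-style greedy growth and all names here are illustrative.

```python
def spanning_tree_order(scores, n):
    """Grow a maximum-score spanning tree over sequences 1..n by a
    Prim-style greedy rule; scores[(i, j)] is the pairwise alignment
    score. Returns edges in the order sequences join the MSA."""
    def s(i, j):
        return scores.get((i, j), scores.get((j, i), float("-inf")))
    best_pair = max(scores, key=scores.get)   # start from best-scoring pair
    in_tree = {best_pair[0]}
    order = []
    while len(in_tree) < n:
        # highest-scoring edge from the current tree to a new sequence
        i, j = max(((i, j) for i in in_tree
                    for j in range(1, n + 1) if j not in in_tree),
                   key=lambda e: s(*e))
        order.append((i, j))
        in_tree.add(j)
    return order
```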

MSA Example
- Suppose 10 sequences, with the following pairwise alignment scores: [score table not reproduced]

MSA Example: Spanning Tree
- Spanning tree based on the scores [tree not reproduced]
- So, process the pairs in the following order: (5,4), (5,8), (8,3), (3,2), (2,7), (2,1), (1,6), (6,10), (10,9)

MSA Snapshot
- Intermediate step and final MSA [not reproduced]
  - Use "+" for a neutral symbol
  - Then "-" for gaps in the MSA
- Note the increase in gaps

PHMM from MSA
- For the PHMM, we must determine the match and insert states & probabilities from the MSA
- "Conservative" columns are match states
  - Half or less of the symbols are gaps
- Other columns are insert states
  - The majority of the symbols are gaps
- Delete states are a separate issue

PHMM States from MSA
- Consider a simpler MSA... [not reproduced]
- Columns 1, 2, 6 are match states 1, 2, 3, respectively
  - Since less than half of the symbols are gaps
- Columns 3, 4, 5 are combined to form insert state 2
  - Since more than half of the symbols are gaps
  - This insert state sits between match states 2 and 3
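
A small sketch of the column rule, assuming the MSA is given as equal-length strings with '-' for gaps; the example MSA in the comment is hypothetical.

```python
def classify_columns(msa):
    """Label each MSA column 'M' (match) if half or less of its symbols
    are gaps, else 'I' (insert); runs of adjacent 'I' columns would then
    merge into a single insert state, as in the example above."""
    n_rows = len(msa)
    labels = []
    for col in zip(*msa):
        gaps = col.count("-")
        labels.append("M" if gaps <= n_rows / 2 else "I")
    return labels

# Hypothetical MSA: columns 3, 4, 5 are mostly gaps
# classify_columns(["AG---T", "AG--CT", "AG-C-T", "AGA--T"])
# returns ['M', 'M', 'I', 'I', 'I', 'M']
```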

PHMM Probabilities from MSA
- Emission probabilities
  - Based on the symbol distribution in the match and insert states
- State transition probabilities
  - Based on the transitions in the MSA

PHMM Probabilities from MSA
- Emission probabilities: count the symbols in each state's column(s), ignoring gaps, and divide by the total
- But 0 probabilities are bad
  - The model "overfits" the data
  - So, use the "add one" rule
  - Add one to each numerator, and add the total (the number of distinct symbols) to each denominator

PHMM Probabilities from MSA
- More emission probabilities: [worked examples not reproduced]
- Again, 0 probabilities are bad
  - The model "overfits" the data
  - Again, use the "add one" rule
  - Add one to each numerator, and add the total to each denominator
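
A sketch of the add-one computation for one match-state emission distribution; the function name and MSA representation are assumptions of this sketch.

```python
def match_emissions(msa, col, alphabet):
    """Emission distribution for the match state at MSA column `col`,
    smoothed by the add-one rule: one is added to each symbol count
    and the alphabet size to the denominator, so no probability is 0."""
    symbols = [row[col] for row in msa if row[col] != "-"]
    total = len(symbols) + len(alphabet)
    return {s: (symbols.count(s) + 1) / total for s in alphabet}

# e.g., with alphabet "ACGT" and a column containing A, A, G:
# P(A) = 3/7, P(G) = 2/7, P(C) = P(T) = 1/7
```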

PHMM Probabilities from MSA
- Transition probabilities: count the transitions out of each state in the MSA, and divide by the total number of transitions from that state
- We look at some examples
  - Note that "-" corresponds to a delete state
- First, consider the begin state [example not reproduced]
- Again, use the add one rule

PHMM Probabilities from MSA
- Transition probabilities, continued
- When there is no information in the MSA, set the probabilities to uniform
- For example, I_1 does not appear in the MSA, so a_{I1,M2} = a_{I1,I1} = a_{I1,D2} = 1/3

PHMM Probabilities from MSA
- Transition probabilities, another example
- What about transitions from state D_1?
- In the MSA, D_1 can only go to M_2, so without smoothing a_{D1,M2} = 1
- Again, use the add one rule: add one to each transition count, and add the number of possible successors (here, 3) to the denominator
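
A sketch of the add-one rule applied to the transition counts out of one state; the usage example assumes a single observed D_1 to M_2 transition, which is illustrative.

```python
def smoothed_transitions(counts):
    """Add-one smoothing for the transitions out of one state: `counts`
    maps each of the 3 possible successors (match, insert, delete) to
    its observed count in the MSA; add one per successor, and add the
    number of successors to the denominator."""
    total = sum(counts.values()) + len(counts)
    return {s: (c + 1) / total for s, c in counts.items()}

# If D1 -> M2 is observed once and nothing else leaves D1:
# smoothed_transitions({"M2": 1, "I1": 0, "D2": 0})
# returns {"M2": 0.5, "I1": 0.25, "D2": 0.25}
```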

PHMM Emission Probabilities
- Emission probabilities for the given MSA [table not reproduced]
  - Using the add-one rule

PHMM Transition Probabilities
- Transition probabilities for the given MSA [table not reproduced]
  - Using the add-one rule

PHMM Summary
- Construct pairwise alignments
  - Usually, use dynamic programming
- Use these to construct the MSA
  - Lots of ways to do this
- Using the MSA, determine the probabilities
  - Emission probabilities
  - State transition probabilities
- In effect, we have trained a PHMM
  - Now what???

PHMM Scoring
- We want to score sequences to see how closely they match the PHMM
- How did we score sequences with the HMM?
  - Forward algorithm
- How do we score sequences with the PHMM?
  - Forward algorithm
- But, the algorithm is a little more complex
  - Due to the more complex state transitions

Forward Algorithm
- Notation
  - Index i runs over the observation symbols, and index j over the columns (states) of the MSA
  - x_i is the ith observation symbol
  - q_{x_i} is the distribution of x_i in the "random model"
  - The base case is F^M_0(0) = 0
  - F^M_j(i) is the score of x_1, ..., x_i up to state j (note that in a PHMM, i and j may not agree)
  - Some states are undefined
  - Undefined states are ignored in the calculation

Forward Algorithm
- Compute P(X|λ) recursively
- Note that F^M_j(i) depends on F^M_{j-1}(i-1), F^I_{j-1}(i-1), and F^D_{j-1}(i-1)
  - And the corresponding state transition probabilities
- In log space, the recursions are

  F^M_j(i) = log(e_{Mj}(x_i) / q_{x_i})
             + log[ a_{Mj-1,Mj} exp(F^M_{j-1}(i-1))
                  + a_{Ij-1,Mj} exp(F^I_{j-1}(i-1))
                  + a_{Dj-1,Mj} exp(F^D_{j-1}(i-1)) ]

  F^I_j(i) = log(e_{Ij}(x_i) / q_{x_i})
             + log[ a_{Mj,Ij} exp(F^M_j(i-1))
                  + a_{Ij,Ij} exp(F^I_j(i-1))
                  + a_{Dj,Ij} exp(F^D_j(i-1)) ]

  F^D_j(i) = log[ a_{Mj-1,Dj} exp(F^M_{j-1}(i))
               + a_{Ij-1,Dj} exp(F^I_{j-1}(i))
               + a_{Dj-1,Dj} exp(F^D_{j-1}(i)) ]
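
A Python sketch of this recursion; the model representation (dicts keyed by state tuples, emission tables eM and eI) is an assumption of this sketch, and boundary handling follows the "undefined states are ignored" convention above.

```python
import math

LOG0 = float("-inf")

def log_sum(pairs):
    """log( sum_k a_k * exp(f_k) ), skipping zero-probability or
    undefined (-inf) terms, i.e., undefined states are ignored."""
    vals = [math.log(a) + f for a, f in pairs if a > 0.0 and f != LOG0]
    if not vals:
        return LOG0
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))

def phmm_forward(x, N, eM, eI, a, q):
    """Forward algorithm for a PHMM with N match states.
    x : observation sequence; eM[j][s], eI[j][s] : (add-one smoothed)
    emission probs at match state j = 1..N and insert state j = 0..N;
    a[(u, v)] : transition probs between states such as ("M", j),
    ("I", j), ("D", j), with ("M", 0) the begin state and ("M", N + 1)
    the end state; q[s] : the background "random model" distribution.
    Returns log P(X | lambda) relative to the random model."""
    n = len(x)
    M = [[LOG0] * (n + 1) for _ in range(N + 1)]
    I = [[LOG0] * (n + 1) for _ in range(N + 1)]
    D = [[LOG0] * (n + 1) for _ in range(N + 1)]
    M[0][0] = 0.0  # base case: begin state, nothing emitted yet
    t = lambda u, v: a.get((u, v), 0.0)
    for i in range(n + 1):
        for j in range(N + 1):
            if j >= 1 and i >= 1:  # match state j emits x[i-1]
                M[j][i] = math.log(eM[j][x[i-1]] / q[x[i-1]]) + log_sum([
                    (t(("M", j-1), ("M", j)), M[j-1][i-1]),
                    (t(("I", j-1), ("M", j)), I[j-1][i-1]),
                    (t(("D", j-1), ("M", j)), D[j-1][i-1])])
            if j >= 1:             # delete state j emits nothing
                D[j][i] = log_sum([
                    (t(("M", j-1), ("D", j)), M[j-1][i]),
                    (t(("I", j-1), ("D", j)), I[j-1][i]),
                    (t(("D", j-1), ("D", j)), D[j-1][i])])
            if i >= 1:             # insert state j emits x[i-1]
                I[j][i] = math.log(eI[j][x[i-1]] / q[x[i-1]]) + log_sum([
                    (t(("M", j), ("I", j)), M[j][i-1]),
                    (t(("I", j), ("I", j)), I[j][i-1]),
                    (t(("D", j), ("I", j)), D[j][i-1])])
    # finally, transition into the end state after all n symbols
    return log_sum([(t(("M", N), ("M", N+1)), M[N][n]),
                    (t(("I", N), ("M", N+1)), I[N][n]),
                    (t(("D", N), ("M", N+1)), D[N][n])])
```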

PHMM
- We will see examples of PHMMs later
- In particular,
  - Malware detection based on opcodes
  - Masquerade detection based on UNIX commands

References
- Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, Durbin et al.
- Masquerade detection using profile hidden Markov models, L. Huang and M. Stamp, to appear in Computers and Security
- Profile hidden Markov models for metamorphic virus detection, S. Attaluri, S. McGhee, and M. Stamp, Journal in Computer Virology, Vol. 5, No. 2, May 2009, pp. 151-169