Introduction to Profile Hidden Markov Models Mark Stamp
Hidden Markov Models
- Here, we assume you know about HMMs
  - If not, see "A revealing introduction to hidden Markov models"
- Executive summary of HMMs:
  - HMM is a machine learning technique
  - Also a discrete hill climb technique
  - Train the model based on an observation sequence
  - Score a given sequence to see how closely it matches the model
  - Efficient algorithms, many useful applications
HMM Notation
- Recall that an HMM is denoted λ = (A, B, π)
- The observation sequence is O
- Notation:
  - T = length of the observation sequence
  - N = number of states in the model
  - M = number of observation symbols
  - A = state transition probability matrix
  - B = observation probability matrix
  - π = initial state distribution
  - O = (O_0, O_1, ..., O_{T-1}) = observation sequence
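As a refresher, scoring with an HMM uses the forward algorithm. The sketch below is a minimal pure-Python illustration; the two-state model parameters are made-up numbers, not from the slides.

```python
def hmm_score(O, A, B, pi):
    """Forward algorithm: return P(O | lambda) for HMM lambda = (A, B, pi).
    A, B are nested lists; O is a list of observation symbol indices."""
    N = len(pi)
    # alpha_0(i) = pi_i * b_i(O_0)
    alpha = [pi[i] * B[i][O[0]] for i in range(N)]
    for t in range(1, len(O)):
        # alpha_t(j) = [ sum_i alpha_{t-1}(i) * a_ij ] * b_j(O_t)
        alpha = [sum(alpha[i] * A[i][j] for i in range(N)) * B[j][O[t]]
                 for j in range(N)]
    return sum(alpha)  # P(O | lambda) = sum_i alpha_{T-1}(i)

# Toy two-state, three-symbol model (hypothetical numbers)
A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.1, 0.4, 0.5], [0.7, 0.2, 0.1]]
pi = [0.6, 0.4]
print(hmm_score([0, 1, 2], A, B, pi))  # 0.03308
```

A higher score means the observation sequence is more consistent with the trained model.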
Hidden Markov Models
- Among the many uses for HMMs:
  - Speech analysis
  - Music search engines
  - Malware detection
  - Intrusion detection systems (IDS)
  - Many more, and more all the time
Limitations of HMMs
- Positional information is not considered
  - An HMM has no "memory"
  - Higher order models have some memory
  - But no explicit use of positional information
- Does not handle insertions or deletions
- These limitations are serious problems in some applications
  - In bioinformatics string comparison, sequence alignment is critical
  - Also, insertions and deletions occur
Profile HMM
- A profile HMM (PHMM) is designed to overcome the limitations on the previous slide
  - In some ways, a PHMM is easier than an HMM
  - In some ways, a PHMM is more complex
- The basic idea of a PHMM:
  - Define multiple B matrices
  - Almost like having an HMM for each position in the sequence
PHMM
- In bioinformatics, begin by aligning multiple related sequences
  - Multiple sequence alignment (MSA)
  - This is like the training phase for an HMM
- Generate the PHMM based on the given MSA
  - Easy, once the MSA is known
  - The hard part is generating the MSA
- Then we can score sequences using the PHMM
  - Use the forward algorithm, like an HMM
Generic View of PHMM
- Circles are Delete states
- Diamonds are Insert states
- Rectangles are Match states
  - Match states correspond to HMM states
- Arrows are possible transitions
  - Each transition has an associated probability
- Transition probabilities form the A matrix
- Emission probabilities form the B matrices
  - In a PHMM, observations are emissions
  - Match and insert states have emissions
Generic View of PHMM
- Circles are Delete states, diamonds are Insert states, rectangles are Match states
- There are also begin and end states
PHMM Notation
- Notation:
  - X = (x_1, x_2, ..., x_T) is the observation (emission) sequence
  - N is the number of match states
  - M_1, ..., M_N, I_0, ..., I_N, and D_1, ..., D_N are the match, insert, and delete states
  - a_{M_i, M_{i+1}} and similar are the state transition probabilities
  - e_{M_i}(k) and similar are the emission probabilities
PHMM
- Match state probabilities are easily determined from the MSA
  - a_{M_i, M_{i+1}}, that is, transitions between match states
  - e_{M_i}(k), the emission probability at match state M_i
- Note: the same holds for other transition probabilities
  - For example, a_{M_i, I_i} and a_{M_i, D_{i+1}}
- Emissions occur at all match and insert states
  - Remember, emission == observation
MSA
- First we show MSA construction
  - This is the difficult part
  - Lots of ways to do this
  - The "best" way depends on the specific problem
- Then construct the PHMM from the MSA
  - The easy part
  - There is a standard algorithm for this
- How to score a sequence?
  - Forward algorithm, similar to HMM
MSA
- How to construct an MSA?
  - Construct pairwise alignments
  - Combine pairwise alignments to obtain the MSA
- Allow gaps to be inserted
  - Makes better matches
- But gaps tend to weaken scoring
  - So there is a tradeoff
Global vs Local Alignment
- In these pairwise alignment examples:
  - "-" is a gap
  - "|" marks aligned symbols
  - "*" marks omitted beginning and ending symbols
Global vs Local Alignment
- Global alignment is lossless
  - But gaps tend to proliferate
  - And gaps increase when we do MSA
  - More gaps implies more sequences match
  - So the result is less useful for scoring
- We usually only consider local alignment
  - That is, omit the ends for better alignment
- For simplicity, we assume global alignment here
Pairwise Alignment
- We allow gaps when aligning
- How to score an alignment?
  - Based on an n x n substitution matrix S
  - Where n is the number of symbols
- What algorithm(s) to align sequences?
  - Usually, dynamic programming
  - Sometimes, an HMM is used
  - Others?
- Local alignment raises more issues
Pairwise Alignment
- Example
- Note gaps vs misaligned elements
  - Depends on S and the gap penalty
Substitution Matrix
- Masquerade detection
  - Detect an imposter using someone else's account
- Consider 4 different operations:
  - E == send email
  - G == play games
  - C == C programming
  - J == Java programming
- How similar are these to each other?
Substitution Matrix
- Consider 4 different operations: E, G, C, J
- Possible substitution matrix:
- Diagonal entries are matches
  - High positive scores
- Which operations are most similar?
  - J and C, so substituting C for J gets a high score
- Game playing and programming are very different
  - So substituting G for C gets a negative score
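The substitution matrix itself can be sketched as follows. The actual values from the slide are not reproduced here; these are hypothetical scores that merely follow the stated pattern (high diagonal, C/J similar, G vs. programming negative).

```python
# Hypothetical substitution matrix for the operations E, G, C, J
# (illustrative values only, not the ones on the slide)
S = {
    'E': {'E': 9, 'G': -2, 'C': -3, 'J': -3},
    'G': {'E': -2, 'G': 9, 'C': -4, 'J': -4},
    'C': {'E': -3, 'G': -4, 'C': 9, 'J': 5},
    'J': {'E': -3, 'G': -4, 'C': 5, 'J': 9},
}

def alignment_score(x, y, S, d=3):
    """Score a gapped pairwise alignment (x and y already aligned,
    equal length): use S for symbol pairs, and a linear penalty d
    for every position containing a gap symbol '-'."""
    score = 0
    for a, b in zip(x, y):
        score += -d if '-' in (a, b) else S[a][b]
    return score

print(alignment_score("EGC-J", "EG-CJ", S))  # 9 + 9 - 3 - 3 + 9 = 21
```

Note how the two gap columns cost less than the mismatch S['C']['G'] would, so the aligner prefers gaps here.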
Substitution Matrix
- Depending on the problem, it might be easy or very difficult to get a useful S matrix
- Consider masquerade detection based on UNIX commands
  - Sometimes it is difficult to say how "close" two commands are
- Suppose we are aligning DNA sequences
  - There is a biological rationale for the closeness of symbols
Gap Penalty
- Generally, we must allow gaps to be inserted
- But gaps make the alignment more generic
  - So, less useful for scoring
  - Therefore, we penalize gaps
- How to penalize gaps?
- Linear gap penalty function:
  - f(g) = dg (i.e., a constant penalty d per gap)
- Affine gap penalty function:
  - f(g) = a + e(g - 1)
  - Gap opening penalty a, then a constant factor of e for each additional gap
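The two penalty functions above are one-liners; the parameter defaults below are arbitrary illustrative choices.

```python
def linear_gap_penalty(g, d=3):
    """Linear: f(g) = d*g, a constant penalty d per gap symbol."""
    return d * g

def affine_gap_penalty(g, a=5, e=1):
    """Affine: f(g) = a + e*(g - 1), an opening penalty a for the
    first gap, then e for each additional gap in the same run."""
    return a + e * (g - 1)

# For a run of 4 consecutive gaps, the affine penalty is milder:
print(linear_gap_penalty(4))  # 12
print(affine_gap_penalty(4))  # 8
```

The affine form penalizes opening a gap more than extending one, so it favors a few long gap runs over many scattered gaps.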
Pairwise Alignment Algorithm
- We use dynamic programming
  - Based on the S matrix and the gap penalty function
- Notation:
Pairwise Alignment DP
- Initialization: F(0, 0) = 0, F(i, 0) = -id, F(0, j) = -jd, where d is the (linear) gap penalty
- Recursion:
  F(i, j) = max{ F(i-1, j-1) + S(x_i, y_j), F(i-1, j) - d, F(i, j-1) - d }
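The DP above is the standard Needleman-Wunsch global alignment recursion with a linear gap penalty; a minimal sketch (returning only the optimal score, not the alignment itself):

```python
def align(x, y, S, d):
    """Global pairwise alignment score by dynamic programming:
    F[i][j] = max(F[i-1][j-1] + S[x_i][y_j],   # align x_i with y_j
                  F[i-1][j] - d,               # gap in y
                  F[i][j-1] - d)               # gap in x
    """
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = -i * d          # initialization: leading gaps
    for j in range(1, m + 1):
        F[0][j] = -j * d
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(F[i - 1][j - 1] + S[x[i - 1]][y[j - 1]],
                          F[i - 1][j] - d,
                          F[i][j - 1] - d)
    return F[n][m]

# Toy substitution matrix: +2 for a match, -1 for a mismatch
syms = "EGCJ"
S = {a: {b: (2 if a == b else -1) for b in syms} for a in syms}
print(align("EGCJ", "EGJ", S, d=1))  # best alignment EGCJ / EG-J scores 5
```

Tracing back through F (not shown) would recover the alignment itself, not just its score.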
MSA from Pairwise Alignments
- Given pairwise alignments...
- ...how to construct an MSA?
- The generic approach is "progressive alignment"
  - Select one pairwise alignment
  - Select another and combine it with the first
  - Continue to add more until all are combined
- Relatively easy (good)
- Gaps may proliferate, and it can be unstable (bad)
MSA from Pairwise Alignments
- Lots of ways to improve on generic progressive alignment
  - Here, we mention one such approach
  - Not necessarily the "best" or most popular
- Feng-Doolittle progressive alignment:
  - Compute scores for all pairs of n sequences
  - Select n-1 alignments that a) "connect" all sequences and b) maximize pairwise scores
  - Then generate a minimum spanning tree
  - For the MSA, add sequences in the order they appear in the spanning tree
MSA Construction
- Create pairwise alignments
  - Generate a substitution matrix
  - Dynamic programming for pairwise alignments
- Use pairwise alignments to make the MSA
  - Use pairwise alignments to construct a spanning tree (e.g., Prim's algorithm)
  - Add sequences to the MSA in spanning tree order (from highest score, inserting gaps as needed)
  - Note: the gap penalty is used
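The spanning-tree step can be sketched with a Prim-style greedy pass: repeatedly attach the unvisited sequence with the highest pairwise score to any already-visited one. The sequence labels and scores below are hypothetical, not the slides' 10-sequence example.

```python
def msa_order(seqs, score):
    """Prim-style spanning tree over pairwise alignment scores.
    score[(a, b)] (with a < b) is the pairwise score; returns the
    edges in the order sequences would be added to the MSA."""
    def s(a, b):
        return score[(min(a, b), max(a, b))]
    in_tree = [seqs[0]]
    remaining = set(seqs[1:])
    edges = []
    while remaining:
        # greedily pick the highest-scoring edge leaving the tree
        a, b = max(((u, v) for u in in_tree for v in remaining),
                   key=lambda e: s(*e))
        edges.append((a, b))
        in_tree.append(b)
        remaining.remove(b)
    return edges

score = {(1, 2): 85, (1, 3): 63, (1, 4): 55,
         (2, 3): 91, (2, 4): 70, (3, 4): 40}
print(msa_order([1, 2, 3, 4], score))  # [(1, 2), (2, 3), (2, 4)]
```

Since we maximize scores rather than minimize distances, this is a maximum spanning tree on the score graph, which is the same thing once scores are converted to distances.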
MSA Example
- Suppose we have 10 sequences, with the following pairwise alignment scores:
MSA Example: Spanning Tree
- Spanning tree based on the scores
- So we process pairs in the following order: (5, 4), (5, 8), (8, 3), (3, 2), (2, 7), (2, 1), (1, 6), (6, 10), (10, 9)
MSA Snapshot
- Intermediate step and final result
  - Use "+" for a neutral symbol
  - Then "-" for gaps in the MSA
- Note the increase in gaps
PHMM from MSA
- For the PHMM, we must determine the match and insert states and their probabilities from the MSA
- "Conservative" columns are match states
  - Half or less of the symbols are gaps
- Other columns are insert states
  - A majority of the symbols are gaps
- Delete states are a separate issue
PHMM States from MSA
- Consider a simpler MSA...
- Columns 1, 2, 6 are match states 1, 2, 3, respectively
  - Since less than half of their symbols are gaps
- Columns 3, 4, 5 are combined to form insert state 2
  - Since more than half of their symbols are gaps
  - The insert state lies between match states 2 and 3
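The column classification rule can be sketched directly. The small MSA below is hypothetical, shaped so that (like the slide's example) columns 1, 2, 6 become match states and columns 3-5 merge into an insert state.

```python
def classify_columns(msa):
    """Mark each MSA column as a match state ('M') when half or
    fewer of its symbols are gaps, otherwise an insert state ('I').
    Runs of consecutive 'I' columns merge into one insert state."""
    ncols = len(msa[0])
    states = []
    for j in range(ncols):
        col = [row[j] for row in msa]
        gaps = col.count('-')
        states.append('M' if 2 * gaps <= len(col) else 'I')
    return states

msa = ["EG-C-J",
       "EGC--J",
       "EG---J",
       "EG--GJ"]
print(classify_columns(msa))  # ['M', 'M', 'I', 'I', 'I', 'M']
```

Here the three 'M' columns become match states 1, 2, 3, and the 'I' run between them forms insert state 2.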
PHMM Probabilities from MSA
- Emission probabilities
  - Based on the symbol distribution in the match and insert states
- State transition probabilities
  - Based on the transitions in the MSA
PHMM Probabilities from MSA
- Emission probabilities:
- But 0 probabilities are bad
  - The model "overfits" the data
  - So, use the "add one" rule
  - Add one to each numerator, and add the total to each denominator
PHMM Probabilities from MSA
- More emission probabilities:
- Again, 0 probabilities are bad
  - The model "overfits" the data
  - Again, use the "add one" rule
  - Add one to each numerator, and add the total to each denominator
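The add-one rule for a match state's emission distribution can be sketched as follows; the example column and alphabet are hypothetical.

```python
def emission_probs(column, alphabet):
    """Emission distribution for one MSA column, with the add-one
    rule: add 1 to each symbol's count and |alphabet| to the
    denominator, so no symbol ends up with probability zero."""
    symbols = [c for c in column if c != '-']   # gaps do not emit
    total = len(symbols) + len(alphabet)
    return {a: (symbols.count(a) + 1) / total for a in alphabet}

# Column "EEEG" over alphabet {E, G, C, J}:
# raw counts E:3, G:1, C:0, J:0 -> add-one gives 4/8, 2/8, 1/8, 1/8
print(emission_probs("EEEG", "EGCJ"))
```

Without the add-one adjustment, C and J would get probability 0 and any sequence emitting them would score 0 against the model.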
PHMM Probabilities from MSA
- Transition probabilities:
- We look at some examples
  - Note that "-" is the delete state
- First, consider the begin state:
- Again, use the add one rule
PHMM Probabilities from MSA
- Transition probabilities
- When there is no information in the MSA, set the probabilities to uniform
- For example, I_1 does not appear in the MSA, so
  a_{I_1, M_2} = a_{I_1, I_1} = a_{I_1, D_2} = 1/3
PHMM Probabilities from MSA
- Transition probabilities, another example
- What about transitions from state D_1?
- It can only go to M_2, so a_{D_1, M_2} = 1
- Again, use the add one rule:
  a_{D_1, M_2} = 2/4, a_{D_1, I_1} = 1/4, a_{D_1, D_2} = 1/4
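The same add-one smoothing applies to transitions out of a state. A minimal sketch, using the D_1 example above (one observed transition to M_2, three possible successors):

```python
def transition_probs(counts):
    """Add-one rule for transitions out of one state: counts maps
    each possible successor state to how many times that transition
    is observed in the MSA."""
    total = sum(counts.values()) + len(counts)
    return {s: (c + 1) / total for s, c in counts.items()}

# D1 is observed going to M2 once, and nowhere else:
# add-one gives (1+1)/4, (0+1)/4, (0+1)/4
print(transition_probs({'M2': 1, 'I1': 0, 'D2': 0}))
```

Note that when every count is zero (a state that never appears in the MSA, like I_1 above), this formula automatically yields the uniform distribution.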
PHMM Emission Probabilities
- Emission probabilities for the given MSA
  - Using the add-one rule
PHMM Transition Probabilities
- Transition probabilities for the given MSA
  - Using the add-one rule
PHMM Summary
- Construct pairwise alignments
  - Usually, use dynamic programming
- Use these to construct the MSA
  - Lots of ways to do this
- Using the MSA, determine the probabilities
  - Emission probabilities
  - State transition probabilities
- In effect, we have trained a PHMM
  - Now what???
PHMM Scoring
- We want to score sequences to see how closely they match the PHMM
- How did we score sequences with an HMM?
  - Forward algorithm
- How do we score sequences with a PHMM?
  - Forward algorithm
- But the PHMM version is a little more complex
  - Due to the more complex state transitions
Forward Algorithm
- Notation:
  - Indices i and j are columns in the MSA
  - x_i is the ith observation symbol
  - q_{x_i} is the distribution of x_i in the "random model"
  - The base case is F_0(0) = 0
  - F_j(i) is the score of x_1, ..., x_i up to state j (note that in a PHMM, i and j may not agree)
  - Some states are undefined
  - Undefined states are ignored in the calculation
Forward Algorithm
- Compute P(X | λ) recursively
- Note that each match, insert, and delete value depends on the previous match, insert, and delete values
  - And on the corresponding state transition probabilities
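A probability-space sketch of the PHMM forward recursion (after Durbin et al.): match and insert states consume one observation symbol, delete states consume none, and each value combines the three predecessor states with their transition probabilities. The slides' version works in log-odds form, dividing each emission probability by the background probability q_{x_i}; that refinement is omitted here, and the toy model at the bottom is entirely hypothetical.

```python
def phmm_forward(x, N, e_M, e_I, a):
    """P(X | lambda) for a PHMM with N match states.
    e_M[j][c], e_I[j][c]: emission probs at match/insert state j.
    a[(s, t)]: transition prob from state s to t, states named
    'M0' (begin), 'M1'..'MN', 'I0'..'IN', 'D1'..'DN', 'E' (end).
    Missing transitions are treated as probability 0."""
    T = len(x)
    get = lambda s, t: a.get((s, t), 0.0)
    FM = [[0.0] * (T + 1) for _ in range(N + 1)]
    FI = [[0.0] * (T + 1) for _ in range(N + 1)]
    FD = [[0.0] * (T + 1) for _ in range(N + 1)]
    FM[0][0] = 1.0                        # base case: begin state M0
    for i in range(T + 1):
        for j in range(N + 1):
            if i > 0 and j > 0:           # match state: emits x[i-1]
                FM[j][i] = e_M[j].get(x[i - 1], 0.0) * (
                    FM[j - 1][i - 1] * get(f"M{j-1}", f"M{j}") +
                    FI[j - 1][i - 1] * get(f"I{j-1}", f"M{j}") +
                    FD[j - 1][i - 1] * get(f"D{j-1}", f"M{j}"))
            if i > 0:                     # insert state: emits x[i-1]
                FI[j][i] = e_I[j].get(x[i - 1], 0.0) * (
                    FM[j][i - 1] * get(f"M{j}", f"I{j}") +
                    FI[j][i - 1] * get(f"I{j}", f"I{j}") +
                    FD[j][i - 1] * get(f"D{j}", f"I{j}"))
            if j > 0:                     # delete state: emits nothing
                FD[j][i] = (
                    FM[j - 1][i] * get(f"M{j-1}", f"D{j}") +
                    FI[j - 1][i] * get(f"I{j-1}", f"D{j}") +
                    FD[j - 1][i] * get(f"D{j-1}", f"D{j}"))
    # finish by exiting through the end state
    return (FM[N][T] * get(f"M{N}", "E") +
            FI[N][T] * get(f"I{N}", "E") +
            FD[N][T] * get(f"D{N}", "E"))

# Toy 2-match-state model where the only path is M0 -> M1 -> M2 -> E
N = 2
e_M = {1: {'A': 0.9, 'B': 0.1}, 2: {'A': 0.2, 'B': 0.8}}
e_I = {j: {} for j in range(N + 1)}
a = {('M0', 'M1'): 1.0, ('M1', 'M2'): 1.0, ('M2', 'E'): 1.0}
print(phmm_forward("AB", N, e_M, e_I, a))  # 0.9 * 0.8 = 0.72
```

For real sequence lengths this should be done in log space to avoid underflow, which is exactly why the slides' version carries logs and the q_{x_i} background terms.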
PHMM
- We will see examples of PHMMs later
- In particular:
  - Malware detection based on opcodes
  - Masquerade detection based on UNIX commands
References
- R. Durbin, S. Eddy, A. Krogh, and G. Mitchison, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids
- L. Huang and M. Stamp, Masquerade detection using profile hidden Markov models, to appear in Computers and Security
- S. Attaluri, S. McGhee, and M. Stamp, Profile hidden Markov models for metamorphic virus detection, Journal in Computer Virology, Vol. 5, No. 2, May 2009, pp. 151-169