Hidden Markov Models (HMM) Hassanin M. Al-Barhamtoshy hassanin@kau.edu.sa
Agenda • Markov Models • Description: What is an HMM • HMM Applications • Demonstration
Markov Models • Set of states: Q = {q1, q2, …, qN} • Process moves from one state to another, generating a sequence of states: si1, si2, …, sik, … • Markov chain property: the probability of each subsequent state depends only on the previous state: P(sik | si1, …, sik-1) = P(sik | sik-1) • To define a Markov model, the following probabilities have to be specified: transition probabilities aij = P(si | sj) and initial probabilities πi = P(si)
Example 1 of Markov Model [State diagram: Rain ↔ Dry; self-loops 0.3 on Rain and 0.8 on Dry; Rain→Dry 0.7, Dry→Rain 0.2] • Two states: ‘Rain’ and ‘Dry’. • Transition probabilities: P(‘Rain’|‘Rain’) = 0.3, P(‘Dry’|‘Rain’) = 0.7, P(‘Rain’|‘Dry’) = 0.2, P(‘Dry’|‘Dry’) = 0.8 • Initial probabilities: say P(‘Rain’) = 0.4, P(‘Dry’) = 0.6.
Calculation of Sequence Probability • By the Markov chain property, the probability of a state sequence can be found by the formula: P(si1, si2, …, sik) = P(sik | sik-1) … P(si2 | si1) P(si1) • Suppose we want to calculate the probability of the sequence of states {‘Dry’, ‘Rain’, ‘Rain’} in our example: P({‘Dry’, ‘Rain’, ‘Rain’}) = P(‘Rain’|‘Rain’) P(‘Rain’|‘Dry’) P(‘Dry’) = 0.3 × 0.2 × 0.6 = 0.036
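The chain-rule computation above can be sketched in a few lines of Python; the numbers are exactly those of Example 1 (the function name `sequence_probability` is illustrative, not from the slides):

```python
# Probability of a state sequence in the Rain/Dry Markov chain (Example 1).
# By the Markov property, P(s1, ..., sk) = P(s1) * prod P(s_t | s_{t-1}).
initial = {'Rain': 0.4, 'Dry': 0.6}
trans = {('Rain', 'Rain'): 0.3, ('Rain', 'Dry'): 0.7,
         ('Dry', 'Rain'): 0.2, ('Dry', 'Dry'): 0.8}

def sequence_probability(states):
    """Chain rule under the Markov property."""
    p = initial[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= trans[(prev, cur)]
    return p

print(sequence_probability(['Dry', 'Rain', 'Rain']))  # 0.6 * 0.2 * 0.3
```

Note that only pairwise transition lookups are needed, which is exactly what the Markov chain property buys us.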
What is an HMM? • Graphical Model • Circles indicate states • Arrows indicate probabilistic dependencies between states
What is an HMM? • Green circles are hidden states • Dependent only on the previous state
What is an HMM? • Purple nodes are observed states • Dependent only on their corresponding hidden state
HMM Formalism • {S, K, Π, A, B} • S : {s1 … sN} are the values for the hidden states • K : {k1 … kM} are the values for the observations
HMM Formalism • {S, K, Π, A, B} • Π = {πi} are the initial state probabilities • A = {aij} are the state transition probabilities • B = {bik} are the observation state probabilities
Inference in an HMM • Compute the probability of a given observation sequence • Given an observation sequence, compute the most likely hidden state sequence • Given an observation sequence and set of possible models, which model most closely fits the data?
Decoding • Given an observation sequence and a model, compute the probability of the observation sequence. [Trellis: hidden states x1 … xT over observations o1 … oT]
Forward Procedure • Special structure gives us an efficient solution using dynamic programming. • Intuition: the probability of the first t observations is the same for all possible t+1 length state sequences. • Define: αi(t) = P(o1 … ot, xt = si)
Backward Procedure • Define: βi(t) = P(ot+1 … oT | xt = si), the probability of the rest of the observations given the state at time t.
Decoding Solution • Forward procedure: P(O | μ) = Σi αi(T) • Backward procedure: P(O | μ) = Σi πi bi(o1) βi(1) • Combination: P(O | μ) = Σi αi(t) βi(t), for any t
Best State Sequence • Find the state sequence that best explains the observations • Viterbi algorithm
Viterbi Algorithm • Define: δj(t) = max over x1 … xt-1 of P(x1 … xt-1, o1 … ot-1, xt = j, ot) • The state sequence which maximizes the probability of seeing the observations to time t-1, landing in state j, and seeing the observation at time t
Viterbi Algorithm • Recursive computation: δj(t+1) = maxi δi(t) aij bj(ot+1) • ψj(t+1) = argmaxi δi(t) aij bj(ot+1) (backpointer to the best predecessor)
Viterbi Algorithm • Compute the most likely state sequence by working backwards: X̂T = argmaxi δi(T), then X̂t = ψX̂t+1(t+1)
Parameter Estimation • Given an observation sequence, find the model that is most likely to produce that sequence. • No analytic method. • Given a model and an observation sequence, iteratively update the model parameters to better fit the observations.
Parameter Estimation • Probability of traversing an arc at time t: pt(i, j) = αi(t) aij bj(ot+1) βj(t+1) / P(O | μ) • Probability of being in state i at time t: γi(t) = Σj pt(i, j)
Parameter Estimation • Now we can compute the new estimates of the model parameters: âij = Σt pt(i, j) / Σt γi(t), b̂ik = Σ{t : ot = k} γi(t) / Σt γi(t), π̂i = γi(1)
HMM Applications • Generating parameters for n-gram models • Part-of-speech tagging • Speech recognition
The Most Important Thing • We can use the special structure of this model to do a lot of neat math and solve problems that are otherwise not solvable.
Remember: Hidden Markov Models • Set of states: Q = {q1, q2, …, qN} • Process moves from one state to another, generating a sequence of states • Markov chain property: the probability of each subsequent state depends only on the previous state • States are not visible, but each state randomly generates one of M observations (or visible states) • To define a hidden Markov model, the following probabilities have to be specified: the matrix of transition probabilities A = (aij), aij = P(si | sj); the matrix of observation probabilities B = (bi(vm)), bi(vm) = P(vm | si); and a vector of initial probabilities π = (πi), πi = P(si). The model is represented by M = (A, B, π).
Example of Hidden Markov Model [Diagram: hidden states Low ↔ High (self-loops 0.3 and 0.8; Low→High 0.7, High→Low 0.2); emissions Low→Rain 0.6, Low→Dry 0.4, High→Rain 0.4, High→Dry 0.6]
Example of Hidden Markov Model • Two states: ‘Low’ and ‘High’ atmospheric pressure. • Two observations: ‘Rain’ and ‘Dry’. • Transition probabilities: P(‘Low’|‘Low’) = 0.3, P(‘High’|‘Low’) = 0.7, P(‘Low’|‘High’) = 0.2, P(‘High’|‘High’) = 0.8 • Observation probabilities: P(‘Rain’|‘Low’) = 0.6, P(‘Dry’|‘Low’) = 0.4, P(‘Rain’|‘High’) = 0.4, P(‘Dry’|‘High’) = 0.6 (each state’s observation probabilities must sum to 1). • Initial probabilities: say P(‘Low’) = 0.4, P(‘High’) = 0.6.
Calculation of Observation Sequence Probability • Suppose we want to calculate the probability of a sequence of observations in our example, {‘Dry’, ‘Rain’}. • Consider all possible hidden state sequences: P({‘Dry’, ‘Rain’}) = P({‘Dry’, ‘Rain’}, {‘Low’, ‘Low’}) + P({‘Dry’, ‘Rain’}, {‘Low’, ‘High’}) + P({‘Dry’, ‘Rain’}, {‘High’, ‘Low’}) + P({‘Dry’, ‘Rain’}, {‘High’, ‘High’}) • where the first term is: P({‘Dry’, ‘Rain’}, {‘Low’, ‘Low’}) = P({‘Dry’, ‘Rain’} | {‘Low’, ‘Low’}) P({‘Low’, ‘Low’}) = P(‘Dry’|‘Low’) P(‘Rain’|‘Low’) P(‘Low’) P(‘Low’|‘Low’) = 0.4 × 0.6 × 0.4 × 0.3
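The sum over all hidden state sequences can be checked by brute force. A minimal sketch, assuming P(‘Dry’|‘High’) = 0.6 so that each state's emission probabilities sum to 1 (the slide's 0.3 appears to be a typo):

```python
from itertools import product

# Brute-force P(O) for the Low/High pressure HMM of the weather example.
# Assumption: P('Dry'|'High') = 0.6, so each emission row sums to 1.
states = ['Low', 'High']
pi = {'Low': 0.4, 'High': 0.6}
A = {('Low', 'Low'): 0.3, ('Low', 'High'): 0.7,
     ('High', 'Low'): 0.2, ('High', 'High'): 0.8}
B = {('Low', 'Rain'): 0.6, ('Low', 'Dry'): 0.4,
     ('High', 'Rain'): 0.4, ('High', 'Dry'): 0.6}

def observation_probability(obs):
    """Sum P(O, Q) over every hidden state sequence Q (exponential in len(obs))."""
    total = 0.0
    for path in product(states, repeat=len(obs)):
        p = pi[path[0]] * B[(path[0], obs[0])]
        for t in range(1, len(obs)):
            p *= A[(path[t - 1], path[t])] * B[(path[t], obs[t])]
        total += p
    return total

print(observation_probability(['Dry', 'Rain']))
```

This enumerates the same four terms as the slide; for longer sequences the forward procedure below replaces this exponential sum.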
Main Issues Using HMMs • Evaluation problem. Given the HMM M = (A, B, π) and the observation sequence O = o1 o2 … oK, calculate the probability that model M has generated sequence O. • Decoding problem. Given the HMM M = (A, B, π) and the observation sequence O = o1 o2 … oK, calculate the most likely sequence of hidden states si that produced this observation sequence O. • Learning problem. Given some training observation sequences O = o1 o2 … oK and the general structure of the HMM (numbers of hidden and visible states), determine HMM parameters M = (A, B, π) that best fit the training data. • O = o1 … oK denotes a sequence of observations, ok ∈ {v1, …, vM}.
Word Recognition Example (1) • Typed word recognition; assume all characters are separated. • The character recognizer outputs the probability of the image being a particular character, P(image | character). [Recognizer scores for one image: ‘a’ 0.5, ‘b’ 0.03, ‘c’ 0.005, …, ‘z’ 0.31] • Hidden state → Observation
Word Recognition Example (2) • Hidden states of the HMM = characters. • Observations = typed images of characters segmented from the image; note that there is an infinite number of possible observations. • Observation probabilities = character recognizer scores. • Transition probabilities will be defined differently in the two subsequent models.
Word Recognition Example (3) • If a lexicon is given, we can construct a separate HMM model for each lexicon word. [Left-to-right HMMs: ‘Amherst’ = a→m→h→e→r→s→t, ‘Buffalo’ = b→u→f→f→a→l→o] • Here recognition of a word image is equivalent to the problem of evaluating a few HMM models. • This is an application of the Evaluation problem.
Word Recognition Example (4) • We can construct a single HMM for all words. • Hidden states = all characters in the alphabet. • Transition probabilities and initial probabilities are calculated from a language model. • Observations and observation probabilities are as before. [Diagram: interconnected character states a, m, f, r, t, o, b, h, e, s, v] • Here we have to determine the best sequence of hidden states, the one that most likely produced the word image. • This is an application of the Decoding problem.
Character Recognition with HMM Example • The structure of hidden states is chosen. • Observations are feature vectors extracted from vertical slices. • Probabilistic mapping from hidden state to feature vectors: 1. use a mixture of Gaussian models, or 2. quantize the feature vector space.
Exercise: Character Recognition with HMM (1) • The structure of hidden states: s1 → s2 → s3 (left-to-right). • Observation = number of islands in the vertical slice. • HMM for character ‘A’: transition probabilities {aij} = [.8 .2 0; 0 .8 .2; 0 0 1], observation probabilities {bjk} = [.9 .1 0; .1 .8 .1; .9 .1 0] • HMM for character ‘B’: transition probabilities {aij} = [.8 .2 0; 0 .8 .2; 0 0 1], observation probabilities {bjk} = [.9 .1 0; 0 .2 .8; .6 .4 0]
Exercise: Character Recognition with HMM (2) • Suppose that after character image segmentation the following sequence of island numbers in 4 slices was observed: {1, 3, 2, 1} • Which HMM is more likely to generate this observation sequence, the HMM for ‘A’ or the HMM for ‘B’?
Exercise: Character Recognition with HMM (3) Consider the likelihood of generating the given observation sequence for each possible sequence of hidden states: • HMM for character ‘A’: – s1 s1 s2 s3: transitions .8 × .2 × .2, observations .9 × 0 × .8 × .9 = 0 – s1 s2 s2 s3: transitions .2 × .8 × .2, observations .9 × .1 × .8 × .9 = 0.0020736 – s1 s2 s3 s3: transitions .2 × .2 × 1, observations .9 × .1 × .1 × .9 = 0.000324 – Total = 0.0023976 • HMM for character ‘B’: – s1 s1 s2 s3: transitions .8 × .2 × .2, observations .9 × 0 × .2 × .6 = 0 – s1 s2 s2 s3: transitions .2 × .8 × .2, observations .9 × .8 × .2 × .6 = 0.0027648 – s1 s2 s3 s3: transitions .2 × .2 × 1, observations .9 × .8 × .4 × .6 = 0.006912 – Total = 0.0096768 • The HMM for ‘B’ is more likely to have generated the observation sequence.
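The exercise totals can be verified by enumerating hidden paths. A sketch, assuming (as the slide's table implies for this left-to-right model) that paths start in s1 and must end in the final state s3:

```python
import itertools

# Brute-force check of the exercise: which character HMM better explains
# the island-count sequence {1, 3, 2, 1}? Matrices come from the slides.
# Assumption: only paths starting in s1 and ending in s3 are counted.
A_trans = [[.8, .2, 0], [0, .8, .2], [0, 0, 1]]     # shared transitions
A_obs = [[.9, .1, 0], [.1, .8, .1], [.9, .1, 0]]    # emissions for 'A'
B_obs = [[.9, .1, 0], [0, .2, .8], [.6, .4, 0]]     # emissions for 'B'

def likelihood(trans, obs, observations):
    """Sum P(O, Q) over hidden paths from s1 to s3."""
    total = 0.0
    n = len(trans)
    for path in itertools.product(range(n), repeat=len(observations)):
        if path[0] != 0 or path[-1] != n - 1:
            continue
        p = obs[path[0]][observations[0] - 1]   # island count k -> column k-1
        for t in range(1, len(path)):
            p *= trans[path[t - 1]][path[t]] * obs[path[t]][observations[t] - 1]
        total += p
    return total

seq = [1, 3, 2, 1]
print(likelihood(A_trans, A_obs, seq))  # total for 'A'
print(likelihood(A_trans, B_obs, seq))  # total for 'B'
```

Paths not listed on the slide contribute zero probability (e.g. any path that stays in s1 when a 3-island slice is observed), so the two totals match the table.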
Evaluation Problem • Evaluation problem. Given the HMM M = (A, B, π) and the observation sequence O = o1 o2 … oK, calculate the probability that model M has generated sequence O. • Trying to find the probability of observations O = o1 o2 … oK by considering all hidden state sequences (as was done in the example) is impractical: N^K hidden state sequences give exponential complexity. • Use the Forward-Backward HMM algorithms for efficient calculation. • Define the forward variable αk(i) as the joint probability of the partial observation sequence o1 o2 … ok and that the hidden state at time k is si: αk(i) = P(o1 o2 … ok, qk = si)
Lattice (Trellis) Representation of an HMM [Diagram: states s1, s2, …, sN stacked in columns for times 1 … k, k+1 … K; observations o1 … ok ok+1 … oK along the top; arcs a1j, a2j, …, aij, …, aNj enter state sj at time k+1]
Forward Recursion for HMM • Initialization: α1(i) = P(o1, q1 = si) = πi bi(o1), 1 <= i <= N. • Forward recursion: αk+1(j) = P(o1 o2 … ok+1, qk+1 = sj) = Σi P(o1 o2 … ok+1, qk = si, qk+1 = sj) = Σi P(o1 o2 … ok, qk = si) aij bj(ok+1) = [Σi αk(i) aij] bj(ok+1), 1 <= j <= N, 1 <= k <= K-1. • Termination: P(o1 o2 … oK) = Σi P(o1 o2 … oK, qK = si) = Σi αK(i) • Complexity: N²K operations.
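The forward recursion can be sketched directly from the three steps above, using the Low/High weather HMM from the earlier example (assuming P(‘Dry’|‘High’) = 0.6 so emission rows sum to 1; the slide's 0.3 looks like a typo):

```python
# Forward procedure for the Low/High weather HMM: O(N^2 K) instead of
# enumerating N^K hidden paths.
states = ['Low', 'High']
pi = {'Low': 0.4, 'High': 0.6}
A = {'Low': {'Low': 0.3, 'High': 0.7}, 'High': {'Low': 0.2, 'High': 0.8}}
B = {'Low': {'Rain': 0.6, 'Dry': 0.4}, 'High': {'Rain': 0.4, 'Dry': 0.6}}

def forward(obs):
    # Initialization: alpha_1(i) = pi_i * b_i(o1)
    alpha = {s: pi[s] * B[s][obs[0]] for s in states}
    # Recursion: alpha_{k+1}(j) = [sum_i alpha_k(i) * a_ij] * b_j(o_{k+1})
    for o in obs[1:]:
        alpha = {j: sum(alpha[i] * A[i][j] for i in states) * B[j][o]
                 for j in states}
    # Termination: P(O) = sum_i alpha_K(i)
    return sum(alpha.values())

print(forward(['Dry', 'Rain']))
```

For the two-observation sequence {‘Dry’, ‘Rain’} this reproduces the brute-force sum over all four hidden state sequences.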
Backward Recursion for HMM • Define the backward variable βk(i) as the probability of the partial observation sequence ok+1 ok+2 … oK given that the hidden state at time k is si: βk(i) = P(ok+1 ok+2 … oK | qk = si) • Initialization: βK(i) = 1, 1 <= i <= N. • Backward recursion: βk(j) = P(ok+1 ok+2 … oK | qk = sj) = Σi P(ok+1 ok+2 … oK, qk+1 = si | qk = sj) = Σi P(ok+2 ok+3 … oK | qk+1 = si) aji bi(ok+1) = Σi βk+1(i) aji bi(ok+1), 1 <= j <= N, 1 <= k <= K-1. • Termination: P(o1 o2 … oK) = Σi P(o1 o2 … oK, q1 = si) = Σi P(o1 o2 … oK | q1 = si) P(q1 = si) = Σi β1(i) bi(o1) πi
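The backward procedure, sketched for the same weather HMM (same assumption that P(‘Dry’|‘High’) = 0.6), must terminate with the same P(O) as the forward procedure:

```python
# Backward procedure for the Low/High weather HMM.
states = ['Low', 'High']
pi = {'Low': 0.4, 'High': 0.6}
A = {'Low': {'Low': 0.3, 'High': 0.7}, 'High': {'Low': 0.2, 'High': 0.8}}
B = {'Low': {'Rain': 0.6, 'Dry': 0.4}, 'High': {'Rain': 0.4, 'Dry': 0.6}}

def backward(obs):
    # Initialization: beta_K(i) = 1
    beta = {s: 1.0 for s in states}
    # Recursion: beta_k(j) = sum_i a_ji * b_i(o_{k+1}) * beta_{k+1}(i)
    for o in reversed(obs[1:]):
        beta = {j: sum(A[j][i] * B[i][o] * beta[i] for i in states)
                for j in states}
    # Termination: P(O) = sum_i pi_i * b_i(o1) * beta_1(i)
    return sum(pi[i] * B[i][obs[0]] * beta[i] for i in states)

print(backward(['Dry', 'Rain']))
```

Agreement between the forward and backward terminations is a useful sanity check when implementing both.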
Decoding Problem • Decoding problem. Given the HMM M = (A, B, π) and the observation sequence O = o1 o2 … oK, calculate the most likely sequence of hidden states si that produced this observation sequence. • We want to find the state sequence Q = q1 … qK which maximizes P(Q | o1 o2 … oK), or equivalently P(Q, o1 o2 … oK). • Brute-force consideration of all paths takes exponential time; use the efficient Viterbi algorithm instead. • Define the variable δk(i) as the maximum probability of producing the observation sequence o1 o2 … ok when moving along any hidden state sequence q1 … qk-1 and getting into qk = si: δk(i) = max P(q1 … qk-1, qk = si, o1 o2 … ok), where the max is taken over all possible paths q1 … qk-1.
Specification of an HMM • N - the number of states – Q = {q1, q2, …, qN} - set of states • M - the number of symbols (observables) – O = {o1, o2, …, oM} - set of symbols
Specification of an HMM • A - the state transition probability matrix – aij = P(qt+1 = j | qt = i) • B - the observation probability distribution – bj(k) = P(ot = k | qt = j), 1 ≤ k ≤ M • π - the initial state distribution • The full HMM is thus specified as a triplet: λ = (A, B, π)
Central problems in HMM modelling • Problem 1, Evaluation: – Probability of occurrence of a particular observation sequence, O = {o1, …, ok}, given the model: P(O|λ) – Complicated by the hidden states – Useful in sequence classification
Central problems in HMM modelling • Problem 2, Decoding: – Find the optimal state sequence to produce the given observations, O = {o1, …, ok}, given the model – Requires an optimality criterion – Useful in recognition problems
Central problems in HMM modelling • Problem 3 Learning: – Determine optimum model, given a training set of observations – Find λ, such that P(O|λ) is maximal
Viterbi Algorithm (1) • General idea: if the best path ending in qk = sj goes through qk-1 = si, then it should coincide with the best path ending in qk-1 = si. [Diagram: states s1 … si … sN at time k-1 feeding sj at time k via arcs a1j, aij, aNj] • δk(j) = max P(q1 … qk-1, qk = sj, o1 o2 … ok) = maxi [aij bj(ok) max P(q1 … qk-1 = si, o1 o2 … ok-1)] = maxi [aij bj(ok) δk-1(i)] • To backtrack the best path, keep the info that the predecessor of sj was si.
Viterbi Algorithm (2) • Initialization: δ1(i) = max P(q1 = si, o1) = πi bi(o1), 1 <= i <= N. • Forward recursion: δk(j) = max P(q1 … qk-1, qk = sj, o1 o2 … ok) = maxi [aij bj(ok) δk-1(i)], 1 <= j <= N, 2 <= k <= K. • Termination: choose the best path ending at time K: maxi [δK(i)] • Backtrack the best path. • This algorithm is similar to the forward recursion of the evaluation problem, with Σ replaced by max and additional backtracking.
Learning Problem (1) • Learning problem. Given some training observation sequences O = o1 o2 … oK and the general structure of the HMM (numbers of hidden and visible states), determine the HMM parameters M = (A, B, π) that best fit the training data, i.e. maximize P(O | M). • There is no algorithm producing globally optimal parameter values. • Use the iterative expectation-maximization (EM) algorithm to find a local maximum of P(O | M): the Baum-Welch algorithm.
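One Baum-Welch iteration can be sketched from the forward and backward variables. The model is the Low/High weather HMM used earlier (with P(‘Dry’|‘High’) = 0.6 assumed); the four-observation training sequence is an invented illustration, not from the slides:

```python
# One Baum-Welch (EM) iteration for a 2-state HMM.
states = ['Low', 'High']
symbols = ['Rain', 'Dry']
pi = {'Low': 0.4, 'High': 0.6}
A = {'Low': {'Low': 0.3, 'High': 0.7}, 'High': {'Low': 0.2, 'High': 0.8}}
B = {'Low': {'Rain': 0.6, 'Dry': 0.4}, 'High': {'Rain': 0.4, 'Dry': 0.6}}
obs = ['Dry', 'Rain', 'Rain', 'Dry']   # illustrative training data
K = len(obs)

# Forward and backward variables for every time step.
alpha = [{s: pi[s] * B[s][obs[0]] for s in states}]
for o in obs[1:]:
    alpha.append({j: sum(alpha[-1][i] * A[i][j] for i in states) * B[j][o]
                  for j in states})
beta = [{s: 1.0 for s in states}]
for o in reversed(obs[1:]):
    beta.insert(0, {j: sum(A[j][i] * B[i][o] * beta[0][i] for i in states)
                    for j in states})
pO = sum(alpha[-1][s] for s in states)

# E-step: xi[k][i][j] = P(q_k=i, q_{k+1}=j | O); gamma[k][i] = P(q_k=i | O).
xi = [{i: {j: alpha[k][i] * A[i][j] * B[j][obs[k + 1]] * beta[k + 1][j] / pO
           for j in states} for i in states} for k in range(K - 1)]
gamma = [{i: alpha[k][i] * beta[k][i] / pO for i in states} for k in range(K)]

# M-step: re-estimate pi, A, B from expected counts.
new_pi = {i: gamma[0][i] for i in states}
new_A = {i: {j: sum(xi[k][i][j] for k in range(K - 1)) /
                sum(gamma[k][i] for k in range(K - 1)) for j in states}
         for i in states}
new_B = {i: {v: sum(gamma[k][i] for k in range(K) if obs[k] == v) /
                sum(gamma[k][i] for k in range(K)) for v in symbols}
         for i in states}

print(new_pi, new_A, new_B)
```

Each re-estimated row is a proper distribution (sums to 1) by construction, which is a cheap correctness check; in practice this E/M pair is iterated until P(O | M) stops improving.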
Learning Problem (2) • If the training data has information about the sequence of hidden states (as in the word recognition example), then use maximum likelihood estimation of the parameters: aij = P(si | sj) = (number of transitions from state sj to state si) / (number of transitions out of state sj), bi(vm) = P(vm | si) = (number of times observation vm occurs in state si) / (number of times state si occurs)
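The counting estimates above are straightforward when states are observed. A minimal sketch; the tiny labeled corpus is an invented illustration (not data from the slides), and the helper names `a` and `b` follow the deck's aij = P(si | sj) convention:

```python
from collections import Counter

# Maximum likelihood estimation from fully labeled (state, observation) data.
pairs = [  # (hidden state sequence, observation sequence) - illustrative
    (['Low', 'High', 'High'], ['Rain', 'Dry', 'Dry']),
    (['High', 'High', 'Low'], ['Dry', 'Rain', 'Rain']),
]

trans, trans_out = Counter(), Counter()
emit, in_state = Counter(), Counter()
for state_seq, obs_seq in pairs:
    for prev, cur in zip(state_seq, state_seq[1:]):
        trans[(prev, cur)] += 1     # count each observed transition
        trans_out[prev] += 1        # count transitions leaving each state
    for s, o in zip(state_seq, obs_seq):
        emit[(s, o)] += 1           # count each (state, observation) pair
        in_state[s] += 1            # count visits to each state

def a(i, j):
    """a_ij = P(s_i | s_j): count(j -> i) / count(transitions out of j)."""
    return trans[(j, i)] / trans_out[j]

def b(s, v):
    """b_s(v) = P(v | s): count(v emitted in s) / count(s)."""
    return emit[(s, v)] / in_state[s]

print(a('High', 'Low'), b('High', 'Dry'))
```

When the state sequence is hidden, these hard counts are replaced by the expected counts (γ and the arc probabilities) of the Baum-Welch algorithm.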
- Slides: 64