Probabilistic Model of Sequences Ata Kaban The University

Probabilistic Model of Sequences Ata Kaban The University of Birmingham

Sequence • • Example 1: a b a c Example 2: 1 0 0 1 Example 3: 1 2 3 4 5 6 1 2 3 Roll a six-sided die N times. You get a sequence. Roll it again: You get another sequence. Here is a sequence of characters, can you see it? What is a sequence? Alphabet 1 = {a, b, c}, Alphabet 2={0, 1}, Alphabet 3={1, 2, 3, 4, 5, 6}

Probabilistic Model • Model = system that simulates the sequence under consideration • Probabilistic model = model that produces different outcomes with different probabilities – It includes uncertainty – It can therefore simulate a whole class of sequences & assigns a probability to each individual sequence • Could you simulate any of the sequences on the previous slide?

Random sequence model • Back to the die example (can possibly be loaded) – Model of a roll: has 6 parameters: p 1, p 2, p 3, p 4, p 5, p 6 – Here, p_i is the probability of throwing i – To be probabilities, these must be non-negative and must sum to one. – What is the probability of the sequence [1, 6, 3]? p 1*p 6*p 3 • NOTE: in the random sequence model, the individual symbols in a sequence do not depend on each other. This is the simplest sequence model.

Maximum Likelihood parameter estimation • The parameters of a probabilistic model are typically estimated from large sets of trusted examples, called training set. • Example (t=tail, h=head) : [t t t h h t] – Count up the frequencies: t 5, h 3 – Compute probabilities: • p(t)=5/(5+3), p(h)=3/(5+3) – These are the Maximum Likelihood (ML) estimates of the parameters of the coin. – Does it make sense? – What if you know the coin is fair?

Overfitting • A fair coin has probabilities p(t)=0. 5, p(h)=0. 5 • If you throw it 3 times and get [t, t, t], then the ML estimates for this sequence are p(t)=1, p(h)=0. • Consequently, from these estimates, the probability of e. g. the sequence [h, t, h, t] = …………. • This is an example of what is called overfitting. Overfitting is the greatest enemy of Machine Learning! • Solution 1: get more data • Solution 2: build in what you already know into the model. (Will return to it during the module)

Why is it called Maximum Likelihood? • It can be shown that using the frequencies to compute probabilities maximises the total probability of all the sequences given the model (the likelihood). P(Data|parameters)

Probabilities • Have two dice D 1 and D 2 • The probability of rolling I given die D 1 is called P(i|D 1). This is a conditional probability • Pick a die at random with probability P(Dj), j=1 or 2. The probability for picking die Dj and rolling i is is called joint probability and is P(I, Dj)=P(Dj)P(I|Dj). • For any events X and Y, P(X, Y)=P(X|Y)P(Y) • If we know P(X, Y), then the so-called marginal probability p(X) can be computed as

• Now, we show that maximising P(Data|parameters) for the random sequence model leads to the frequency-based computation that we did intuitively.

Why did we bother? Because in more complicated models we cannot ‘guess’ the result.

Markov Chains • Further examples of sequences: – Bio-sequences – Web page request sequences while browsing • These are not anymore random sequences, but have a time-structure. • How many parameters would such a model have? • We need to make simplifying assumptions to end up with a reasonable number of parameters • The first order Markov assumption: the observation only depends on the immediately previous one, no longer history • Markov Chain = sequence model which makes the Markov assumption

Markov Chains • The probability of a Markov sequence: • The alphabet’s symbols are also called states • Once the parameters are estimated from training data, the Markov chain can be used for prediction • Amongst others, Markov Chains are successful for web browsing behavior prediction

Markov Chains • A Markov Chain is stationary if at any time, it has the same transition probabilities. • We assume stationary models here. • Then the parameters of the model consist of the transition probability matrix & initial state probabilities.

ML parameter estimation • We can derive how to compute the parameters of a Markov Chain from data, using Maximum Likelihood, as we did for random sequences. • The ML estimate of the transition matrix will be again very intuitive: Remember that

Simple example • If it is raining today, it will rain tomorrow with probability 0. 8 implies the contrary has probability 0. 2 • If it is not raining today, it will rain tomorrow with probability 0. 6 implies the contrary has probability 0. 4 • Build the transition matrix • Be careful which numbers need to sum to one and which don’t. Such a matrix is called stochastic matrix. • Q: It rained all week, including today. What does this model predict for tomorrow? Why? What does it predict for a day from tomorrow? (*Homework)

Examples of Web Applications • HTTP request prediction: – To predict the probabilities of the next requests from the same user based on the history of requests from that client. • Adaptive Web navigation: – To build a navigation agent which suggests which other links would be of interest to the user based on the statistics of previous visits. – The predicted link does not strictly have to be a link present in the Web page currently being viewed. • Tour generation: – Is given as input the starting URL and generates a sequence of states (or URLs) using the Markov chain process.

Building Markov Models from Web Log Files • A Web log file is a collection of records of user requests for documents on a Web site, an example: 177. 21. 3. 4 - - [04/Apr/1999: 00: 01: 11 +0100] "GET /studaffairs/ccampus. html HTTP/1. 1" 200 5327 "http: //www. ulst. ac. uk/studaffairs/accomm. html" "Mozilla/4. 0 (compatible; MSIE 4. 01; Windows 95)" • Transition matrix can be seen as a graph – Link pair: (r - referrer, u - requested page, w - hyperlink weight) – Link graph: it is called the state diagram of the Markov. Chain • a directed weighted graph • a hierarchy from the homepage down to multiple levels

Link Graph: an example (University of Ulster site) Zhu et al. 2002 State diagram: - Nodes: states - Weighted arrows: number of transitions

Experimental Results (Sarukkai, 2000) • Simulations : – ‘Correct link’ refers to the actual link chosen at the next step. – ‘depth of the correct link’ is measured by counting the umber of links which have a probability greater than or equal to the correct link. – Over 70% of correct links are in the top 20 scoring states. – Difficulties: very large state space

Simple exercise • Build the Markov transition matrix of the following sequence: [a b b a c a b c b b d e e d e d] State space: {……………. }

Further topics • Hidden Markov Model – Does not make the Markov assumption on the observed sequence – Instead, it assumes that the observed sequence was generated by another sequence which is unobservable (hidden), and this other sequence is assumed to be Markovian – More powerful – Estimation is more complicated • Aggregate Markov model – Useful for clustering sub-graphs of a transition graph

HMM at an intuitive level • Suppose that we know all the parameters of the following HMM, as shown on the state-diagram below. What is the probability of observing the sequence [A, B] if the initial state is S 1? The same question if the initial state is chosen randomly with equal probabilities. ANSWER: If the initial state is S 1: 0. 2*(0. 4*0. 8+0. 6*0. 7) = 0. 148. In the second case: 0. 5*0. 148+0. 5*0. 3*(0. 3*0. 7+0. 7 *0. 8) = 0. 1895.

Conclusions • Probabilistic Model • Maximum Likelihood parameter estimation • Random sequence model • Markov chain model ---------------- • Hidden Markov Model • Aggregate Markov Model

Any questions?