Machine Learning for Sequential Data
Thomas G. Dietterich
Department of Computer Science
Oregon State University
Corvallis, Oregon 97331
http://www.cs.orst.edu/~tgd
Outline
- Sequential Supervised Learning
- Research Issues
- Methods for Sequential Supervised Learning
- Concluding Remarks
Some Example Learning Problems
- Cellular Telephone Fraud
- Part-of-Speech Tagging
- Information Extraction from the Web
- Hyphenation for Word Processing
Cellular Telephone Fraud
- Given the sequence of recent telephone calls, can we determine which calls (if any) are fraudulent?
Part-of-Speech Tagging
- Given an English sentence, can we assign a part of speech to each word?
- "Do you want fries with that?"
- <verb pron verb noun prep pron>
Information Extraction from the Web
- Example HTML: <dl><dt><b>Srinivasan Seshan</b> (Carnegie Mellon University) <dt><a href=…><i>Making Virtual Worlds Real</i></a> <dt>Tuesday, June 4, 2002 <dd>2:00 PM, 322 Sieg <dd>Research Seminar
- Fields to extract: name, affiliation, title, date, time, location, event-type
Hyphenation
- "Porcupine" → "001010000" (a 1 marks a letter after which a hyphen is allowed: por-cu-pine)
Sequential Supervised Learning (SSL)
- Given: a set of training examples of the form (Xi, Yi), where Xi = ⟨xi,1, …, xi,Ti⟩ and Yi = ⟨yi,1, …, yi,Ti⟩ are sequences of length Ti
- Find: a function f for predicting new sequences: Y = f(X)
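As a concrete illustration of this input/output format, here is a minimal Python sketch; the type signatures and the trivial baseline predictor are illustrative placeholders, not part of any SSL method.

```python
from typing import List, Sequence

def check_example(X: Sequence[str], Y: Sequence[str]) -> None:
    # One training example (X_i, Y_i) is a pair of sequences of equal length T_i.
    # SSL requires one label y_{i,t} for every input element x_{i,t}.
    assert len(X) == len(Y), "input and label sequences must have the same length"

def constant_baseline(X: Sequence[str], label: str = "ok") -> List[str]:
    # A sequential supervised learner produces a function f with Y = f(X).
    # This trivial placeholder assigns the same label to every position.
    return [label] * len(X)

if __name__ == "__main__":
    X = ["Do", "you", "want", "fries", "with", "that"]
    Y = ["verb", "pron", "verb", "noun", "prep", "pron"]
    check_example(X, Y)
    print(constant_baseline(X))
```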
Examples as Sequential Supervised Learning
Domain | Input Xi | Output Yi
Telephone Fraud | sequence of calls | sequence of labels {ok, fraud}
Part-of-Speech Tagging | sequence of words | sequence of parts of speech
Information Extraction | sequence of tokens | sequence of field labels {name, …}
Hyphenation | sequence of letters | sequence of {0, 1} (1 = hyphen ok)
Two Kinds of Relationships
- Relationships between the xt's and yt's
  - Example: "Friday" is usually a "date"
- Relationships among the yt's
  - Example: "name" is usually followed by "affiliation"
- SSL can (and should) exploit both kinds of information
Two Other Tasks that are Not SSL
- Sequence Classification
- Time-Series Prediction
Sequence Classification
- Given an input sequence, assign one label to the entire sequence
- Example: recognize a person from their handwriting
  - Input sequence: sequence of pen strokes
  - Output label: name of the person
Time-Series Prediction
- Given a sequence ⟨y1, …, yt⟩, predict yt+1
- Example: predict the unemployment rate for next month based on the history of unemployment rates
Key Differences
- In SSL, there is one label yi,t for each input xi,t (unlike sequence classification)
- In SSL, we are given the entire X sequence before we need to predict any of the yt values
- In SSL, we do not have any of the true y values when we predict yt+1 (unlike time-series prediction)
Outline
- Sequential Supervised Learning
- Research Issues
- Methods for Sequential Supervised Learning
- Concluding Remarks
Research Issues for SSL
- Loss Functions
  - How do we measure performance?
- Feature Selection and Long-Distance Interactions
  - How do we model relationships among the yt's, especially long-distance effects?
- Computational Cost
  - How do we make it efficient?
Basic Loss Functions
- Count the number of entire sequences Yi correctly predicted (i.e., every yi,t must be right)
- Count the number of individual labels yi,t correctly predicted
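These two counts correspond directly to two accuracy measures. A minimal Python sketch, assuming label sequences are stored as lists of strings; the function names are mine, not terminology from the talk.

```python
from typing import Sequence

def whole_sequence_accuracy(Y_true: Sequence[Sequence[str]],
                            Y_pred: Sequence[Sequence[str]]) -> float:
    """Fraction of sequences Y_i in which every label y_{i,t} is correct."""
    correct = sum(1 for yt, yp in zip(Y_true, Y_pred) if list(yt) == list(yp))
    return correct / len(Y_true)

def per_label_accuracy(Y_true: Sequence[Sequence[str]],
                       Y_pred: Sequence[Sequence[str]]) -> float:
    """Fraction of individual labels y_{i,t} predicted correctly."""
    correct = total = 0
    for yt, yp in zip(Y_true, Y_pred):
        correct += sum(1 for a, b in zip(yt, yp) if a == b)
        total += len(yt)
    return correct / total

if __name__ == "__main__":
    true = [["verb", "pron", "verb"], ["0", "0", "1"]]
    pred = [["verb", "pron", "noun"], ["0", "0", "1"]]
    print(whole_sequence_accuracy(true, pred))  # 0.5 (one of two sequences fully correct)
    print(per_label_accuracy(true, pred))       # 5/6 of individual labels correct
```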
More Complex Loss Functions
- Telephone fraud example: a sequence of phone calls with true labels 0 0 0 1 1 1 1
- The loss is computed based on the first "fraudulent" prediction
More Complex Loss Functions (2)
- Hyphenation
  - False positives are very bad
  - Need at least one correct hyphen near the middle of the word
Hyphenation Loss
- Perfect: "qual-i-fi-ca-tion"
- Very good: "quali-fi-cation"
- OK: "quali-fication", "qualifi-cation"
- Worse: "qual-ification", "qualifica-tion"
- Very bad: "qua-lification", "qualificatio-n"
Feature Selection and Long-Distance Effects
- Any solution to SSL must employ some form of divide-and-conquer
- How do we determine the information relevant for predicting yt?
Long-Distance Effects
- Consider the text-to-speech problem:
  - "photograph" → /f-Ot@graf-/
  - "photography" → /f-@tAgr@f-i/
- The letter "y" changes the pronunciation of all vowels in the word!
Standard Feature Selection Methods
- Wrapper methods with forward selection or backwards elimination
- Optimize feature weights
- Measures of feature influence
- Fit simple models to test for relevance
Wrapper Methods
- Stepwise regression
- Wrapper methods (Kohavi et al.)
- Problem: very inefficient with large numbers of possible features
Optimizing the Feature Weights
- Start with all features in the model
- Encourage the learning algorithm to remove irrelevant features
- Problem: there are too many possible features; we can't include them all in the model
Measures of Feature Influence
- Importance of single features
  - Mutual information, correlation
- Importance of feature subsets
  - Schema racing (Moore et al.)
  - RELIEFF (Kononenko et al.)
- Question: Will subset methods scale to thousands of features?
Fitting Simple Models
- Fit simple models using all of the features; analyze the resulting model to determine feature importance
  - Belief networks and Markov blanket analysis
  - L1 Support Vector Machines
- Prediction: these will be the most practical methods
Outline
- Sequential Supervised Learning
- Research Issues
- Methods for Sequential Supervised Learning
- Concluding Remarks
Methods for Sequential Supervised Learning
- Sliding Windows
- Recurrent Sliding Windows
- Hidden Markov Models and company
  - Maximum Entropy Markov Models
  - Input-Output HMMs
  - Conditional Random Fields
Sliding Windows
- Sentence: "Do you want fries with that"
- (___ Do you) → verb
- (Do you want) → pron
- (you want fries) → verb
- (want fries with) → noun
- (fries with that) → prep
- (with that ___) → pron
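A sketch of the window construction, assuming a window of one word on each side and a "___" padding token as on the slide; the function name and default parameters are illustrative.

```python
from typing import List, Sequence, Tuple

def make_windows(X: Sequence[str], Y: Sequence[str],
                 half_width: int = 1, pad: str = "___") -> List[Tuple[List[str], str]]:
    """Convert one (X, Y) pair into ordinary supervised examples:
    each example is the window of inputs around position t, labeled with y_t."""
    padded = [pad] * half_width + list(X) + [pad] * half_width
    examples = []
    for t, label in enumerate(Y):
        window = padded[t : t + 2 * half_width + 1]
        examples.append((window, label))
    return examples

if __name__ == "__main__":
    X = ["Do", "you", "want", "fries", "with", "that"]
    Y = ["verb", "pron", "verb", "noun", "prep", "pron"]
    for window, label in make_windows(X, Y):
        print(window, "->", label)
```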
Properties of Sliding Windows
- Converts SSL to ordinary supervised learning
- Only captures the relationship between (part of) X and yt; does not explicitly model relations among the yt's
- Does not capture long-distance interactions
- Assumes each window is independent
Recurrent Sliding Windows
- Sentence: "Do you want fries with that"
- (___ Do you, prev = ___) → verb
- (Do you want, prev = verb) → pron
- (you want fries, prev = pron) → verb
- (want fries with, prev = verb) → noun
- (fries with that, prev = noun) → prep
- (with that ___, prev = prep) → pron
Recurrent Sliding Windows
- Key idea: include yt as an input feature when computing yt+1
- During training:
  - Use the correct value of yt
  - Or train iteratively (especially recurrent neural networks)
- During evaluation:
  - Use the predicted value of yt
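A sketch of the evaluation-time recurrence (left-to-right), assuming the base classifier is exposed as a plain function that takes the window plus the previous predicted label; all names here are placeholders.

```python
from typing import Callable, List, Sequence

def recurrent_predict(X: Sequence[str],
                      classify: Callable[[List[str], str], str],
                      half_width: int = 1, pad: str = "___") -> List[str]:
    """Left-to-right recurrent sliding window: the label predicted at
    position t-1 is fed in as an extra feature when predicting y_t."""
    padded = [pad] * half_width + list(X) + [pad] * half_width
    preds: List[str] = []
    prev_label = pad  # no previous prediction at the first position
    for t in range(len(X)):
        window = padded[t : t + 2 * half_width + 1]
        y_t = classify(window, prev_label)   # classifier sees the window and the previous label
        preds.append(y_t)
        prev_label = y_t                     # feed the prediction forward
    return preds

if __name__ == "__main__":
    # Toy classifier: a lookup on the center word (ignores the previous label).
    table = {"Do": "verb", "you": "pron", "want": "verb",
             "fries": "noun", "with": "prep", "that": "pron"}
    classify = lambda window, prev: table.get(window[1], "unk")
    print(recurrent_predict(["Do", "you", "want", "fries", "with", "that"], classify))
```

A right-to-left pass is the same loop run over the reversed sequence.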
Properties of Recurrent Sliding Windows
- Captures relationships among the y's, but only in one direction!
- Results on text-to-speech (percent correct):
Method | Direction | Words | Letters
sliding window | none | 12.5% | 69.6%
recurrent s.w. | left-right | 17.0% | 67.9%
recurrent s.w. | right-left | 24.4% | 74.2%
Hidden Markov Models
(Graphical model: a chain y1 → y2 → y3, with each yt generating its xt)
- The yt's are generated as a Markov chain
- The xt's are generated independently (as in naive Bayes or Gaussian classifiers)
Hidden Markov Models (2)
- Models both the xt ↔ yt relationships and the yt ↔ yt+1 relationships
- Does not handle long-distance effects
  - Everything must be captured by the current label yt
- Does not permit rich X ↔ yt relationships
  - Unlike the sliding window, we can't use several xt's to predict yt
Using HMMs
- Training
  - Extremely simple, because the yt's are known on the training set
- Execution: dynamic programming methods
  - If the loss function depends on the whole sequence, use the Viterbi algorithm: argmax_Y P(Y | X)
  - If the loss function depends on individual yt predictions, use the forward-backward algorithm: argmax_{yt} P(yt | X)
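A minimal Viterbi sketch under this HMM factorization, working in log space; the parameter dictionaries in the usage example are toy values, not estimates from any data set. The forward-backward computation has the same structure but sums over paths instead of maximizing.

```python
import math
from typing import Dict, List, Sequence

def viterbi(obs: Sequence[str], states: Sequence[str],
            log_pi: Dict[str, float],                        # log P(y_1)
            log_A: Dict[str, Dict[str, float]],              # log P(y_{t+1} | y_t)
            log_B: Dict[str, Dict[str, float]]) -> List[str]:  # log P(x_t | y_t)
    """argmax_Y P(Y | X) for an HMM, by dynamic programming."""
    V = [{s: log_pi[s] + log_B[s].get(obs[0], -math.inf) for s in states}]
    back = []
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            # Emission does not depend on the predecessor, so it can be added afterwards.
            best_prev = max(states, key=lambda p: V[t - 1][p] + log_A[p][s])
            back[-1][s] = best_prev
            V[t][s] = V[t - 1][best_prev] + log_A[best_prev][s] + log_B[s].get(obs[t], -math.inf)
    # Trace back the best path from the best final state.
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for bp in reversed(back):
        path.append(bp[path[-1]])
    return list(reversed(path))

if __name__ == "__main__":
    states = ["ok", "fraud"]
    log_pi = {"ok": math.log(0.9), "fraud": math.log(0.1)}
    log_A = {"ok": {"ok": math.log(0.9), "fraud": math.log(0.1)},
             "fraud": {"ok": math.log(0.3), "fraud": math.log(0.7)}}
    log_B = {"ok": {"short": math.log(0.8), "long": math.log(0.2)},
             "fraud": {"short": math.log(0.2), "long": math.log(0.8)}}
    print(viterbi(["short", "long", "long"], states, log_pi, log_A, log_B))
```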
HMM Alternatives: Maximum Entropy Markov Models
(Graphical model: a chain y1 → y2 → y3, with each xt feeding into its yt)
MEMM Properties
- Permits complex X ↔ yt relationships by employing a sparse maximum entropy model of P(yt+1 | X, yt):
  P(yt+1 | X, yt) ∝ exp(Σb αb fb(X, yt+1)), where each fb is a boolean feature
- Training can be expensive (gradient descent or iterative scaling)
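A sketch of the per-step conditional model, covering only the scoring and local normalization (not the gradient-descent or iterative-scaling training loop). As an extension of the slide's notation, the boolean features here are also allowed to look at the previous label and the position; all feature definitions and weights are toy values.

```python
import math
from typing import Callable, Dict, List, Sequence

def memm_step_distribution(X: Sequence[str], t: int, y_prev: str,
                           labels: Sequence[str],
                           features: List[Callable[[Sequence[str], int, str, str], int]],
                           weights: List[float]) -> Dict[str, float]:
    """P(y_{t+1} | X, y_t): a locally normalized (maximum entropy) next-label model.
    Each feature looks at the input X, the position, the previous label, and the
    candidate next label, and returns 0 or 1."""
    scores = {}
    for y in labels:
        s = sum(w * f(X, t, y_prev, y) for f, w in zip(features, weights))
        scores[y] = math.exp(s)
    Z = sum(scores.values())          # local normalization at every step
    return {y: v / Z for y, v in scores.items()}

if __name__ == "__main__":
    labels = ["name", "affiliation", "date"]
    # Toy boolean features (illustrative only).
    features = [
        lambda X, t, yp, y: int(y == "date" and any(ch.isdigit() for ch in X[t])),
        lambda X, t, yp, y: int(yp == "name" and y == "affiliation"),
    ]
    weights = [2.0, 1.5]
    X = ["Srinivasan", "Seshan", "Carnegie", "Mellon", "June", "4"]
    print(memm_step_distribution(X, 5, "name", labels, features, weights))
```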
HMM Alternatives (2): Input/Output HMM
(Graphical model: a chain of hidden states h1 → h2 → h3 sitting between the inputs x1, x2, x3 and the output labels y1, y2, y3)
IOHMM Properties
- Hidden states permit "memory" of long-distance effects (beyond what is captured by the class labels)
- As with the MEMM, arbitrary features of the input X can be used to predict yt
Label Bias Problem
- Forward models that are normalized at each step exhibit a problem
- Consider a domain with only two sequences: "rib" → "111" and "rob" → "222"
- Consider what happens when an MEMM sees the sequence "rib"
Label Bias Problem (2)
- After "r", labels 1 and 2 have the same probability. After "i", label 2 must still send all of its probability forward, even though it was expecting "o". Result: both output strings "111" and "222" have the same probability.
(State diagram: "r" leads to label 1 or label 2; label 1 expects "i", label 2 expects "o"; both paths then read "b")
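A tiny numeric illustration of this argument, with hand-set locally normalized transition tables (toy values): because each state must pass all of its probability mass forward, "111" and "222" end up equally likely given "rib".

```python
from typing import Dict, Tuple

# Locally normalized next-label distributions P(y_{t+1} | y_t, x_{t+1}).
# Label 1 was only ever followed by label 1, and label 2 by label 2,
# so each state has a single outgoing option and must give it probability 1.
trans: Dict[Tuple[str, str], Dict[str, float]] = {
    ("start", "r"): {"1": 0.5, "2": 0.5},
    ("1", "i"): {"1": 1.0, "2": 0.0},
    ("1", "o"): {"1": 1.0, "2": 0.0},
    ("2", "i"): {"1": 0.0, "2": 1.0},   # must send all probability forward,
    ("2", "o"): {"1": 0.0, "2": 1.0},   # even on the unexpected letter "i"
    ("1", "b"): {"1": 1.0, "2": 0.0},
    ("2", "b"): {"1": 0.0, "2": 1.0},
}

def seq_prob(labels: str, word: str) -> float:
    p, prev = 1.0, "start"
    for y, x in zip(labels, word):
        p *= trans[(prev, x)].get(y, 0.0)
        prev = y
    return p

print(seq_prob("111", "rib"), seq_prob("222", "rib"))  # both 0.5: label bias
```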
Conditional Random Fields
(Graphical model: an undirected chain over y1, y2, y3, conditioned on the inputs x1, x2, x3)
- The yt's form a Markov random field conditioned on X
Representing the CRF parameters
- Each undirected arc yt ↔ yt+1 represents a potential function:
  M(yt, yt+1 | X) = exp[Σa λa fa(yt, yt+1, X) + Σb βb gb(yt, X)], where fa and gb are arbitrary boolean features
Using CRFs
- P(Y | X) ∝ M(y1, y2 | X) · M(y2, y3 | X) · … · M(yT-1, yT | X)
- Training: gradient descent or iterative scaling
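A brute-force sketch of this globally normalized score, assuming a tiny label set and a short sequence so the normalizer can be computed by enumerating every label sequence (real CRF implementations use forward-backward instead); the potential function below is a hand-set toy, not meaningful hyphenation rules.

```python
import math
from itertools import product
from typing import Callable, Sequence

def crf_prob(Y: Sequence[str], X: Sequence[str], labels: Sequence[str],
             potential: Callable[[str, str, Sequence[str], int], float]) -> float:
    """P(Y | X) proportional to the product of edge potentials M(y_t, y_{t+1} | X),
    normalized over all possible label sequences (feasible only for toy problems)."""
    def score(ys: Sequence[str]) -> float:
        return math.prod(potential(ys[t], ys[t + 1], X, t) for t in range(len(ys) - 1))
    Z = sum(score(ys) for ys in product(labels, repeat=len(X)))  # global normalizer
    return score(Y) / Z

if __name__ == "__main__":
    labels = ["0", "1"]
    X = list("porcupine")
    def potential(y_t, y_next, X, t):
        # exp of two weighted boolean features, in the spirit of M(y_t, y_{t+1} | X).
        f_no_double_hyphen = 1.0 if not (y_t == "1" and y_next == "1") else 0.0
        g_letter_is_vowel = 1.0 if (y_next == "1" and X[t + 1] in "aeiou") else 0.0
        return math.exp(2.0 * f_no_double_hyphen + 0.5 * g_letter_is_vowel)
    Y = list("001010000")
    print(crf_prob(Y, X, labels, potential))
```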
CRFs on Part-of-Speech Tagging (error rates in percent; Lafferty, McCallum & Pereira, 2001)
 | HMM | MEMM | CRF
baseline | 5.69 | 6.37 | 5.55
spelling features | 5.69 | 4.87 | 4.27
spelling features (OOV) | 45.99 | 26.99 | 23.76
Summary of Methods
Issue | SW | RSW | HMM | MEMM | IOHMM | CRF
xt ↔ yt | YES | YES | YES | YES | YES | YES
yt ↔ yt+1 | NO | Partly | YES | YES | YES | YES
long distance? | NO | Partly | NO | NO | YES? | NO
X ↔ yt rich? | YES | YES | NO | YES | YES | YES
efficient? | YES | YES | YES | NO | NO | NO
label bias ok? | YES | YES | YES | NO | NO | YES
Loss Functions and Training
- Kakade, Teh & Roweis (2002) show that if the loss function depends only on errors in the individual yt's, then MEMMs, IOHMMs, and CRFs should be trained to maximize the likelihood P(yt | X) instead of P(Y | X) or P(X, Y)
Concluding Remarks
- Many applications of pattern recognition can be formalized as Sequential Supervised Learning
- Many methods have been developed specifically for SSL, but none is perfect
- Similar issues arise in other complex learning problems (e.g., spatial and relational data)