I 256: Applied Natural Language Processing. Marti Hearst, Nov 1, 2006. (Most slides originally by Barbara Rosario, modified here.)

Today: Algorithms for Classification. Binary classification: Perceptron, Winnow, Support Vector Machines (SVM), kernel methods. Multi-class classification: decision trees, Naïve Bayes, k-nearest neighbor.

Binary classification examples: spam filtering (spam vs. not spam); customer service message classification (urgent vs. not urgent); sentiment classification (positive vs. negative). Sometimes it can be convenient to treat a multi-way problem as a binary one: one class versus all the others, for each class.

Binary classification. Given: some data items that belong to a positive (+1) or a negative (-1) class. Task: train the classifier and predict the class for a new data item. Geometrically: find a separator.

Linear versus non-linear algorithms. Linearly separable data: all the data points can be correctly classified by a linear (hyperplanar) decision boundary.

Linearly separable data (figure): a linear decision boundary separates Class 1 from Class 2.

Non-linearly separable data (figure): Class 1 and Class 2 cannot be separated by a linear boundary.

Non-linearly separable data (figure): a non-linear classifier separates Class 1 from Class 2.

Simple linear algorithms: the Perceptron and Winnow algorithms. Binary classification; online (process the data sequentially, one data point at a time); mistake-driven.

Linear binary classification. Data: {(x_i, y_i)}, i = 1...n, where x is a vector in R^d (a d-dimensional feature vector) and y in {-1, +1} is the label (class, category). Question: design a linear decision boundary wx + b = 0 (the equation of a hyperplane) such that the classification rule associated with it has minimal probability of error. Classification rule: y = sign(wx + b), which means: if wx + b > 0 then y = +1 (positive example); if wx + b < 0 then y = -1 (negative example). From Gert Lanckriet, Statistical Learning Theory Tutorial.
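
As a minimal illustration (not part of the original slides), the classification rule translates directly into code; the weight vector w, bias b, and feature vector x below are placeholders:

```python
import numpy as np

def decision(w, b, x):
    """Linear classification rule: y = sign(w·x + b)."""
    return 1 if np.dot(w, x) + b > 0 else -1
```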

Linear binary classification: find a good hyperplane (w, b) in R^(d+1) that correctly classifies as many data points as possible. In online fashion: try one data point at a time, updating the weights as necessary. Hyperplane: wx + b = 0. Classification rule: y = sign(wx + b). From Gert Lanckriet, Statistical Learning Theory Tutorial.

Perceptron algorithm. Initialize: w_1 = 0. Updating rule, for each data point x_i: if class(x_i) != decision(x_i, w_k), then w_{k+1} = w_k + y_i x_i and k = k + 1; else w_{k+1} = w_k. Function decision(x, w): if wx + b > 0 return +1, else return -1. (Figure: an update rotates the separating hyperplane from w_k x + b = 0 to w_{k+1} x + b = 0.) From Gert Lanckriet, Statistical Learning Theory Tutorial.
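
A minimal Python sketch of the update rule above; the repeated passes over the data (epochs) and the additive bias update are assumptions not spelled out on the slide:

```python
import numpy as np

def perceptron_train(X, y, epochs=10):
    """Mistake-driven perceptron: on each error, w <- w + y_i * x_i."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):                     # repeated online passes (assumption)
        for xi, yi in zip(X, y):
            if yi * (np.dot(w, xi) + b) <= 0:   # misclassified (or on the boundary)
                w += yi * xi
                b += yi                         # bias updated additively (assumption)
    return w, b
```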

Perceptron algorithm. Online: can adjust to a changing target over time. Advantages: simple and computationally efficient; guaranteed to learn a linearly separable problem (convergence to a global optimum). Limitations: only linear separations; only converges for linearly separable data; not really efficient with many features. From Gert Lanckriet, Statistical Learning Theory Tutorial.

Winnow algorithm. Another online algorithm for learning perceptron weights: f(x) = sign(wx + b). Linear, binary classification. Update rule: again error-driven, but multiplicative (instead of additive). From Gert Lanckriet, Statistical Learning Theory Tutorial.

Winnow algorithm. Initialize: w_1 = 0. Updating rule, for each data point x_i: if class(x_i) != decision(x_i, w_k), then update the weights, with w_{k+1} = w_k + y_i x_i for the Perceptron versus w_{k+1} = w_k * exp(y_i x_i) for Winnow, and k = k + 1; else w_{k+1} = w_k. Function decision(x, w): if wx + b > 0 return +1, else return -1. (Figure: an update moves the separating hyperplane from w_k x + b = 0 to w_{k+1} x + b = 0.) From Gert Lanckriet, Statistical Learning Theory Tutorial.
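
A sketch of the multiplicative update, mirroring the perceptron code above. The initialization to ones, the learning rate eta, and the additive bias update are assumptions: the slide writes w_1 = 0, but a multiplicative rule cannot move a weight off zero.

```python
import numpy as np

def winnow_train(X, y, eta=1.0, epochs=10):
    """Error-driven multiplicative update: on a mistake, w <- w * exp(eta * y_i * x_i)."""
    w = np.ones(X.shape[1])     # start from ones, not zeros (assumption, see lead-in)
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (np.dot(w, xi) + b) <= 0:
                w *= np.exp(eta * yi * xi)      # multiplicative instead of additive
                b += eta * yi                   # bias kept additive (assumption)
    return w, b
```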

Perceptron vs. Winnow. Assume N available features, of which only K are relevant, with K << N. Perceptron: number of mistakes O(K N). Winnow: number of mistakes O(K log N). Winnow is more robust to high-dimensional feature spaces. From Gert Lanckriet, Statistical Learning Theory Tutorial.

Perceptron vs. Winnow. Both are online: they can adjust to a changing target over time. Advantages: the Perceptron is simple and computationally efficient and is guaranteed to learn a linearly separable problem; Winnow is suitable for problems with many irrelevant attributes and is used in NLP. Limitations: only linear separations; only converge for linearly separable data; not really efficient with many features. From Gert Lanckriet, Statistical Learning Theory Tutorial.

Large margin classifiers: another family of linear algorithms. Intuition (Vapnik, 1965): if the classes are linearly separable, separate the data and place the hyperplane "far" from the data (large margin); statistical results guarantee good generalization. (Figure: a separator placed close to the data, labeled BAD.) From Gert Lanckriet, Statistical Learning Theory Tutorial.

Large margin classifier. Intuition (Vapnik, 1965): if the data are linearly separable, separate the data and place the hyperplane "far" from the data (large margin); statistical results guarantee good generalization. (Figure: the maximal margin classifier, labeled GOOD.) From Gert Lanckriet, Statistical Learning Theory Tutorial.

Large margin classifier. If the data are not linearly separable: allow some errors; still, try to place the hyperplane "far" from each class. From Gert Lanckriet, Statistical Learning Theory Tutorial.

Large margin classifiers. Advantages: theoretically better (better error bounds). Limitations: computationally more expensive (a large quadratic programming problem).

Support Vector Machine (SVM): a large margin classifier. Linearly separable case. Goal: find the hyperplane w^T x + b = 0 that maximizes the margin M. (Figure: the margin boundaries w^T x_a + b = 1 and w^T x_b + b = -1 pass through the support vectors.) From Gert Lanckriet, Statistical Learning Theory Tutorial.
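
A hedged sketch of the linearly separable case using scikit-learn (not part of the original slides); the toy points and the large C value, used to approximate a hard margin, are assumptions:

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable classes (toy data).
X = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.5], [4.0, 4.0]])
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel="linear", C=1e6)   # very large C approximates the hard-margin SVM
clf.fit(X, y)
print(clf.support_vectors_)         # the points that determine the margin
print(clf.coef_, clf.intercept_)    # w and b of the maximum-margin hyperplane
```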

Support Vector Machine (SVM) applications: text classification, handwriting recognition, computational biology (e.g., micro-array data), face detection, facial expression recognition, time series prediction. From Gert Lanckriet, Statistical Learning Theory Tutorial.

Non-linear problem (figures).

Kernel methods: a family of non-linear algorithms. Transform the non-linear problem into a linear one (in a different feature space), then use linear algorithms to solve the linear problem in the new space. From Gert Lanckriet, Statistical Learning Theory Tutorial.

Basic principle of kernel methods: a mapping Φ: R^d → R^D with D >> d, and a linear separator w^T Φ(x) + b = 0 in the new space. Example: for X = [x, z], Φ(X) = [x^2, z^2, xz], so f(x) = sign(w_1 x^2 + w_2 z^2 + w_3 xz + b). From Gert Lanckriet, Statistical Learning Theory Tutorial.
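
A small sketch of this explicit feature map; the weights and example points below are made-up values chosen so that the mapped problem is linearly separable:

```python
import numpy as np

def phi(p):
    """Explicit quadratic feature map: Phi([x, z]) = [x^2, z^2, x*z]."""
    x, z = p
    return np.array([x * x, z * z, x * z])

# Points inside vs. outside the unit circle are not linearly separable in 2-D,
# but a linear rule w·Phi(p) + b in the mapped 3-D space separates them.
w, b = np.array([1.0, 1.0, 0.0]), -1.0          # hypothetical weights: x^2 + z^2 - 1
for p in [(0.2, 0.3), (1.5, 1.0)]:
    print(p, int(np.sign(np.dot(w, phi(p)) + b)))   # -1 inside, +1 outside
```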

Basic principle of kernel methods. Linear separability is more likely in high dimensions. Mapping: Φ maps the input into a high-dimensional feature space. Classifier: construct a linear classifier in that high-dimensional feature space. Motivation: an appropriate choice of Φ leads to linear separability, and we can do this efficiently! From Gert Lanckriet, Statistical Learning Theory Tutorial.

Multilayer neural networks. Also known as a multi-layer perceptron, or as artificial neural networks (to distinguish them from biological ones). Many learning algorithms exist, but the most popular is backpropagation: the output values are compared with the correct answer to compute the value of some predefined error function; the errors are propagated back through the network; the weights are adjusted to reduce the errors; and this is iterated some number of times. Can be linear or nonlinear. Tends to work very well, but is very slow to run and isn't great with huge feature sets (slow and memory-intensive).

Multilayer neural network applied to sentence boundary detection: features in the descriptor array (figure). From Palmer & Hearst '97.

Multilayer neural networks. Backpropagation algorithm: Present a training sample to the neural network. Compare the network's output to the desired output for that sample and calculate the error in each output neuron. For each neuron, calculate what the output should have been, and a scaling factor: how much lower or higher the output must be adjusted to match the desired output. This is the local error. Adjust the weights of each neuron to lower the local error. Assign "blame" for the local error to neurons at the previous level, giving greater responsibility to neurons connected by stronger weights. Repeat the steps above on the neurons at the previous level, using each one's "blame" as its error. For a detailed example, see http://galaxy.agh.edu.pl/~vlsi/AI/backp_t_en/backprop.html. From the Wikipedia article on backpropagation.
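
A compact NumPy sketch of these steps for a single hidden layer of sigmoid units; the toy XOR-style data, network size, learning rate, and iteration count are all assumptions chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # toy inputs (assumption)
y = np.array([[0], [1], [1], [0]], dtype=float)              # toy XOR targets (assumption)

W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)                # input -> hidden weights
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)                # hidden -> output weights
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for _ in range(5000):
    h = sigmoid(X @ W1 + b1)                     # forward pass: hidden layer
    out = sigmoid(h @ W2 + b2)                   # forward pass: output layer
    err_out = (out - y) * out * (1 - out)        # local error at the output neurons
    err_hid = (err_out @ W2.T) * h * (1 - h)     # "blame" assigned to the hidden layer
    W2 -= 0.5 * h.T @ err_out;  b2 -= 0.5 * err_out.sum(axis=0)   # adjust weights
    W1 -= 0.5 * X.T @ err_hid;  b1 -= 0.5 * err_hid.sum(axis=0)

print(out.round(2))   # network outputs after training
```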

Multi-classification

Multi-classification. Given: some data items that belong to one of M possible classes. Task: train the classifier and predict the class for a new data item. Geometrically: a harder problem, with no more simple geometry.

Multi-classification examples: author identification, language identification, text categorization (topics).

(Some) algorithms for multi-classification. Linear: decision trees, Naïve Bayes. Non-linear: k-nearest neighbors, neural networks.

Linear class separators (example: Naïve Bayes) (figure).

Non-linear class separators (example: k-nearest neighbor) (figure).

Decision trees. A decision tree is a classifier in the form of a tree structure, where each node is either: a leaf node, which indicates the value of the target attribute (class) of examples, or a decision node, which specifies some test to be carried out on a single attribute value, with one branch and sub-tree for each possible outcome of the test. A decision tree can be used to classify an example by starting at the root of the tree and moving through it until a leaf node is reached, which provides the classification of the instance. http://dms.irb.hr/tutorial/tut_dtrees.php

Decision tree example. Goal: learn when we can play tennis and when we cannot.

Day  Outlook   Temp.  Humidity  Wind    PlayTennis
D1   Sunny     Hot    High      Weak    No
D2   Sunny     Hot    High      Strong  No
D3   Overcast  Hot    High      Weak    Yes
D4   Rain      Mild   High      Weak    Yes
D5   Rain      Cool   Normal    Weak    Yes
D6   Rain      Cool   Normal    Strong  No
D7   Overcast  Cool   Normal    Weak    Yes
D8   Sunny     Mild   High      Weak    No
D9   Sunny     Cool   Normal    Weak    Yes
D10  Rain      Mild   Normal    Strong  Yes
D11  Sunny     Mild   Normal    Strong  Yes
D12  Overcast  Mild   High      Strong  Yes
D13  Overcast  Hot    Normal    Weak    Yes
D14  Rain      Mild   High      Strong  No

Decision tree for PlayTennis (figure): the root tests Outlook. Sunny leads to a Humidity test (High: No; Normal: Yes); Overcast leads to Yes; Rain leads to a Wind test (Strong: No; Weak: Yes). www.math.tau.ac.il/~nin/Courses/ML04/DecisionTreesCLS.pp

Decision tree for PlayTennis (figure, annotated): each internal node tests an attribute, each branch corresponds to an attribute value, and each leaf node assigns a classification. www.math.tau.ac.il/~nin/Courses/ML04/DecisionTreesCLS.pp

Decision tree for PlayTennis: classifying a new example (Outlook = Sunny, Temperature = Hot, Humidity = High, Wind = Weak). The tree tests Outlook (Sunny), then Humidity (High), and answers PlayTennis = No. www.math.tau.ac.il/~nin/Courses/ML04/DecisionTreesCLS.pp
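
The same tree written out as nested tests, a plain illustration of how the figure classifies the example (note that Temperature is never consulted):

```python
def play_tennis(outlook, humidity, wind):
    """The PlayTennis decision tree above as nested attribute tests."""
    if outlook == "Sunny":
        return "No" if humidity == "High" else "Yes"
    if outlook == "Overcast":
        return "Yes"
    return "No" if wind == "Strong" else "Yes"   # outlook == "Rain"

# The example from the slide, (Sunny, Hot, High, Weak), comes out as No.
print(play_tennis("Sunny", "High", "Weak"))
```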

Decision tree for Reuters classification (figures). Foundations of Statistical Natural Language Processing, Manning and Schuetze.

Building decision trees. Given training data, how do we construct them? The central focus of the decision-tree-growing algorithm is selecting which attribute to test at each node in the tree; the goal is to select the attribute that is most useful for classifying examples. It is a top-down, greedy search through the space of possible decision trees: it picks the best attribute and never looks back to reconsider earlier choices.

Building decision trees. Splitting criterion: finding the features and the values to split on (for example, why test "cts" first and not "vs"? why test on "cts < 2" and not "cts < 5"?). Choose the split that gives us the maximum information gain (i.e., the maximum reduction of uncertainty). Stopping criterion: when all the elements at one node have the same class, there is no need to split further. In practice, one first builds a large tree and then prunes it back (to avoid overfitting). See Foundations of Statistical Natural Language Processing, Manning and Schuetze, for a good introduction.
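
A short sketch of the splitting criterion: entropy and information gain computed on the PlayTennis table above (only the Outlook and Wind columns are used here, for brevity):

```python
import math
from collections import Counter

def entropy(labels):
    """H(S) = -sum over classes of p * log2(p)."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Reduction of uncertainty obtained by splitting on one attribute."""
    total = len(labels)
    by_value = {}
    for row, lab in zip(rows, labels):
        by_value.setdefault(row[attr], []).append(lab)
    remainder = sum(len(s) / total * entropy(s) for s in by_value.values())
    return entropy(labels) - remainder

# (Outlook, Wind) pairs and PlayTennis labels from the table above.
rows = [("Sunny", "Weak"), ("Sunny", "Strong"), ("Overcast", "Weak"), ("Rain", "Weak"),
        ("Rain", "Weak"), ("Rain", "Strong"), ("Overcast", "Weak"), ("Sunny", "Weak"),
        ("Sunny", "Weak"), ("Rain", "Strong"), ("Sunny", "Strong"), ("Overcast", "Strong"),
        ("Overcast", "Weak"), ("Rain", "Strong")]
labels = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
          "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]
print(information_gain(rows, labels, 0))   # Outlook: about 0.247 bits
print(information_gain(rows, labels, 1))   # Wind: about 0.048 bits
```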

Decision trees: strengths. Decision trees are able to generate understandable rules. They perform classification without requiring much computation. They can handle both continuous and categorical variables. They provide a clear indication of which features are most important for prediction or classification. http://dms.irb.hr/tutorial/tut_dtrees.php

Decision trees: weaknesses. Decision trees are prone to errors in classification problems with many classes and a relatively small number of training examples. They can be computationally expensive to train: all possible splits must be compared, and pruning is also expensive. Most decision-tree algorithms examine only a single field at a time, which leads to rectangular classification boxes that may not correspond well with the actual distribution of records in the decision space. http://dms.irb.hr/tutorial/tut_dtrees.php

Naïve Bayes models. Graphical models: graph theory plus probability theory. Nodes are variables; edges are conditional probabilities. (Figure: a graph over nodes A, B, C with probabilities P(A), P(B|A), P(C|A).)

Naïve Bayes models. Graphical models: graph theory plus probability theory. Nodes are variables; edges are conditional probabilities. The absence of an edge between nodes implies independence between the variables of those nodes. (Figure: the same graph over A, B, C, where with no edge between B and C, P(C|A, B) = P(C|A).)

Naïve Bayes for text classification (figure). Foundations of Statistical Natural Language Processing, Manning and Schuetze.

Naïve Bayes for text classification (figure): the topic "earn" generating the words "Shr 34 cts vs per shr".

Naïve Bayes for text classification. (Figure: a Topic node with word nodes w_1, w_2, ..., w_n as children.) The words depend on the topic: P(w_i | Topic), e.g. P(cts | earn) > P(tennis | earn). Naïve Bayes assumption: all words are independent given the topic. From the training set we learn the probabilities P(w_i | Topic) for each word and for each topic.

Naïve Bayes for text classification. To classify a new example, calculate P(Topic | w_1, w_2, ..., w_n) for each topic. Bayes decision rule: choose the topic T' for which P(T' | w_1, w_2, ..., w_n) > P(T | w_1, w_2, ..., w_n) for every T ≠ T'.
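
A self-contained sketch of this decision rule, with word likelihoods estimated from counts; the tiny two-topic corpus and the add-one smoothing are assumptions added for illustration:

```python
import math
from collections import Counter, defaultdict

def train(docs):
    """docs: list of (topic, list of words). Learn counts for P(Topic) and P(w | Topic)."""
    topic_counts = Counter(t for t, _ in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for t, words in docs:
        word_counts[t].update(words)
        vocab.update(words)
    return topic_counts, word_counts, vocab

def classify(words, topic_counts, word_counts, vocab):
    """Choose the topic maximizing log P(T) + sum_i log P(w_i | T) (add-one smoothing)."""
    n_docs = sum(topic_counts.values())
    scores = {}
    for t, n_t in topic_counts.items():
        total = sum(word_counts[t].values())
        score = math.log(n_t / n_docs)
        for w in words:
            score += math.log((word_counts[t][w] + 1) / (total + len(vocab)))
        scores[t] = score
    return max(scores, key=scores.get)

docs = [("earn", ["shr", "cts", "vs", "per", "shr"]),          # made-up toy corpus
        ("sports", ["tennis", "match", "vs", "set"])]
model = train(docs)
print(classify(["cts", "vs", "shr"], *model))                   # -> "earn"
```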

Naïve Bayes: strengths. A very simple model: easy to understand and very easy to implement. Very efficient, with fast training and classification, and modest storage requirements. Widely used because it works somewhat well for text categorization. Linear, but non-parallel, decision boundaries.

Naïve Bayes: weaknesses. The Naïve Bayes independence assumption ignores the sequential ordering of words (it uses a bag-of-words model) and is inappropriate if there are strong conditional dependencies between the variables. But even if the model is not "right", Naïve Bayes models do well in a surprisingly large number of cases, because often we are interested in classification accuracy and not in accurate probability estimates.

Multinomial Naïve Bayes (based on a paper by McCallum & Nigam '98). Features are the number of times words occur in the document, not binary (present/absent) indicators; this uses a statistical formula known as the multinomial distribution. The authors compared, on several text classification tasks, multinomial Naïve Bayes against a binary-featured, multivariate Bernoulli model. Results: the multinomial model is much better when using large vocabularies. However, they note that the Bernoulli model can handle other features (e.g., from-title) as numbers, whereas this will confuse the multinomial version. Andrew McCallum and Kamal Nigam, "A Comparison of Event Models for Naive Bayes Text Classification," AAAI/ICML-98 Workshop on Learning for Text Categorization.
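
For a concrete sense of the two event models, a small scikit-learn sketch (the four-document corpus is invented, not the data from McCallum & Nigam):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB, BernoulliNB

texts = ["shr 34 cts vs per shr", "net profit rose 12 cts",      # invented "earn"-like docs
         "tennis match win", "final set tennis upset"]            # invented "sports"-like docs
labels = ["earn", "earn", "sports", "sports"]

vec = CountVectorizer()
counts = vec.fit_transform(texts)                     # word counts -> multinomial features
mnb = MultinomialNB().fit(counts, labels)             # uses how often each word occurs
bnb = BernoulliNB(binarize=0.0).fit(counts, labels)   # uses only present/absent indicators

test = vec.transform(["cts vs shr"])
print(mnb.predict(test), bnb.predict(test))
```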

k-nearest-neighbor classification. Nearest-neighbor classification rule: to classify a new object, find the object in the training set that is most similar, then assign the category of this nearest neighbor. k-nearest neighbor (kNN): consult the k nearest neighbors and base the decision on the majority category of these neighbors; more robust than k = 1. An example of a similarity measure often used in NLP is cosine similarity.
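
A minimal kNN sketch with cosine similarity; the dense NumPy vectors stand in for whatever document representation is used (an assumption for illustration):

```python
import numpy as np
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def knn_classify(x, X_train, y_train, k=3):
    """Assign the majority category among the k training points most similar to x."""
    sims = [cosine(x, xi) for xi in X_train]
    top = np.argsort(sims)[-k:]                  # indices of the k most similar points
    return Counter(y_train[i] for i in top).most_common(1)[0][0]

# Tiny usage example with made-up 2-D "document" vectors.
X_train = np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.8]])
y_train = ["earn", "earn", "sports", "sports"]
print(knn_classify(np.array([0.8, 0.3]), X_train, y_train, k=3))
```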

1-Nearest Neighbor (figures).

3-Nearest Neighbor (figure).

3-Nearest Neighbor (figure): assign the category of the majority of the neighbors; but if one neighbor is much closer, we can weight the neighbors according to their similarity.

k-nearest-neighbor classification. Strengths: robust; conceptually simple; often works well; powerful (arbitrary decision boundaries). Weaknesses: performance is very dependent on the similarity measure used (and to a lesser extent on the number of neighbors k); finding a good similarity measure can be difficult; computationally expensive.

Summary: algorithms for classification, and linear versus non-linear classification. Binary classification: Perceptron, Winnow, Support Vector Machines (SVM), kernel methods, multilayer neural networks. Multi-class classification: decision trees, Naïve Bayes, k-nearest neighbor.

Next time: more learning algorithms; clustering.