Last lecture summary (SVM)

• Support Vector Machine • Supervised algorithm • Works both as – classifier (binary) – regressor • De facto the standard approach to linear classification • Two main ingredients: – maximum margin – kernel functions

Maximum margin Which line is best?

Define the margin of a linear classifier as the width by which the boundary could be increased before hitting a data point. The maximum margin linear classifier is the optimal linear classifier in this sense; it is the simplest kind of SVM (linear SVM). • Maximum margin intuitively feels safest. • Only the support vectors are important. • Works very well.

• The decision boundary is found by constrained quadratic optimization. • The solution is found in the form w = Σi αi yi xi, where the αi are Lagrange multipliers (see below). • Only points on the margin (i.e. the support vectors xi) have αi > 0.
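For reference, a standard statement of this optimization (taken from common SVM references, not from the slide itself): the dual problem and the resulting weight vector are

\max_{\alpha}\ \sum_i \alpha_i \;-\; \frac{1}{2}\sum_{i,j}\alpha_i\alpha_j\, y_i y_j\,(x_i\cdot x_j)
\quad\text{subject to}\quad \sum_i \alpha_i y_i = 0,\ \ \alpha_i \ge 0,
\qquad\text{with}\qquad w = \sum_i \alpha_i y_i\, x_i .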

• w does not need to be formed explicitly, because the decision function can be written entirely in terms of dot products between the support vectors and the new point. • Training an SVM: find the parameters αi and b. • Classification with an SVM: evaluate the sign of the decision function (see below).
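Written out (again in a standard form rather than the slide's own notation), the decision function is

f(x) = \operatorname{sign}\big(w\cdot x + b\big) = \operatorname{sign}\Big(\sum_i \alpha_i y_i\,(x_i\cdot x) + b\Big),

so only dot products between x and the support vectors are needed.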

• Soft margin – Allows misclassification errors. – i.e. misclassified points are allowed to lie inside the margin. – The penalty for classification errors is given by the capacity parameter C (a user-adjustable parameter). – Large C – a high penalty for classification errors. – Decreasing C – more points are allowed to move inside the margin.
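A minimal sketch (assuming scikit-learn is available; the generated dataset and the values of C are illustrative choices, not from the lecture) of how C changes a linear soft-margin SVM:

```python
# How the soft-margin constant C trades off margin width against
# misclassification of training points.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two slightly overlapping classes, so a soft margin is actually needed.
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=0)

for C in (100.0, 1.0, 0.01):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Smaller C -> weaker penalty on errors -> wider margin and usually
    # more support vectors (more points end up inside the margin).
    print(f"C={C:>6}: support vectors = {clf.n_support_.sum()}, "
          f"training accuracy = {clf.score(X, y):.2f}")
```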

CSE 802. Prepared by Martin Law

Kernel functions • The soft margin makes it possible to apply a linear classifier to linearly non-separable data sets. • What else can be done? Can we obtain a non-linear classification boundary just by extending the linear classifier machinery?

Kernels • Linear (dot product) kernel • Polynomial kernel – simple, efficient for non-linear relationships – controlled by its degree • Gaussian (RBF) kernel – controlled by its width
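The usual forms of these kernels, in one common parameterization (other conventions exist), are

k_{\text{linear}}(x,z) = x\cdot z, \qquad
k_{\text{poly}}(x,z) = (x\cdot z + 1)^{d}, \qquad
k_{\text{Gauss}}(x,z) = \exp\!\Big(-\tfrac{\lVert x-z\rVert^{2}}{2\sigma^{2}}\Big).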

Finishing SVM

SVM parameters • Training sets the parameters αi and b. • The SVM has another set of parameters called hyperparameters. – The soft margin constant C. – Any parameters the kernel function depends on • linear kernel – no hyperparameter (except for C) • polynomial – degree • Gaussian – width of Gaussian

• So which kernel and which parameters should I use? • The answer is data-dependent. • Several kernels should be tried. • Try the linear kernel first and see whether the classification can be improved with nonlinear kernels (trade-off between the quality of the kernel and the number of dimensions). • Select the kernel, its parameters, and C by cross-validation.
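A minimal sketch (assuming scikit-learn; the dataset and the parameter grids are illustrative assumptions) of selecting the kernel, its parameters, and C by cross-validation:

```python
# Kernel + parameter + C selection by grid search with cross-validation.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

pipe = make_pipeline(StandardScaler(), SVC())
param_grid = [
    {"svc__kernel": ["linear"], "svc__C": [0.1, 1, 10, 100]},
    {"svc__kernel": ["rbf"], "svc__C": [0.1, 1, 10, 100],
     "svc__gamma": [0.001, 0.01, 0.1, 1]},
]

search = GridSearchCV(pipe, param_grid, cv=5)  # 5-fold cross-validation
search.fit(X, y)
print(search.best_params_, search.best_score_)
```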

Computational aspects • Classification of new samples is very quick; training takes longer (still reasonably fast for thousands of samples). • Linear kernel – training scales roughly linearly with the number of samples. • Nonlinear kernels – training scales roughly quadratically with the number of samples.

Multiclass SVM • The SVM is defined for binary classification. • How to predict more than two classes (multiclass)? • Simplest approach: decompose the multiclass problem into several binary problems and train several binary SVMs (two decompositions are illustrated below, followed by a code sketch).

• One-versus-one: train a binary SVM for every pair of classes (1/2, 1/3, 1/4, 2/3, 2/4, 3/4) and assign a new sample to the class that wins the most pairwise comparisons.

• One-versus-rest: train one binary SVM per class, each separating that class from all the others (1/rest, 2/rest, 3/rest, 4/rest).
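A minimal sketch of both decompositions (assuming scikit-learn; the iris data set is an illustrative choice). Note that scikit-learn's SVC already applies one-versus-one internally for multiclass problems, so the explicit wrappers below only make the decomposition visible:

```python
# Multiclass classification by decomposition into binary SVMs.
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)  # 3 classes

# One-versus-one: one binary SVM per pair of classes.
ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)

# One-versus-rest: one binary SVM per class against all the others.
ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)

print(len(ovo.estimators_), len(ovr.estimators_))  # 3 and 3 for iris
```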

Resources • SVM and Kernels for Comput. Biol., Ratsch et al., PLOS Comput. Biol., 4(10), 1-10, 2008 • What is a support vector machine, W. S. Noble, Nature Biotechnology, 24(12), 1565-1567, 2006 • A tutorial on SVM for pattern recognition, C. J. C. Burges, Data Mining and Knowledge Discovery, 2, 121-167, 1998 • A User's Guide to Support Vector Machines, Asa Ben-Hur, Jason Weston

• http://support-vector-machines.org/ • http://www.kernel-machines.org/ • http://www.support-vector.net/ – companion to the book An Introduction to Support Vector Machines by Cristianini and Shawe-Taylor • http://www.kernel-methods.net/ – companion to the book Kernel Methods for Pattern Analysis by Shawe-Taylor and Cristianini • http://www.learning-with-kernels.org/ – several chapters on SVM from the book Learning with Kernels by Scholkopf and Smola are available from this site

Software • SVMlight – one of the most widely used SVM packages; fast optimization, can handle very large datasets, very efficient implementation of leave-one-out cross-validation, C++ code • SVMstruct – can model complex data, such as trees, sequences, or sets • LIBSVM – multiclass, weighted SVM for unbalanced data, cross-validation, automatic model selection, C++, Java

Naïve Bayes Classifier

Example – Play Tennis

Example – Learning Phase

P(Outlook=Sunny|Play=Yes) = 2/9

Outlook      Play=Yes   Play=No
Sunny          2/9        3/5
Overcast       4/9        0/5
Rain           3/9        2/5

Temperature  Play=Yes   Play=No
Hot            2/9        2/5
Mild           4/9        2/5
Cool           3/9        1/5

Humidity     Play=Yes   Play=No
High           3/9        4/5
Normal         6/9        1/5

Wind         Play=Yes   Play=No
Strong         3/9        3/5
Weak           6/9        2/5

P(Play=Yes) = 9/14    P(Play=No) = 5/14

Example - prediction • Answer this question: "Will we play tennis given that it is cool but sunny, the humidity is high, and a strong wind is blowing?" • i.e., predict this new instance: x'=(Outl=Sunny, Temp=Cool, Hum=High, Wind=Strong) • A good strategy is to predict arg max P(Y|cool, sunny, high, strong), where Y is Yes or No.

Example - Prediction
x'=(Outl=Sunny, Temp=Cool, Hum=High, Wind=Strong)
Look up the tables:
P(Outl=Sunny|Play=Yes) = 2/9     P(Outl=Sunny|Play=No) = 3/5
P(Temp=Cool|Play=Yes) = 3/9      P(Temp=Cool|Play=No) = 1/5
P(Hum=High|Play=Yes) = 3/9       P(Hum=High|Play=No) = 4/5
P(Wind=Strong|Play=Yes) = 3/9    P(Wind=Strong|Play=No) = 3/5
P(Play=Yes) = 9/14               P(Play=No) = 5/14
P(Yes|x') ∝ [P(Sunny|Yes) P(Cool|Yes) P(High|Yes) P(Strong|Yes)] P(Play=Yes) = 0.0053
P(No|x') ∝ [P(Sunny|No) P(Cool|No) P(High|No) P(Strong|No)] P(Play=No) = 0.0206
Since P(Yes|x') < P(No|x'), we label x' as "No".
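A small plain-Python sketch (not from the slides) that reproduces these numbers from the look-up tables:

```python
# Reproduce the hand computation above from the learning-phase tables.
from fractions import Fraction as F

prior = {"Yes": F(9, 14), "No": F(5, 14)}
likelihood = {  # P(feature value | Play) from the look-up tables
    "Yes": {"Sunny": F(2, 9), "Cool": F(3, 9), "High": F(3, 9), "Strong": F(3, 9)},
    "No":  {"Sunny": F(3, 5), "Cool": F(1, 5), "High": F(4, 5), "Strong": F(3, 5)},
}

x_new = ["Sunny", "Cool", "High", "Strong"]
for label in ("Yes", "No"):
    score = prior[label]
    for value in x_new:
        score *= likelihood[label][value]
    print(label, float(score))  # Yes: ~0.0053, No: ~0.0206

# The larger (unnormalized) posterior wins, so the prediction is "No".
```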

Another Application • Digit Recognition Classifier • X1, …, Xn ∈ {0, 1} (black vs. white pixels) • Y ∈ {5, 6} (predict whether a digit is a 5 or a 6)

Bayes Rule So how do we compute the posterior probability that the image represents a 5 given its pixels? Bayes' rule expresses this posterior in terms of the likelihood, the prior, and a normalization constant (see below). Why does this help? Because we think we might be able to specify how features are "generated" by the class label (i.e. we will try to compute the likelihood).
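Written out in its standard form, Bayes' rule for this task is

P(Y=5 \mid X_1,\dots,X_n) \;=\; \frac{\overbrace{P(X_1,\dots,X_n \mid Y=5)}^{\text{likelihood}}\ \overbrace{P(Y=5)}^{\text{prior}}}{\underbrace{P(X_1,\dots,X_n)}_{\text{normalization constant}}} .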

• Let's expand this for our digit recognition task: compute the two class posteriors P(Y=5|X1, …, Xn) and P(Y=6|X1, …, Xn). • To classify, we simply compute these two probabilities and predict based on which one is greater. • For the Bayes classifier, we need to "learn" two functions, the likelihood and the prior.

Learning prior • Let us assume training examples are generated by drawing instances at random from an unknown underlying distribution, and that a teacher then labels each example with its Y value. • A hundred independently drawn training examples will usually suffice to obtain a reasonable estimate of P(Y).

Learning likelihood • How many parameters do we need to estimate P(X1, …, Xn|Y) directly?

• So this corresponds to two distinct parameters for each of the distinct instances in the instance space for X. • Worse yet, to obtain reliable estimates of each of these parameters, we will need to observe each of these distinct instances multiple times. • For example, if X is a vector containing 30 boolean features, then we will need to estimate more than 2 billion parameters (2(2^30 − 1) ≈ 2.1 × 10^9)!

• The problem with explicitly modeling P(X1, …, Xn|Y) is that there are usually way too many parameters: – We'll run out of space. – We'll run out of time. – And we'll need tons of training data (which is usually not available).

The Naïve Bayes Model • Assume that the features Xi are conditionally independent given the class label Y (see below).
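Written out in its standard form, the naïve Bayes factorization and the resulting classification rule are

P(X_1,\dots,X_n \mid Y) = \prod_{i=1}^{n} P(X_i \mid Y),
\qquad
\hat{y} = \arg\max_{y}\ P(Y=y)\prod_{i=1}^{n} P(X_i = x_i \mid Y=y).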

Naïve Bayes Training MNIST Training Data

Naïve Bayes Training • Training in Naïve Bayes is easy: – Estimate P(Y=v) as the fraction of records with Y=v – Estimate P(Xi=u|Y=v) as the fraction of records with Y=v for which Xi=u
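A minimal sketch of this counting-based training for categorical features (plain Python; the three records shown are the first rows of the standard Play Tennis table, and the helper names are my own):

```python
# Naive Bayes training by counting, for categorical features.
from collections import Counter, defaultdict

# (Outlook, Temperature, Humidity, Wind) -> Play
data = [
    (("Sunny", "Hot", "High", "Weak"), "No"),
    (("Sunny", "Hot", "High", "Strong"), "No"),
    (("Overcast", "Hot", "High", "Weak"), "Yes"),
    # ... remaining records of the training set ...
]

class_counts = Counter(label for _, label in data)
feature_counts = defaultdict(Counter)  # counts per (feature index, label)

for features, label in data:
    for i, value in enumerate(features):
        feature_counts[(i, label)][value] += 1

n = len(data)
prior = {label: count / n for label, count in class_counts.items()}

def likelihood(i, value, label):
    # P(X_i = value | Y = label) as a fraction of records with that label
    return feature_counts[(i, label)][value] / class_counts[label]

print(prior)                          # fraction of records with each Y
print(likelihood(0, "Sunny", "No"))   # P(Outlook=Sunny | Play=No)
```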

Naïve Bayes Training • In practice, some of these counts can be zero. • Fix this by adding "virtual" counts to every value before normalizing. • This is called smoothing.
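A standard way of adding virtual counts is Laplace (add-l) smoothing; this particular formula is a common choice rather than necessarily the one on the original slide:

\hat{P}(X_i = u \mid Y = v) = \frac{\#\{X_i = u,\ Y = v\} + l}{\#\{Y = v\} + l\,\lvert\mathrm{values}(X_i)\rvert} .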

Naïve Bayes Training For binary digits, training amounts to averaging all of the training fives together and all of the training sixes together.

Naïve Bayes Classification

Assorted remarks • What's nice about Naïve Bayes is that it returns probabilities – These probabilities can tell us how confident the algorithm is – So… don't throw away these probabilities! • The Naïve Bayes assumption is almost never true – Still… Naïve Bayes often performs surprisingly well even when its assumptions do not hold. – A very good method for text processing.

Binary classifier performance

Confusion matrix (also called a contingency table)

                         Known label
                      positive    negative
Predicted   positive     TP          FP
label       negative     FN          TN

True Positives (TP) – positive and classified as positive
True Negatives (TN) – negative and classified as negative
False Positives (FP) – negative, but classified as positive
False Negatives (FN) – positive, but classified as negative

Accuracy (computed from the same confusion matrix):
Accuracy = (TP + TN) / (TP + TN + FP + FN)

Information retrieval (IR) • The user issues a query to find documents in the database. • IR systems make it possible to narrow down the set of documents that are relevant to a particular problem.

Figure: the document collection is split into documents containing what I am looking for and documents not containing what I am looking for.

How many of the things I consider to be true are actually true? Precision = TP / (TP + FP). How many of the true things do I find? Recall = TP / (TP + FN).

Precision • A measure of exactness. • A perfect precision score of 1.0 means that every result retrieved by a search was relevant. • But it says nothing about whether all relevant documents were retrieved.

Recall • A measure of completeness. • A perfect recall score of 1.0 means that all relevant documents were retrieved by the search. • But it says nothing about how many irrelevant documents were also retrieved.

Precision-Recall tradeoff • Returning all documents leads to a perfect recall of 1.0. – i.e. all relevant documents are present in the returned set. • However, precision is then not great, as not every returned result is relevant. • The relationship between them is inverse – it is possible to increase one at the cost of reducing the other. • They are not discussed in isolation. – Either values of one measure are compared at a fixed level of the other measure (e.g. precision at a recall level of 0.75), – or both are combined into the F-measure.

F-measure • Common F1 measure • General Fβ measure • β sets the relative weight of precision vs. recall: – β = 1 – weight precision and recall by the same amount – β < 1 – more weight on precision – β > 1 – more weight on recall
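The standard definitions, with P = precision and R = recall, are

F_1 = \frac{2PR}{P+R},
\qquad
F_\beta = \frac{(1+\beta^{2})\,P\,R}{\beta^{2} P + R}.

With this convention, β → 0 recovers precision and large β approaches recall, matching the weighting described above.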

Sensitivity & Specificity • Measure how 'good' a test is at detecting a binary feature of interest (disease/no disease). • There are 100 patients, 30 of whom have disease A. • A test is designed to identify who has the disease and who does not. • We want to evaluate how good the test is.

Sensitivity & Specificity

           Disease +   Disease -
Test +         25           2
Test -          5          68

Sensitivity & Specificity

           Disease +   Disease -   Total
Test +         25           2        27
Test -          5          68        73
Total          30          70       100

Sensitivity = 25/30        Specificity = 68/70

• Sensitivity measures the proportion of actual positives which are correctly identified as such (e.g. the percentage of sick people who are identified as sick): TP/(TP + FN). • Specificity measures the proportion of negatives which are correctly identified (e.g. the percentage of healthy people who are identified as healthy): TN/(TN + FP).

Performance Evaluation

Precision, Positive Predictive Value (PPV)                 TP / (TP + FP)
Recall, Sensitivity, True Positive Rate (TPR), Hit rate    TP / P = TP / (TP + FN)
False Positive Rate (FPR), Fall-out                        FP / N = FP / (FP + TN)
Specificity, True Negative Rate (TNR)                      TN / (TN + FP) = 1 - FPR
Accuracy                                                   (TP + TN) / (TP + TN + FP + FN)
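A minimal plain-Python sketch computing the quantities in this table; the counts are taken from the disease-test example above (TP = 25, FP = 2, FN = 5, TN = 68):

```python
# Binary classifier performance metrics from a confusion matrix.
TP, FP, FN, TN = 25, 2, 5, 68

precision   = TP / (TP + FP)                    # PPV
recall      = TP / (TP + FN)                    # sensitivity, TPR
fpr         = FP / (FP + TN)                    # fall-out
specificity = TN / (TN + FP)                    # TNR = 1 - FPR
accuracy    = (TP + TN) / (TP + TN + FP + FN)

print(f"precision={precision:.3f} recall={recall:.3f} "
      f"FPR={fpr:.3f} specificity={specificity:.3f} accuracy={accuracy:.3f}")
```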

Types of classifiers • A discrete (crisp) classifier – The output is only a class label, e.g. a decision tree. • A soft classifier – Yields a probability (score, confidence) for the given pattern. – A number representing the degree to which an instance is a member of a class. – A threshold is used to assign the instance to the (+) or the (−) class. – e.g. SVM, NN, naïve Bayes.

ROC Graph • Receiver Operating Characteristics. • Plot TPR vs. FPR • Sensitivity vs. (1 – Specificity). • TPR is on the Y axis, FPR on the X axis. • An ROC graph depicts relative trade-offs between benefits (true positives) and costs (false positives).

Figure: ROC space. The upper-left corner (FPR = 0, TPR = 1) is perfect classification; the diagonal corresponds to random guessing, with (0.5, 0.5) at its centre; the lower-left corner corresponds to never issuing a positive classification and the upper-right corner to always issuing one; points above the diagonal are better, points below it are worse. Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recognition Letters, 27, 861-874.

Conservative classifiers make positive classifications only with strong evidence, so they make few false positive errors, but they often have low true positive rates as well. Liberal classifiers make positive classifications with weak evidence, so they classify nearly all positives correctly, but they often have high false positive rates. In the figure, classifier A is more conservative than B.

ROC Curve Fawcett, ROC Graphs: Notes and Practical Considerations for Researchers

Lowering the threshold corresponds to moving from the conservative to the liberal areas of the graph. Fawcett, ROC Graphs: Notes and Practical Considerations for Researchers

AUC • To compare classifiers we may want to reduce ROC performance to a single scalar value representing expected performance. • A common method is to calculate the area under the ROC curve, abbreviated AUC. • Its value will always be between 0 and 1.0. • Random guessing has an area of 0.5. • Any realistic classifier should have an AUC between 0.5 and 1.0.
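A minimal sketch (assuming scikit-learn and matplotlib; the dataset and the classifier are illustrative choices) of drawing a ROC curve from a soft classifier's scores and computing its AUC:

```python
# ROC curve and AUC from the scores of a soft classifier.
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import roc_curve, roc_auc_score

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

scores = GaussianNB().fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
fpr, tpr, thresholds = roc_curve(y_te, scores)  # one point per threshold
auc = roc_auc_score(y_te, scores)

plt.plot(fpr, tpr, label=f"AUC = {auc:.3f}")
plt.plot([0, 1], [0, 1], linestyle="--", label="random guess (AUC = 0.5)")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()
```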

The AUC has an important statistical property: the AUC of a classifier is equivalent to the probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance. Classifier B is generally better than A: it has a higher AUC. The exception is the region where FPR > 0.55, where A has a slight advantage. So it is possible for a high-AUC classifier to perform worse in a specific region of ROC space than a low-AUC classifier. In practice, however, the AUC performs very well and is often used when a general measure of predictiveness is desired.

The AUC has an important statistical property: the AUC of a classifier is equivalent to the probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance. Classier B is generally better than A, it has higher AUC. However, with the exception at FPR > 0. 55 where A has slight advantage. So it is possible for a high-AUC classifier to perform worse in a specific region of ROC space than a low-AUC classifier. But in practice the AUC performs very well and is often used when a general measure of predictiveness is desired.