Support Vector Machines Note to other teachers and users of these slides. Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew's tutorials: http://www.cs.cmu.edu/~awm/tutorials. Comments and corrections gratefully received. Andrew W. Moore, Professor, School of Computer Science, Carnegie Mellon University, www.cs.cmu.edu/~awm, awm@cs.cmu.edu, 412-268-7599. Slides modified for Comp 537, Spring 2006, HKUST. Copyright © 2001, 2003, Andrew W. Moore. Nov 23rd, 2001
History • SVM is a classifier derived from statistical learning theory by Vapnik and Chervonenkis • SVMs were introduced by Boser, Guyon and Vapnik at COLT-92 • Initially popularized in the NIPS community; now an important and active field of machine learning research • Special issues of the Machine Learning Journal and the Journal of Machine Learning Research
Roadmap • Hard-Margin Linear Classifier: Maximize Margin, Support Vectors, Quadratic Programming • Soft-Margin Linear Classifier: Maximize Margin, Support Vectors, Quadratic Programming • Non-Linearly Separable Problems: XOR, transforming non-linear problems with kernels • Reference
Linear Classifiers: f(x, w, b) = sign(w·x − b). How would you classify this data?
Linear Classifiers: f(x, w, b) = sign(w·x − b). Any of these would be fine… but which is best?
Classifier Margin: f(x, w, b) = sign(w·x − b). Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.
Maximum Margin: f(x, w, b) = sign(w·x − b). The maximum margin linear classifier is the linear classifier with the, um, maximum margin. This is the simplest kind of SVM (called an LSVM: Linear SVM). Support vectors are those datapoints that the margin pushes up against.
Why Maximum Margin? 1. Intuitively this feels safest. 2. Empirically it works very well. 3. If we've made a small error in the location of the boundary (it's been jolted in its perpendicular direction), this gives us the least chance of causing a misclassification. 4. LOOCV is easy, since the model is immune to removal of any non-support-vector datapoints. 5. There's some theory (using VC dimension) that is related to (but not the same as) the proposition that this is a good thing.
Estimate the Margin: the boundary is the line w·x + b = 0, where x is a data vector, w the normal vector, and b an offset. What is the distance expression for a point x to the line w·x + b = 0?
Estimate the Margin: what is the expression for the margin?
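The two expressions above can be sketched in a few lines: the distance from a point x to the hyperplane w·x + b = 0 is |w·x + b| / ||w||, and the margin is the minimum such distance over the data. This is an illustrative sketch with made-up w, b, and toy points, not part of the original slides:

```python
import math

def distance_to_hyperplane(w, b, x):
    """Distance from point x to the hyperplane w.x + b = 0: |w.x + b| / ||w||."""
    dot = sum(wi * xi for wi, xi in zip(w, x))
    return abs(dot + b) / math.sqrt(sum(wi * wi for wi in w))

def margin(w, b, points):
    """Margin of the classifier: the smallest distance from any datapoint to the boundary."""
    return min(distance_to_hyperplane(w, b, x) for x in points)

# Hypothetical hyperplane 3*x1 + 4*x2 - 5 = 0, i.e. w = (3, 4), b = -5.
w, b = (3, 4), -5
print(distance_to_hyperplane(w, b, (3, 4)))   # |9 + 16 - 5| / 5 = 4.0
print(margin(w, b, [(3, 4), (1, 1)]))         # point (1, 1): |3 + 4 - 5| / 5 = 0.4
```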
Maximize Margin: w·xᵢ + b ≥ 0 iff yᵢ = 1, and w·xᵢ + b ≤ 0 iff yᵢ = −1; together, yᵢ(w·xᵢ + b) ≥ 0. A min-max (game) problem.
Maximize Margin. Strategy: w·x + b = 0 describes the same boundary as α(w·x + b) = 0 for any α ≠ 0, so we may rescale (w, b) freely.
Maximize Margin • How does this come about? We have … Thus …
Maximum Margin Linear Classifier • How to solve it?
Learning via Quadratic Programming • QP is a well-studied class of optimization algorithms that maximize a quadratic function of some real-valued variables subject to linear constraints. • For a detailed treatment of quadratic programming, see Convex Optimization by Stephen P. Boyd (online edition, free for downloading).
Quadratic Programming: find the argument optimizing a quadratic criterion, subject to n additional linear inequality constraints and subject to e additional linear equality constraints.
Quadratic Programming for the Linear Classifier
Online Demo • Popular tools: LibSVM
Roadmap • Hard-Margin Linear Classifier: Maximize Margin, Support Vectors, Quadratic Programming • Soft-Margin Linear Classifier: Maximize Margin, Support Vectors, Quadratic Programming • Non-Linearly Separable Problems: XOR, transforming non-linear problems with kernels • Reference
Uh-oh! This is going to be a problem! What should we do?
Uh-oh! Idea 1: find minimum w·w, while minimizing the number of training set errors. Problemette: two things to minimize makes for an ill-defined optimization.
Uh-oh! Idea 1.1: minimize w·w + C·(#train errors), where C is a tradeoff parameter. There's a serious practical problem that's about to make us reject this approach. Can you guess what it is?
Uh-oh! Idea 1.1 can't be expressed as a quadratic programming problem, so solving it may be too slow. (Also, it doesn't distinguish between disastrous errors and near misses.) So… any other ideas?
Uh-oh! Idea 2.0: minimize w·w + C·(distance of error points to their correct place).
Support Vector Machine (SVM) for Noisy Data • Any problem with the above formulation?
Support Vector Machine (SVM) for Noisy Data • Balance the trade-off between margin and classification errors
Support Vector Machine for Noisy Data: how do we determine the appropriate value for C?
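The soft-margin objective (minimize ½||w||² + C · Σᵢ max(0, 1 − yᵢ(w·xᵢ + b))) can also be minimized directly by subgradient descent. This is a sketch under that hinge-loss formulation, with made-up toy data, to show how C trades margin width against training errors; it is not the QP solver the slides describe:

```python
def train_soft_margin(points, labels, C=1.0, lr=0.1, epochs=200):
    """Minimize 0.5*||w||^2 + C * sum_i max(0, 1 - y_i*(w.x_i + b))
    by full-batch subgradient descent."""
    dim = len(points[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        gw, gb = [wi for wi in w], 0.0   # gradient of 0.5*||w||^2 is w itself
        for x, y in zip(points, labels):
            if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) < 1:  # margin violated
                for j in range(dim):
                    gw[j] -= C * y * x[j]
                gb -= C * y
        w = [wi - lr * gwi for wi, gwi in zip(w, gw)]
        b -= lr * gb
    return w, b

# Hypothetical separable toy set; a larger C punishes training errors more heavily.
pts = [(2, 2), (3, 3), (-2, -2), (-3, -3)]
ys = [+1, +1, -1, -1]
w, b = train_soft_margin(pts, ys)
preds = [1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1 for x in pts]
print(preds)  # should match ys on this separable toy set
```

In practice C is chosen by cross-validation, exactly as the slide asks.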
The Dual Form of QP: maximize Σk αk − ½ Σk Σl αk αl yk yl (xk·xl), subject to the constraints 0 ≤ αk ≤ C and Σk αk yk = 0. Then define w = Σk αk yk xk, and classify with f(x, w, b) = sign(w·x − b).
An Equivalent QP: maximize the same objective subject to the same constraints; then define w = Σk αk yk xk. Datapoints with αk > 0 will be the support vectors… so this sum only needs to be over the support vectors.
Support Vectors: αi = 0 for non-support vectors, αi > 0 for support vectors. The decision boundary is determined only by the support vectors!
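In the dual form, classification uses only the support vectors: f(x) = sign(Σk αk yk (xk·x) − b). A tiny hand-worked sketch (two support vectors at ±(1, 0); the α values below are the hard-margin solution for that pair, giving w = (1, 0) and b = 0; the data are illustrative, not from the slides):

```python
def dual_classify(x, svs, alphas, labels, b):
    """f(x) = sign(sum_k alpha_k * y_k * (x_k . x) - b); the sum runs only
    over support vectors, since alpha = 0 everywhere else."""
    s = sum(a * y * sum(xi * zi for xi, zi in zip(sv, x))
            for sv, a, y in zip(svs, alphas, labels))
    return 1 if s - b > 0 else -1

svs = [(1, 0), (-1, 0)]   # the two support vectors
alphas = [0.5, 0.5]       # gives w = sum(alpha*y*x) = (1, 0) and sum(alpha*y) = 0
labels = [+1, -1]
print(dual_classify((2, 0), svs, alphas, labels, 0))   # 1
print(dual_classify((-3, 0), svs, alphas, labels, 0))  # -1
```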
The Dual Form of QP: maximize the dual objective subject to its constraints; then define w, and classify with f(x, w, b) = sign(w·x − b). How to determine b?
An Equivalent QP: Determine b. Fix w: a linear programming problem!
An Equivalent QP. Why did I tell you about this equivalent QP? • It's a formulation that QP packages can optimize more quickly • Because of further jaw-dropping developments you're about to learn. Datapoints with αk > 0 will be the support vectors… so this sum only needs to be over the support vectors. Then classify with: f(x, w, b) = sign(w·x − b)
Online Demo • Parameter C is used to control the fit in the presence of noise
Roadmap • Hard-Margin Linear Classifier (Clean Data): Maximize Margin, Support Vectors, Quadratic Programming • Soft-Margin Linear Classifier (Noisy Data): Maximize Margin, Support Vectors, Quadratic Programming • Non-Linearly Separable Problems: XOR, transforming non-linear problems with kernels • Reference
Feature Transformation? • The problem is non-linear • Find some trick to transform the input • Linearly separable after feature transformation • What features should we use? • Basic idea: the XOR problem
Suppose we're in 1 dimension. What would SVMs do with this data? x = 0
Suppose we're in 1 dimension. Not a big surprise: positive "plane", negative "plane". x = 0
Harder 1-dimensional dataset: that's wiped the smirk off SVM's face. What can be done about this? x = 0
Harder 1-dimensional dataset: map the data from the low-dimensional space to a high-dimensional space. Let's permit them here too. x = 0. Feature enumeration.
Non-linear SVMs: Feature spaces • General idea: the original input space can always be mapped to some higher-dimensional feature space where the training set is separable: Φ: x → φ(x)
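The XOR pattern is the classic case: with ±1 inputs and label y = x1·x2, no line in the plane separates the classes, but the mapping φ(x) = (x1, x2, x1·x2) adds one coordinate in which a single hyperplane works. A sketch with hypothetical weights (weight 1 on the new coordinate), not from the slides:

```python
def phi(x):
    """Map a 2-D input to 3-D feature space by appending the product feature."""
    return (x[0], x[1], x[0] * x[1])

# XOR-style data: the label is the sign of x1*x2, not linearly separable in 2-D.
data = [((1, 1), +1), ((-1, -1), +1), ((1, -1), -1), ((-1, 1), -1)]

# In feature space a single hyperplane separates them: w = (0, 0, 1), b = 0.
w, b = (0, 0, 1), 0
preds = [1 if sum(wi * zi for wi, zi in zip(w, phi(x))) + b > 0 else -1
         for x, _ in data]
print(preds)  # [1, 1, -1, -1]
```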
Online Demo • Polynomial features for the XOR problem
Online Demo • But… is it the best margin, intuitively?
Online Demo • Why not something like this?
Online Demo • Or something like this? Could we? • A more symmetric boundary
Degree of Polynomial Features: x^1, x^2, x^3, x^4, x^5, x^6
Towards Infinite Dimensions of Features • Enumerate polynomial features of all degrees? • Taylor expansion of the exponential function • zk = (radial basis functions of xk)
Online Demo • Radial basis functions for the XOR problem
Efficiency Problem in Computing Features • Feature space mapping • Example: all degree-2 monomials. Computing the mapped inner product directly costs 9 multiplications; computing it through the kernel costs 3 multiplications. This use of a kernel function to avoid carrying out Φ(x) explicitly is known as the kernel trick.
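The multiplication counts above can be checked directly: for 2-D inputs and the degree-2 monomial map φ(x) = (x1², √2·x1x2, x2²), the inner product φ(x)·φ(z) equals the kernel K(x, z) = (x·z)², so the kernel needs only the cheap 2-D dot product. An illustrative sketch:

```python
import math

def phi(x):
    """Explicit degree-2 monomial features of a 2-D input."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

def poly2_kernel(x, z):
    """K(x, z) = (x . z)^2 -- computed without ever forming phi."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2

x, z = (1, 2), (3, 4)
explicit = sum(a * b for a, b in zip(phi(x), phi(z)))
print(poly2_kernel(x, z))                         # 121
print(abs(explicit - poly2_kernel(x, z)) < 1e-9)  # True: same value, far fewer multiplications
```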
Common SVM basis functions: zk = (polynomial terms of xk of degree 1 to q); zk = (radial basis functions of xk); zk = (sigmoid functions of xk)
Online Demo • "Radial Basis Function" (Gaussian kernel) • Can solve complicated non-linear problems • γ and C control the complexity of the decision boundary
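The Gaussian kernel mentioned above is K(x, z) = exp(−γ·||x − z||²); a larger γ makes the kernel bumps narrower and the decision boundary more complex. A small sketch with made-up points:

```python
import math

def rbf_kernel(x, z, gamma=1.0):
    """Gaussian (RBF) kernel: K(x, z) = exp(-gamma * ||x - z||^2).
    Larger gamma -> narrower bumps -> more complex decision boundary."""
    sq_dist = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-gamma * sq_dist)

print(rbf_kernel((0, 0), (0, 0)))        # 1.0: identical points
print(rbf_kernel((0, 0), (3, 4)))        # exp(-25), essentially 0
print(rbf_kernel((0, 0), (3, 4), 0.01))  # exp(-0.25), about 0.78: small gamma, smooth kernel
```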
How to Control the Complexity. Which reasoning below is the most probable? "Bob got up and found that breakfast was ready." • Level 1: his child (underfitting) • Level 2: his wife (reasonable) • Level 3: an alien (overfitting)
How to Control the Complexity • SVM is powerful enough to approximate any training data • The complexity affects the performance on new data • SVM provides parameters for controlling the complexity • SVM does not tell you how to set these parameters • Determine the parameters by cross-validation. Underfitting ← complexity → Overfitting
General Conditions for Predictivity in Learning Theory • Tomaso Poggio, Ryan Rifkin, Sayan Mukherjee and Partha Niyogi. General conditions for predictivity in learning theory. Nature, Vol. 428, March 2004.
Recall the MDL Principle… • MDL stands for minimum description length • The description length is defined as: space required to describe a theory + space required to describe the theory's mistakes • In our case the theory is the classifier and the mistakes are the errors on the training data • Aim: we want a classifier with minimal DL • The MDL principle is a model selection criterion
Support Vector Machine (SVM) for Noisy Data • Balance the trade-off between margin and classification errors: the margin term describes the theory, and the error term describes the mistakes
SVM Performance • Anecdotally they work very well indeed • Example: they are currently the best-known classifier on a well-studied hand-written-character recognition benchmark • Another example: Andrew knows several reliable people doing practical real-world work who claim that SVMs have saved them when their other favorite classifiers did poorly • There is a lot of excitement and religious fervor about SVMs as of 2001 • Despite this, some practitioners are a little skeptical
References • An excellent tutorial on VC-dimension and Support Vector Machines: C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2): 121-167, 1998. http://citeseer.nj.nec.com/burges98tutorial.html • The VC/SRM/SVM bible (not for beginners, including myself): Statistical Learning Theory by Vladimir Vapnik, Wiley-Interscience, 1998 • Software: SVM-light, http://svmlight.joachims.org/; LibSVM, http://www.csie.ntu.edu.tw/~cjlin/libsvm/; SMO in Weka
Support Vector Regression
Roadmap • Squared-Loss Linear Regression: Little Noise, Large Noise • Linear-Loss Function • Support Vector Regression
Linear Regression: f(x, w, b) = w·x − b. How would you fit this data?
Linear Regression: f(x, w, b) = w·x − b. Any of these would be fine… but which is best?
Linear Regression: f(x, w, b) = w·x − b. How to define the fitting error of a linear regression? Squared loss.
Online Demo • http://www.math.csusb.edu/faculty/stanton/m262/regress/regress.html
Sensitive to Outliers
Why? • Squared-loss function • The fitting error grows quadratically
How about Linear Loss? • Linear-loss function • The fitting error grows linearly
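The difference in growth rates is easy to see on a single residual: a squared loss lets one outlier dominate the fit, while a linear loss keeps its influence proportional. An illustrative comparison with made-up residuals:

```python
def squared_loss(r):
    """Squared loss of a residual r: grows quadratically."""
    return r * r

def linear_loss(r):
    """Linear (absolute) loss of a residual r: grows linearly."""
    return abs(r)

# A typical residual vs. an outlier residual (made-up numbers):
for r in (0.5, 10.0):
    print(r, squared_loss(r), linear_loss(r))
# residual 0.5 -> squared 0.25, linear 0.5
# residual 10  -> squared 100 (dominates the fit), linear 10
```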
Actually • SVR uses the loss function below: the ε-insensitive loss function, which is zero on the interval [−ε, ε]
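The ε-insensitive loss is flat inside a tube of width ε around the prediction and grows linearly outside it: loss(r) = max(0, |r| − ε). A minimal sketch with an assumed ε = 0.5:

```python
def eps_insensitive_loss(r, eps=0.5):
    """Zero inside the [-eps, eps] tube, then grows linearly: max(0, |r| - eps)."""
    return max(0.0, abs(r) - eps)

print(eps_insensitive_loss(0.3))   # 0.0: inside the tube, no penalty
print(eps_insensitive_loss(2.0))   # 1.5: linear beyond the tube
print(eps_insensitive_loss(-2.0))  # 1.5: symmetric in the residual
```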
Epsilon Support Vector Regression (ε-SVR) • Given a data set {x1, …, xn} with target values {u1, …, un}, we want to do ε-SVR • The optimization problem is … • As with SVM, this can be solved as a quadratic programming problem
Online Demo • Less sensitive to outliers
Again, Extend to the Non-Linear Case • Similar to SVM
What We Learned • Linear classifiers for clean data • Linear classifiers for noisy data • SVM for noisy and non-linear data • Linear regression for clean data • Linear regression for noisy data • SVR for noisy and non-linear data • General conditions for predictivity in learning theory
The End
Saddle Point