Support Vector Machine (SVM)
Based on Nello Cristianini's presentation: http://www.support-vector.net/tutorial.html
Basic Idea
• Use a Linear Learning Machine (LLM).
• Overcome the linearity constraint:
  ◦ Map the data non-linearly to a higher dimension.
• Select between hyperplanes:
  ◦ Use the margin as the criterion.
• Generalization depends on the margin.
General idea
[Figure: Original Problem → Transformed Problem]
Kernel Based Algorithms
• Two separate components:
  ◦ Learning algorithm: operates in an embedded space.
  ◦ Kernel function: performs the embedding.
Basic Example: Kernel Perceptron
• Hyperplane classification:
  ◦ f(x) = <w, x> + b = <w', x'>   (w' = (w, b), x' = (x, 1) absorb the bias)
  ◦ h(x) = sign(f(x))
• Perceptron algorithm:
  ◦ Sample: (xi, ti), ti ∈ {-1, +1}
  ◦ IF ti <wk, xi> ≤ 0 THEN /* error */
  ◦   wk+1 = wk + ti xi
  ◦   k = k + 1
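A minimal sketch of this primal perceptron in NumPy (illustrative, not from the slides; the augmented vector x' = (x, 1) absorbs the bias b into w'):

import numpy as np

def perceptron(X, t, epochs=100):
    """Primal perceptron. X: (m, d) array; t: (m,) array of +/-1 labels."""
    Xa = np.hstack([X, np.ones((len(X), 1))])   # augmented inputs x' = (x, 1)
    w = np.zeros(Xa.shape[1])                   # w' = (w, b), initially zero
    for _ in range(epochs):
        mistakes = 0
        for xi, ti in zip(Xa, t):
            if ti * (w @ xi) <= 0:              # mistake (<= 0, so the zero start updates)
                w += ti * xi                    # update rule: w_{k+1} = w_k + t_i x_i
                mistakes += 1
        if mistakes == 0:                       # converged: the sample is separated
            break
    return w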
Recall
• Margin of a hyperplane w:
  ◦ γ = min_i ti <w, xi> / ||w||
• Mistake bound (Novikoff):
  ◦ the perceptron makes at most (R/γ)² mistakes, where R = max_i ||xi||
Observations
• The solution is a linear combination of the inputs:
  ◦ w = Σi αi ti xi, with αi ≥ 0
• Mistake driven:
  ◦ only points on which we make a mistake influence w!
• Support vectors:
  ◦ the points with non-zero αi
Dual representation
• Rewrite the basic function:
  ◦ f(x) = <w, x> + b = Σi αi ti <xi, x> + b
  ◦ w = Σi αi ti xi
• Change the update rule:
  ◦ IF tj (Σi αi ti <xi, xj> + b) ≤ 0
  ◦ THEN αj = αj + 1
• Observation:
  ◦ the data appears only inside inner products!
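The same learner in dual form, again as a sketch; swapping the default inner product for a non-linear kernel turns it into the kernel perceptron. The bias update b += t[j] is a common convention not shown on the slide:

import numpy as np

def dual_perceptron(X, t, kernel=np.dot, epochs=100):
    """Dual perceptron: alpha[j] counts the mistakes made on example j. t: array of +/-1."""
    m = len(X)
    alpha, b = np.zeros(m), 0.0
    K = np.array([[kernel(X[i], X[j]) for j in range(m)] for i in range(m)])
    for _ in range(epochs):
        for j in range(m):
            # f(x_j) = sum_i alpha_i t_i K(x_i, x_j) + b
            if t[j] * ((alpha * t) @ K[:, j] + b) <= 0:
                alpha[j] += 1                   # update: alpha_j = alpha_j + 1
                b += t[j]
    return alpha, b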
Limitations of the Perceptron
• Only linear separations
• Converges only for linearly separable data
• Defined only on vectorial data
The idea of a Kernel
• Embed the data in a different space,
• possibly of higher dimension,
• where it is linearly separable.
[Figure: Original Problem → Transformed Problem]
Kernel Mapping
• We only need to compute inner products.
• Mapping: M(x)
• Kernel: K(x, z) = <M(x), M(z)>
• Dimensionality of M(x): unimportant!
  ◦ we only ever compute K(x, z), never M(x) itself
• Using it in the embedded space:
  ◦ replace <x, z> by K(x, z)
Example
• x = (x1, x2); z = (z1, z2)
• K(x, z) = (<x, z>)²
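Expanding the square exposes the implicit embedding M (a standard derivation, not spelled out on the slide):

  (<x, z>)² = (x1 z1 + x2 z2)²
            = x1² z1² + 2 x1 x2 z1 z2 + x2² z2²
            = <M(x), M(z)>,   with M(x) = (x1², √2 x1 x2, x2²)

So the squared 2-D inner product equals an ordinary inner product in a 3-D feature space, computed without ever forming M(x).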
Polynomial Kernel
[Figure: Original Problem → Transformed Problem]
Kernel Matrix
• The Gram matrix of the sample: Kij = K(xi, xj)
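A sketch of building this matrix for a sample, assuming some kernel function such as those on the next slide (for a valid kernel the result is symmetric and positive semi-definite):

import numpy as np

def gram_matrix(X, kernel):
    """K[i, j] = kernel(x_i, x_j) over all pairs of sample points."""
    m = len(X)
    K = np.empty((m, m))
    for i in range(m):
        for j in range(i, m):                  # fill the upper triangle, mirror by symmetry
            K[i, j] = K[j, i] = kernel(X[i], X[j])
    return K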
Examples of Basic Kernels
• Polynomial:
  ◦ K(x, z) = (<x, z>)^d
• Gaussian:
  ◦ K(x, z) = exp{-||x - z||² / 2σ²}
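Both kernels as small NumPy functions (d and sigma are free parameters of this sketch):

import numpy as np

def polynomial_kernel(x, z, d=2):
    return np.dot(x, z) ** d                             # K(x, z) = (<x, z>)^d

def gaussian_kernel(x, z, sigma=1.0):
    diff = x - z
    return np.exp(-np.dot(diff, diff) / (2 * sigma**2))  # exp{-||x-z||^2 / 2 sigma^2}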
Kernel: Closure Properties
• K(x, z) = K1(x, z) + c   (c ≥ 0)
• K(x, z) = c · K1(x, z)   (c ≥ 0)
• K(x, z) = K1(x, z) · K2(x, z)
• K(x, z) = K1(x, z) + K2(x, z)
• Create new kernels using basic ones!
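For example, these rules justify composing the kernels sketched above into a new valid kernel (mixed_kernel is an illustrative name, reusing the functions from the previous sketch):

def mixed_kernel(x, z):
    # a scaled kernel plus another kernel is again a kernel
    return 2.0 * gaussian_kernel(x, z, sigma=0.5) + polynomial_kernel(x, z, d=3)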
Support Vector Machines
• Linear Learning Machines (LLM)
• Use the dual representation
• Work in the kernel-induced feature space:
  ◦ f(x) = Σi αi ti K(xi, x) + b
• Which hyperplane should we select?
Generalization of SVM
• PAC theory:
  ◦ error = O(VC-dim / m)
  ◦ Problem: VC-dim >> m
  ◦ No preference between consistent hyperplanes
Margin based bounds
• H: basic hypothesis class
• conv(H): finite convex combinations of H
• D: a distribution over X × {+1, -1}
• S: a sample of size m drawn from D
Margin based bounds
• THEOREM: for every f in conv(H), with high probability the error of f on D is at most the fraction of S with margin below θ (a margin threshold θ > 0), plus a complexity term that shrinks as m and θ grow and does not depend on the dimension of the feature space.
Maximal Margin Classifier
• Maximizes the margin
• Minimizes overfitting, by the margin-based bound above
• Increases the margin
  ◦ rather than reducing the dimensionality
SVM: Support Vectors
Margins
• Functional margin: min_i ti f(xi)
• Geometric margin: min_i ti f(xi) / ||w||
Main trick in SVM
• Insist on a functional margin of at least 1:
  ◦ support vectors have functional margin exactly 1
• Geometric margin = 1 / ||w||
• Proof: for the closest points ti f(xi) = 1, so min_i ti f(xi) / ||w|| = 1 / ||w||.
SVM criteria
• Find a hyperplane (w, b)
• that minimizes: ||w||² = <w, w>
  ◦ (minimizing ||w|| maximizes the geometric margin 1 / ||w||)
• subject to, for all i:
  ◦ ti (<w, xi> + b) ≥ 1
Quadratic Programming
• Quadratic objective function
• Linear constraints
• Unique optimum (convex problem)
• Polynomial-time algorithms
Dual Problem
• Maximize:
  ◦ W(α) = Σi αi - 1/2 Σi,j αi ti αj tj K(xi, xj)
• Subject to:
  ◦ Σi αi ti = 0
  ◦ αi ≥ 0
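A hedged sketch of solving this dual with SciPy's general-purpose SLSQP optimizer; real SVM implementations use specialized QP solvers, and this assumes hard-margin (linearly separable) data. X and t are NumPy arrays; function names are illustrative:

import numpy as np
from scipy.optimize import minimize

def fit_dual_svm(X, t, kernel):
    """Maximize W(alpha) subject to sum_i alpha_i t_i = 0 and alpha_i >= 0."""
    m = len(X)
    K = np.array([[kernel(X[i], X[j]) for j in range(m)] for i in range(m)])
    def neg_W(a):                                # SciPy minimizes, so negate W
        return -(a.sum() - 0.5 * (a * t) @ K @ (a * t))
    res = minimize(neg_W, np.zeros(m),
                   bounds=[(0, None)] * m,                              # alpha_i >= 0
                   constraints={'type': 'eq', 'fun': lambda a: a @ t})  # sum_i alpha_i t_i = 0
    alpha = res.x
    s = int(np.argmax(alpha))                    # a support vector: t_s f(x_s) = 1
    b = t[s] - (alpha * t) @ K[:, s]
    return alpha, b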
Applications: Text
• Classify a text into given categories:
  ◦ sports, news, business, science, …
• Feature space:
  ◦ bag of words
  ◦ a huge, sparse vector!
Applications: Text
• Practicalities (tf-idf weighting):
  ◦ Mw(x) = tfw · log(idfw) / K
  ◦ tfw = term frequency of w
  ◦ idfw = inverse document frequency of w:
  ◦ idfw = # documents / # documents containing w
  ◦ K = a normalization constant
• Inner product <M(x), M(z)>:
  ◦ efficient on sparse vectors
• SVM finds a hyperplane in "document space"
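A small sketch of this tf-idf embedding and a sparse inner product (illustrative names; corpus is a list of documents, each a set or list of words, and norm plays the role of the constant K):

import math
from collections import Counter

def tfidf_embed(doc_words, corpus, norm=1.0):
    """Sparse tf-idf vector: M_w(x) = tf_w * log(idf_w) / K."""
    n_docs = len(corpus)
    vec = {}
    for w, tf_w in Counter(doc_words).items():
        n_with_w = sum(1 for d in corpus if w in d)
        if n_with_w > 0:
            idf_w = n_docs / n_with_w            # idf_w = #documents / #documents with w
            vec[w] = tf_w * math.log(idf_w) / norm
    return vec

def sparse_dot(u, v):
    """<M(x), M(z)>: iterate only over the non-zero entries."""
    if len(u) > len(v):
        u, v = v, u
    return sum(val * v.get(w, 0.0) for w, val in u.items())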