Lecture 1: Introduction to Machine Learning. Isabelle Guyon (isabelle@clopinet.com)
What is Machine Learning? [Diagram: TRAINING DATA is fed to a Learning algorithm, which produces a Trained machine; the trained machine maps a Query to an Answer.]
What for?
• Classification
• Time series prediction
• Regression
• Clustering
Applications [Chart: application areas placed by number of training examples (10 to 10^5) versus number of inputs (10 to 10^5): Market Analysis, Ecology, System diagnosis, Machine Vision, Text Categorization, OCR / HWR, Bioinformatics.]
Banking / Telecom / Retail
• Identify:
– Prospective customers
– Dissatisfied customers
– Good customers
– Bad payers
• Obtain:
– More effective advertising
– Less credit risk
– Less fraud
– Decreased churn rate
Biomedical / Biometrics
• Medicine:
– Screening
– Diagnosis and prognosis
– Drug discovery
• Security:
– Face recognition
– Signature / fingerprint / iris verification
– DNA fingerprinting
Computer / Internet
• Computer interfaces:
– Troubleshooting wizards
– Handwriting and speech
– Brain waves
• Internet:
– Hit ranking
– Spam filtering
– Text categorization
– Text translation
– Recommendation
Conventions [Notation diagram: $X = \{x_{ij}\}$ is the $m \times n$ data matrix; $\mathbf{x}_i$ denotes a pattern (row), $\mathbf{w}$ a weight vector over the $n$ features, $\alpha$ a coefficient vector over the $m$ patterns, and $y = \{y_j\}$ the target values.]
Learning problem
Data matrix X:
m lines = patterns (data points, examples): samples, patients, documents, images, …
n columns = features (attributes, input variables): genes, proteins, words, pixels, …
Unsupervised learning: is there structure in the data?
Supervised learning: predict an outcome y.
Example data: colon cancer, Alon et al 1999.
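As a minimal sketch of this convention (the array values below are made up for illustration), the data matrix and outcome vector could be laid out as follows:

```python
import numpy as np

# Data matrix X: m patterns (rows) x n features (columns).
# Values are illustrative only (e.g. expression levels of 3 genes).
X = np.array([[0.2, 1.5, 3.1],
              [0.1, 1.7, 2.9],
              [2.3, 0.4, 0.8],
              [2.1, 0.6, 1.0]])
m, n = X.shape          # m = 4 patterns, n = 3 features

# Supervised learning: an outcome y is given for each pattern
# (e.g. +1 = tumor tissue, -1 = normal tissue).
y = np.array([+1, +1, -1, -1])

# Unsupervised learning: only X is available; we look for structure,
# e.g. by clustering the rows of X.
```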
Some Learning Machines
• Linear models
• Kernel methods
• Neural networks
• Decision trees
Linear Models
• $f(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x} + b = \sum_{j=1}^{n} w_j x_j + b$
Linearity in the parameters, NOT in the input components.
• $f(\mathbf{x}) = \mathbf{w} \cdot \Phi(\mathbf{x}) + b = \sum_j w_j \phi_j(\mathbf{x}) + b$ (Perceptron)
• $f(\mathbf{x}) = \sum_{i=1}^{m} \alpha_i\, k(\mathbf{x}_i, \mathbf{x}) + b$ (Kernel method)
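A small sketch of the three forms above, assuming NumPy and illustrative (untrained) weights, basis functions, and coefficients:

```python
import numpy as np

def f_linear(x, w, b):
    """Linear in the input components: f(x) = w . x + b."""
    return np.dot(w, x) + b

def f_features(x, w, b, phis):
    """Linear in the parameters w: f(x) = sum_j w_j phi_j(x) + b."""
    return np.dot(w, [phi(x) for phi in phis]) + b

def f_kernel(x, alphas, b, X_train, kernel):
    """Kernel method: f(x) = sum_i alpha_i k(x_i, x) + b."""
    return sum(a * kernel(xi, x) for a, xi in zip(alphas, X_train)) + b

# Illustrative use with arbitrary parameters:
x = np.array([1.0, 2.0])
print(f_linear(x, w=np.array([0.5, -0.3]), b=0.1))
print(f_features(x, w=np.array([1.0, 2.0]), b=0.0,
                 phis=[lambda v: v[0] * v[1], lambda v: v[0] ** 2]))
print(f_kernel(x, alphas=np.array([0.7, -0.2]), b=0.0,
               X_train=np.array([[0.0, 1.0], [1.0, 1.0]]),
               kernel=lambda s, t: np.dot(s, t)))
```

In all three cases the model is linear in its parameters ($\mathbf{w}$ or $\alpha$), even when the resulting decision boundary in input space is non-linear.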
Artificial Neurons (McCulloch and Pitts, 1943) [Diagram: inputs $x_1, \dots, x_n$ (activations of other neurons) arrive through synapses with weights $w_1, \dots, w_n$ on the dendrites, are summed with the bias $b$ into the cell potential, and pass through an activation function along the axon.] $f(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x} + b$
Linear Decision Boundary [Figure: a separating hyperplane shown in the $(x_1, x_2)$ plane and in $(x_1, x_2, x_3)$ space.]
Perceptron (Rosenblatt, 1957) [Diagram: inputs $x_1, \dots, x_n$ are mapped to basis functions $\phi_1(\mathbf{x}), \dots, \phi_N(\mathbf{x})$, weighted by $w_1, \dots, w_N$ plus a bias $b$, and summed.] $f(\mathbf{x}) = \mathbf{w} \cdot \Phi(\mathbf{x}) + b$
Non-Linear Decision Boundary [Figure: a non-linear decision boundary shown in the $(x_1, x_2)$ plane and in $(x_1, x_2, x_3)$ space.]
Kernel Method (Potential functions, Aizerman et al 1964) [Diagram: an input $\mathbf{x} = (x_1, \dots, x_n)$ is compared to the training points through kernel units $k(\mathbf{x}_1, \mathbf{x}), \dots, k(\mathbf{x}_m, \mathbf{x})$, weighted by $\alpha_1, \dots, \alpha_m$ plus a bias $b$, and summed.] $f(\mathbf{x}) = \sum_i \alpha_i\, k(\mathbf{x}_i, \mathbf{x}) + b$; $k(\cdot, \cdot)$ is a similarity measure or “kernel”.
What is a Kernel?
A kernel is:
• a similarity measure
• a dot product in some feature space: $k(s, t) = \Phi(s) \cdot \Phi(t)$
But we do not need to know the $\Phi$ representation.
Examples:
• $k(s, t) = \exp(-\|s - t\|^2 / \sigma^2)$ Gaussian kernel
• $k(s, t) = (s \cdot t)^q$ Polynomial kernel
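A minimal sketch of the two example kernels, with $\sigma$ and $q$ treated as hyper-parameters chosen here arbitrarily:

```python
import numpy as np

def gaussian_kernel(s, t, sigma=1.0):
    """k(s, t) = exp(-||s - t||^2 / sigma^2)."""
    s, t = np.asarray(s), np.asarray(t)
    return np.exp(-np.sum((s - t) ** 2) / sigma ** 2)

def polynomial_kernel(s, t, q=3):
    """k(s, t) = (s . t)^q."""
    return np.dot(s, t) ** q

s, t = np.array([1.0, 0.0]), np.array([0.5, 0.5])
print(gaussian_kernel(s, t, sigma=1.0))
print(polynomial_kernel(s, t, q=3))
```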
Hebb’s Rule: $w_j \leftarrow w_j + y_i x_{ij}$ [Diagram: the activation $x_j$ of another neuron arrives through a synapse of weight $w_j$ on the dendrite; the output $y$ leaves along the axon.] Link to “Naïve Bayes”.
Kernel “Trick” (for Hebb’s rule)
• Hebb’s rule for the Perceptron: $\mathbf{w} = \sum_i y_i \Phi(\mathbf{x}_i)$, so $f(\mathbf{x}) = \mathbf{w} \cdot \Phi(\mathbf{x}) = \sum_i y_i\, \Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{x})$
• Define a dot product: $k(\mathbf{x}_i, \mathbf{x}) = \Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{x})$, so $f(\mathbf{x}) = \sum_i y_i\, k(\mathbf{x}_i, \mathbf{x})$
Kernel “Trick” (general) Dual forms:
• $f(\mathbf{x}) = \sum_i \alpha_i\, k(\mathbf{x}_i, \mathbf{x})$ with $k(\mathbf{x}_i, \mathbf{x}) = \Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{x})$
• $f(\mathbf{x}) = \mathbf{w} \cdot \Phi(\mathbf{x})$ with $\mathbf{w} = \sum_i \alpha_i \Phi(\mathbf{x}_i)$
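A small numerical check of the two dual forms, under the simplifying assumption that $\Phi$ is the identity map (so $k$ is the plain dot product) and with random illustrative data:

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(size=(5, 3))      # 5 training points x_i
alphas = rng.normal(size=5)            # dual coefficients alpha_i
x = rng.normal(size=3)                 # a query point

phi = lambda v: v                      # identity feature map (illustrative choice)
k = lambda s, t: np.dot(phi(s), phi(t))

# Primal form: w = sum_i alpha_i Phi(x_i), then f(x) = w . Phi(x)
w = sum(a * phi(xi) for a, xi in zip(alphas, X_train))
f_primal = np.dot(w, phi(x))

# Dual form: f(x) = sum_i alpha_i k(x_i, x)
f_dual = sum(a * k(xi, x) for a, xi in zip(alphas, X_train))

assert np.isclose(f_primal, f_dual)    # both forms give the same value
```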
Simple Kernel Methods
Primal form: $f(\mathbf{x}) = \mathbf{w} \cdot \Phi(\mathbf{x})$ with $\mathbf{w} = \sum_i \alpha_i \Phi(\mathbf{x}_i)$. Dual form: $f(\mathbf{x}) = \sum_i \alpha_i\, k(\mathbf{x}_i, \mathbf{x})$ with $k(\mathbf{x}_i, \mathbf{x}) = \Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{x})$.
• Perceptron algorithm (Rosenblatt 1958): $\mathbf{w} \leftarrow \mathbf{w} + y_i \Phi(\mathbf{x}_i)$ if $y_i f(\mathbf{x}_i) < 0$. Dual: Potential Function algorithm (Aizerman et al 1964): $\alpha_i \leftarrow \alpha_i + y_i$ if $y_i f(\mathbf{x}_i) < 0$.
• Minover (optimum margin) (Krauth-Mézard 1987): $\mathbf{w} \leftarrow \mathbf{w} + y_i \Phi(\mathbf{x}_i)$ for $\min y_i f(\mathbf{x}_i)$. Dual minover: $\alpha_i \leftarrow \alpha_i + y_i$ for $\min y_i f(\mathbf{x}_i)$ (ancestor of SVM 1992, similar to kernel Adatron, 1998, and SMO, 1999).
• LMS regression: $\mathbf{w} \leftarrow \mathbf{w} + (y_i - f(\mathbf{x}_i))\, \Phi(\mathbf{x}_i)$. Dual LMS: $\alpha_i \leftarrow \alpha_i + (y_i - f(\mathbf{x}_i))$.
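A sketch of the dual perceptron / potential function update from the list above (an illustrative implementation, not the historical code): cycle over the training set and add $y_i$ to $\alpha_i$ whenever example $i$ is misclassified. The bias update and the use of $\le 0$ rather than $< 0$ are choices made here for simplicity.

```python
import numpy as np

def dual_perceptron(X, y, k, epochs=10):
    """Dual perceptron sketch: alpha_i <- alpha_i + y_i whenever
    y_i * f(x_i) <= 0, with f(x) = sum_j alpha_j k(x_j, x) + b."""
    m = len(X)
    alphas, b = np.zeros(m), 0.0
    for _ in range(epochs):
        for i in range(m):
            f_xi = sum(alphas[j] * k(X[j], X[i]) for j in range(m)) + b
            if y[i] * f_xi <= 0:          # misclassified (or on the boundary)
                alphas[i] += y[i]
                b += y[i]
    return alphas, b

# Illustrative linearly separable data, with the dot product as kernel:
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -0.5], [-2.0, -1.5]])
y = np.array([+1, +1, -1, -1])
alphas, b = dual_perceptron(X, y, k=lambda s, t: np.dot(s, t))
print(alphas, b)
```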
Multi-Layer Perceptron (Back-propagation, Rumelhart et al, 1986) [Diagram: inputs $x_j$ feed a layer of summing units, the internal “latent” variables or “hidden units”, whose outputs feed a final summing unit.]
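A sketch of the forward pass through one hidden layer, with random weights just to show the shape of the computation (back-propagation, not shown here, would adjust them):

```python
import numpy as np

def mlp_forward(x, W1, b1, w2, b2):
    """One-hidden-layer perceptron: the hidden units h are the internal
    'latent' variables; tanh is one common choice of activation."""
    h = np.tanh(W1 @ x + b1)       # hidden-unit activations
    return np.dot(w2, h) + b2      # linear output unit

rng = np.random.default_rng(0)
x = rng.normal(size=4)                          # input with 4 components
W1, b1 = rng.normal(size=(3, 4)), np.zeros(3)   # 3 hidden units
w2, b2 = rng.normal(size=3), 0.0
print(mlp_forward(x, W1, b1, w2, b2))
```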
Chessboard Problem
Tree Classifiers CART (Breiman, 1984) or C4.5 (Quinlan, 1993) [Diagram: starting from all the data, split on $x_2$, then on $x_1$, recursively partitioning the $(x_1, x_2)$ plane.] At each step, choose the feature that “reduces entropy” most. Work towards “node purity”.
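A sketch of the split criterion: the entropy of the class labels and the reduction (information gain) obtained by one candidate split; the labels and threshold below are illustrative.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a label vector, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(x_feature, labels, threshold):
    """Entropy reduction achieved by splitting on x_feature <= threshold."""
    left = labels[x_feature <= threshold]
    right = labels[x_feature > threshold]
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - weighted

y = np.array([0, 0, 1, 1, 1, 0])
x2 = np.array([0.1, 0.3, 0.8, 0.9, 0.7, 0.2])
print(information_gain(x2, y, threshold=0.5))   # a pure split: gain = 1 bit
```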
Iris Data (Fisher, 1936) [Figure from Norbert Jankowski and Krzysztof Grabczewski: decision boundaries of a linear discriminant, a tree classifier, a Gaussian mixture, and a kernel method (SVM) on the three classes setosa, versicolor, and virginica.]
Fit / Robustness Tradeoff [Figure: decision boundaries of different complexity in the $(x_1, x_2)$ plane, illustrating the tradeoff between fitting the training data and remaining robust.]
Performance evaluation [Figure: data in the $(x_1, x_2)$ plane with the decision boundary $f(\mathbf{x}) = 0$ separating the region $f(\mathbf{x}) < 0$ from $f(\mathbf{x}) > 0$.]
Performance evaluation [Figure: the same data with the threshold shifted to $f(\mathbf{x}) = -1$, separating $f(\mathbf{x}) < -1$ from $f(\mathbf{x}) > -1$.]
Performance evaluation [Figure: the same data with the threshold shifted to $f(\mathbf{x}) = 1$, separating $f(\mathbf{x}) < 1$ from $f(\mathbf{x}) > 1$.]
ROC Curve For a given threshold on $f(\mathbf{x})$, you get a point on the ROC curve. [Figure: ROC plot of the positive class success rate (hit rate, sensitivity) versus 1 - the negative class success rate (false alarm rate, 1 - specificity), from 0 to 100%, showing the ideal ROC curve, the actual ROC curve, and the random ROC curve (the diagonal).]
ROC Curve For a given threshold on $f(\mathbf{x})$, you get a point on the ROC curve. [Figure: the same ROC plot annotated with the area under the curve: ideal ROC curve (AUC = 1), random ROC curve (AUC = 0.5); in general $0 \le \mathrm{AUC} \le 1$.]
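A sketch of how one ROC point and the AUC could be computed from decision values $f(\mathbf{x})$; the scores and labels are made up, and the AUC is computed here by rank comparison of positive against negative scores.

```python
import numpy as np

def roc_point(scores, labels, threshold):
    """One point on the ROC curve for a given threshold on f(x)."""
    pred = scores > threshold
    hit_rate = np.mean(pred[labels == 1])        # sensitivity
    false_alarm = np.mean(pred[labels == 0])     # 1 - specificity
    return false_alarm, hit_rate

def auc(scores, labels):
    """AUC = probability that a positive example scores above a negative one."""
    pos, neg = scores[labels == 1], scores[labels == 0]
    wins = pos[:, None] > neg[None, :]
    ties = pos[:, None] == neg[None, :]
    return (wins + 0.5 * ties).mean()

scores = np.array([0.9, 0.8, 0.35, 0.6, 0.2, 0.1])
labels = np.array([1, 1, 1, 0, 0, 0])
print(roc_point(scores, labels, threshold=0.5))  # (false alarm rate, hit rate)
print(auc(scores, labels))                       # between 0 and 1
```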
What is a Risk Functional? A function of the parameters of the learning machine, assessing how much it is expected to fail on a given task. Examples:
• Classification:
– Error rate: $\frac{1}{m} \sum_{i=1}^{m} \mathbf{1}\big(F(\mathbf{x}_i) \neq y_i\big)$
– 1 - AUC
• Regression:
– Mean square error: $\frac{1}{m} \sum_{i=1}^{m} \big(f(\mathbf{x}_i) - y_i\big)^2$
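Both example risks are averages over the $m$ training examples; a minimal sketch with illustrative predictions and targets ($F$ is the thresholded classifier, $f$ the real-valued output):

```python
import numpy as np

def error_rate(F_pred, y):
    """Classification risk: fraction of examples where F(x_i) != y_i."""
    return np.mean(F_pred != y)

def mean_square_error(f_pred, y):
    """Regression risk: (1/m) * sum_i (f(x_i) - y_i)^2."""
    return np.mean((f_pred - y) ** 2)

y_class = np.array([+1, -1, +1, +1])
print(error_rate(np.array([+1, +1, +1, -1]), y_class))       # 0.5

y_reg = np.array([1.0, 2.0, 3.0])
print(mean_square_error(np.array([1.1, 1.8, 3.3]), y_reg))   # ~0.047
```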
How to train?
• Define a risk functional $R[f(\mathbf{x}, \mathbf{w})]$
• Optimize it w.r.t. $\mathbf{w}$ (gradient descent, mathematical programming, simulated annealing, genetic algorithms, etc.)
[Figure: $R[f(\mathbf{x}, \mathbf{w})]$ plotted over the parameter space ($\mathbf{w}$), with the optimum at $\mathbf{w}^*$.]
(… to be continued in the next lecture)
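A sketch of the simplest case of the second bullet: plain gradient descent on the mean-square-error risk of a linear model $f(\mathbf{x}, \mathbf{w}) = \mathbf{w} \cdot \mathbf{x} + b$, with an illustrative learning rate and synthetic data.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                    # 50 patterns, 3 features
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=50)      # noisy linear outcome

w, b, lr = np.zeros(3), 0.0, 0.05               # start somewhere in parameter space
for _ in range(200):
    f = X @ w + b
    grad_w = 2 * X.T @ (f - y) / len(y)         # gradient of R[f(x, w)] w.r.t. w
    grad_b = 2 * np.mean(f - y)
    w, b = w - lr * grad_w, b - lr * grad_b     # step towards the minimum w*

print(w, b)   # should approach w_true and 0
```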
Summary
• With linear threshold units (“neurons”) we can build:
– Linear discriminants (including Naïve Bayes)
– Kernel methods
– Neural networks
– Decision trees
• The architectural hyper-parameters may include:
– The choice of basis functions $\phi$ (features)
– The kernel
– The number of units
• Learning means fitting:
– Parameters (weights)
– Hyper-parameters
• Be aware of the fit vs. robustness tradeoff
Want to Learn More?
• Pattern Classification, R. Duda, P. Hart, and D. Stork. Standard pattern recognition textbook. Limited to classification problems. Matlab code. http://rii.ricoh.com/~stork/DHS.html
• The Elements of Statistical Learning: Data Mining, Inference, and Prediction, T. Hastie, R. Tibshirani, J. Friedman. Standard statistics textbook. Includes all the standard machine learning methods for classification, regression, clustering. R code. http://www-statclass.stanford.edu/~tibs/ElemStatLearn/
• Linear Discriminants and Support Vector Machines, I. Guyon and D. Stork. In Smola et al., Eds., Advances in Large Margin Classifiers, pages 147-169, MIT Press, 2000. http://clopinet.com/isabelle/Papers/guyon_stork_nips98.ps.gz
• Feature Extraction: Foundations and Applications, I. Guyon et al., Eds. Book for practitioners with the datasets of the NIPS 2003 challenge, tutorials, best performing methods, Matlab code, teaching material. http://clopinet.com/fextract-book
• A Practical Guide to Model Selection, I. Guyon. In: Proceedings of the Machine Learning Summer School (2009).
Challenges in Machine Learning (collection published by Microtome) http://www.mtome.com/Publications/CiML/ciml.html