Perceptrons
Slides adapted from Yaser Abu-Mostafa and Mohamed Batouche

Roadmap
• Covered two models
  – K-NN Classification: all features matter
  – Decision Tree Classification: some features matter
• Today… features matter to varying degrees
• Linear methods (lines, planes, and hyperplanes)
  – Benefits
    • Simple algorithms
    • (Human) interpretable model parameters
  – Drawbacks
    • Most real-world problems are not linear

Creditworthiness
• Banks would like to decide whether or not to extend credit to new customers
  – Good customers pay back loans
  – Bad customers default
• Problem: predict creditworthiness based on:
  – Salary, years in residence, current debt, age, etc.

Perceptron
• For a customer with attributes x = (x_1, …, x_d):
  – Approve credit if Σ_{i=1}^d w_i x_i > threshold
  – Deny credit if Σ_{i=1}^d w_i x_i < threshold
• This linear formula can be written as:
  h(x) = sign( (Σ_{i=1}^d w_i x_i) − threshold )

Perceptron (cont’d)
• Introduce an artificial coordinate x_0 = 1, with weight w_0 = −threshold
  – Called the bias term
• So, the perceptron implements:
  h(x) = sign( Σ_{i=0}^d w_i x_i ) = sign(wᵀx)
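A minimal sketch of this hypothesis in code (NumPy assumed); the customer attributes and weights below are made up for illustration:

```python
# Perceptron hypothesis h(x) = sign(w'x), assuming each x already
# includes the artificial bias coordinate x_0 = 1.
import numpy as np

def perceptron_predict(w, X):
    """Return +1/-1 predictions for each row of X (first column is x_0 = 1)."""
    return np.sign(X @ w)

# Hypothetical example: two customers described by (x_0, salary, debt).
X = np.array([[1.0, 50.0, 10.0],
              [1.0, 20.0, 40.0]])
w = np.array([-1.0, 0.1, -0.2])   # illustrative weights, not learned
print(perceptron_predict(w, X))   # [ 1. -1.]  -> approve, deny
```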

Perceptron (Geometrically)
[Figure: the line w_1x_1 + w_2x_2 + w_0 = 0 in the (x_1, x_2) plane, and the plane w_1x_1 + w_2x_2 + w_3x_3 + w_0 = 0 in (x_1, x_2, x_3) space]
• Weights are the coefficients of the hyperplane
• wᵀx is proportional to the (signed) distance of point x from the hyperplane defined by w
  – Example: http://mathinsight.org/distance_point_plane
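A small sketch of that distance computation (the example numbers are made up):

```python
# Signed distance from a point to the hyperplane w_0 + w_1x_1 + ... + w_dx_d = 0.
import numpy as np

def signed_distance(w, x):
    """w = (w_0, w_1, ..., w_d); x is the point without the bias coordinate."""
    return (w[0] + w[1:] @ x) / np.linalg.norm(w[1:])

# Distance from (0, 3) to the line 2*x_2 - 4 = 0 (i.e. x_2 = 2) is 1.
print(signed_distance(np.array([-4.0, 0.0, 2.0]), np.array([0.0, 3.0])))  # 1.0
```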

Perceptron Learning Algorithm (PLA)
• Pick a misclassified training example (x_n, y_n), i.e. sign(wᵀx_n) ≠ y_n, and update: w ← w + y_n x_n
• Example: https://www.youtube.com/watch?v=vGwemZhPlsA
• This +1/−1 label trick is useful. We’ll see it again.

PLA Summary
• One iteration of PLA updates the weights, w, for a single misclassified point
• Done when there are no misclassified points
• Guaranteed to converge if the data is linearly separable
  – What if the data is not linearly separable?
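A minimal PLA sketch under the conventions above (bias column included in X, labels in {+1, −1}); the iteration cap and the random choice among misclassified points are implementation assumptions:

```python
# Perceptron Learning Algorithm: repeatedly fix one misclassified point.
import numpy as np

def pla(X, y, max_iters=1000):
    """Returns learned weights and whether PLA converged within the cap."""
    w = np.zeros(X.shape[1])
    for _ in range(max_iters):
        misclassified = np.where(np.sign(X @ w) != y)[0]
        if len(misclassified) == 0:
            return w, True                # all points classified correctly
        n = np.random.choice(misclassified)
        w = w + y[n] * X[n]               # the PLA update: w <- w + y_n x_n
    return w, False                       # cap reached: data may not be separable
```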

Example: Linear Classification

Input Representation
• Raw input: a 16×16 image = 256 pixels
  – x = (x_0, x_1, x_2, …, x_256)
  – Linear model: w = (w_0, w_1, w_2, …, w_256)
• Features: extract useful information from the raw data
  – Intensity and symmetry: x = (x_0, x_1, x_2)
  – Linear model: w = (w_0, w_1, w_2)
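A sketch of the feature extraction; the exact definitions of intensity and symmetry are not given on the slide, so the ones below (average pixel value, and negated left-right asymmetry) are assumptions:

```python
# Reduce a 16x16 image (256 raw pixels) to the 3-component feature
# vector (x_0, intensity, symmetry).
import numpy as np

def extract_features(img):
    """img: 16x16 array of pixel values."""
    intensity = img.mean()                           # assumed definition
    symmetry = -np.abs(img - np.fliplr(img)).mean()  # assumed definition
    return np.array([1.0, intensity, symmetry])      # x_0 = 1 is the bias

x = extract_features(np.random.rand(16, 16))         # e.g. [1.0, 0.5, -0.3]
```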

Feature Space
• Consider a binary classification task
  – “Is x a ‘1’ (positive) or a ‘5’ (negative)?”
• Not linearly separable

PLA
• Not linearly separable ⇒ there is always a misclassified example
  – PLA will never converge
  – So, let’s just stop after (let’s say) 1000 iterations
• What’s the fix?

‘Pocket’ Algorithm
• Run PLA for a fixed number of iterations
• Keep the best hypothesis so far (lowest E_in) in your “pocket”
• Return the best hypothesis
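A sketch of Pocket built on the PLA update above; the `w_init` parameter (used again later to warm-start from regression weights) is an added convenience, not part of the slide:

```python
# Pocket algorithm: run PLA, but remember the best-so-far hypothesis.
import numpy as np

def in_sample_error(w, X, y):
    """Fraction of training points that w misclassifies (E_in)."""
    return np.mean(np.sign(X @ w) != y)

def pocket(X, y, max_iters=1000, w_init=None):
    w = np.zeros(X.shape[1]) if w_init is None else w_init.copy()
    best_w, best_err = w.copy(), in_sample_error(w, X, y)
    for _ in range(max_iters):
        misclassified = np.where(np.sign(X @ w) != y)[0]
        if len(misclassified) == 0:
            return w                      # separable after all: PLA converged
        n = np.random.choice(misclassified)
        w = w + y[n] * X[n]               # ordinary PLA update
        err = in_sample_error(w, X, y)
        if err < best_err:                # put the better hypothesis in the pocket
            best_w, best_err = w.copy(), err
    return best_w
```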

PLA versus Pocket

Linear Regression
• Classification: credit approval (yes / no)
• Regression: credit line (dollar amount)
• Same input: x = (x_0, x_1, …, x_d)
• Linear regression output: h(x) = Σ_{i=0}^d w_i x_i = wᵀx

Training Data
• Credit officers decide on credit lines:
  – (x_1, y_1), …, (x_N, y_N), where y_n ∈ ℝ is the credit line for the corresponding customer x_n
• Linear regression attempts to replicate this

Error Measure
• Classification error (right or wrong) won’t work here
  – With regression, you can be “close” or “far”
• How well does h(x) = wᵀx approximate f(x)?
• We use squared error: (h(x) − f(x))²

Linear Regression Visually
• Circles are training examples
• The line/plane represents the hypothesis
• Red lines represent the in-sample error

Training (In-Sample) Error, E_in
• E_in(w) = (1/N) Σ_{n=1}^N (wᵀx_n − y_n)²
• In matrix form: E_in(w) = (1/N) ‖Xw − y‖², where the rows of X are x_1ᵀ, …, x_Nᵀ and y = (y_1, …, y_N)

Minimizing E_in
• ∇E_in(w) = (2/N) Xᵀ(Xw − y) = 0
• So XᵀXw = Xᵀy, giving w = X†y
• X† = (XᵀX)⁻¹Xᵀ is the pseudo-inverse of X

Linear Regression Algorithm
• Construct the matrix X and the vector y from the training data (include the x_0 = 1 coordinate)
• Compute the pseudo-inverse X† = (XᵀX)⁻¹Xᵀ
• Return w = X†y
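A minimal sketch of this one-shot algorithm; `np.linalg.pinv` computes the pseudo-inverse, and the data below is synthetic:

```python
# Linear regression via the pseudo-inverse: w = pinv(X) @ y.
import numpy as np

def linear_regression(X, y):
    """Weights minimizing the in-sample squared error E_in."""
    return np.linalg.pinv(X) @ y   # equals (X'X)^-1 X'y when X'X is invertible

# Hypothetical usage: recover y = 2x + 1 from noisy samples.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
X = np.column_stack([np.ones_like(x), x])   # add the x_0 = 1 coordinate
y = 2 * x + 1 + rng.normal(0, 0.1, size=50)
print(linear_regression(X, y))              # approximately [1.0, 2.0]
```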

Linear Regression for Classification
• Linear regression learns a real-valued function y = f(x)
• Binary-valued functions are also real-valued: {+1, −1} ⊂ ℝ
• Use linear regression to get w where wᵀx_n ≈ y_n = ±1
  – sign(wᵀx_n) is then likely to agree with y_n
• These are good initial weights for PLA / Pocket (see the sketch below)
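Continuing the sketches above, the warm start is a two-liner; `X_train` and `labels` are hypothetical stand-ins for the digits data:

```python
# Regression weights as the starting point for Pocket, reusing the
# linear_regression and pocket sketches defined earlier.
w0 = linear_regression(X_train, labels)   # labels are +1/-1 floats
w = pocket(X_train, labels, max_iters=1000, w_init=w0)
```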

Decision Boundaries
[Figure: the decision boundary found by linear regression]

Recap
• Linear methods
  – Lines, planes, hyperplanes
  – Classification (PLA, Pocket Algorithm)
  – Regression (pseudo-inverse)
• Benefits of linear approaches
  – Simple algorithms
  – (Human) interpretable model parameters
• Drawbacks
  – Most real-world problems are not linear

Back to Credit Example
• Credit line is affected by ‘years in residence’
  – But perhaps not linearly
• Perhaps credit officers use a different model
  – Is ‘years in residence’ less than 1?
  – Is ‘years in residence’ greater than 5?
• Can we do that with linear models?

Linear?

Linear in what?
• Linear regression implements: h(x) = wᵀx
• Linear classification implements: h(x) = sign(wᵀx)
• What are the unknowns? The weights, w (the inputs x are given by the data)
• The algorithms work because of linearity in the weights

Nonlinear Transformations
• Map the inputs through a nonlinear transform, z = Φ(x), and run a linear algorithm on z
• We’ll revisit this idea for support vector machines
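A minimal sketch using the ‘years in residence’ questions from the credit example; this particular transform is an illustrative choice, not the slide’s exact one:

```python
# Nonlinear transform z = (1, [x < 1], [x > 5]): nonlinear in the input x,
# but the hypothesis h(x) = sign(w'z) is still linear in the weights w,
# so PLA / Pocket / regression apply unchanged in z-space.
import numpy as np

def transform(x):                       # x: years in residence (a scalar)
    return np.array([1.0, float(x < 1), float(x > 5)])

Z = np.array([transform(x) for x in [0.5, 3.0, 7.0]])
w = np.array([-1.0, 2.0, 2.0])          # illustrative weights
print(np.sign(Z @ w))                   # [ 1. -1.  1.]
```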

Beyond Perceptrons
• Two main approaches (coming soon!)
  – Combining perceptrons: neural networks
  – Efficient data transforms: kernel methods

Summary: Credit Analysis
• Benefits of linear approaches
  – Simple algorithms
  – (Human) interpretable model parameters
• Drawbacks
  – Most real-world problems are not linear
• Nonlinear transformations can help