Machine Learning ICS 178 Instructor: Max Welling Supervised Learning

Supervised Learning I Example: Imagine you want to classify images of monkeys versus humans. Data: 100 monkey images and 200 human images with labels indicating which is which, where x represents the greyscale values of the image pixels and y=0 means “monkey” while y=1 means “human”. Task: Here is a new image: monkey or human?

1-Nearest Neighbor (your first ML algorithm!) Idea: 1. Find the picture in the database which is closest to your query image. 2. Check its label. 3. Declare the class of your query image to be the same as that of the closest picture. (Figure: a query image and the closest image in the database.)
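
A minimal sketch of this procedure in Python/NumPy (the array layout and the use of Euclidean distance are assumptions for illustration, not prescribed by the slides):

```python
import numpy as np

def one_nearest_neighbor(query, images, labels):
    """Classify `query` by the label of the closest image in the database.

    images : (N, D) array, one flattened greyscale image per row
    labels : (N,)   array, 0 = "monkey", 1 = "human"
    query  : (D,)   array, the new image to classify
    """
    # 1. Distance from the query to every stored image.
    distances = np.linalg.norm(images - query, axis=1)
    # 2. Index of the closest image.
    closest = np.argmin(distances)
    # 3. Declare the query's class to be that of the closest picture.
    return labels[closest]
```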

1-NN Decision Surface (figure: the decision curve between the two classes)

Distance Metric • How do we measure what it means to be “close”? • Depending on the problem we should choose an appropriate “distance” metric (or more generally, a (dis)similarity measure). - Demo: http://www.comp.lancs.ac.uk/~kristof/research/notes/nearb/cluster.html - Matlab Demo.
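
Two common concrete choices, as a sketch (the slides do not commit to a particular metric):

```python
import numpy as np

def euclidean(a, b):
    # Straight-line (L2) distance: sensitive to the scale of each feature.
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan(a, b):
    # City-block (L1) distance: sums the absolute differences per feature.
    return np.sum(np.abs(a - b))
```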

Remarks on NN methods • We only need to construct a classifier that works locally for each query. Hence: we don’t need to construct a classifier everywhere in space. • Classifying is done at query time. This can be computationally taxing at a time when you might want to be fast. • Memory inefficient. • Curse of dimensionality: imagine many features are irrelevant or noisy; then distances are always large. • Very flexible, not many prior assumptions. • k-NN variants are robust against “bad examples” (see the sketch below).
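
One such variant takes a majority vote over the k closest examples instead of trusting a single neighbor; a sketch (the default k=5 is an arbitrary assumption):

```python
import numpy as np

def knn_classify(query, data, labels, k=5):
    """Majority vote among the k training points closest to `query`."""
    distances = np.linalg.norm(data - query, axis=1)
    nearest = np.argsort(distances)[:k]      # indices of the k closest points
    votes = np.bincount(labels[nearest])     # count each class label among them
    return np.argmax(votes)                  # the most frequent class wins
```

A single mislabeled (“bad”) example can no longer flip the decision on its own, because it is outvoted by the other neighbors.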

Non-parametric Methods • Non-parametric methods keep all the data cases/examples in memory. • A better name is: “instance-based” learning. • As the data-set grows, the complexity of the decision surface grows. • Sometimes, non-parametric methods have some parameters to tune... • “Assumption light”

Unsupervised learning • In unsupervised learning we are concerned with finding the structure in the data. • For instance, we may want to learn the probability distribution from which the data was sampled. • Data now looks like (no labels): {x_1, ..., x_N}. • A nonparametric estimate of the probability distribution can be obtained through Parzen windowing (or kernel estimation).

Parzen Windowing • First define a “kernel”. This is some function that “smoothes” a data-point and satisfies: K(x) ≥ 0 and ∫ K(x) dx = 1. • We estimate the distribution as: p(x) = (1/N) Σ_n K(x − x_n). • Example: Gaussian kernel (RBF kernel): K(x − x_n) ∝ exp(−‖x − x_n‖² / (2σ²)). • When you add these bumps up with weights 1/N you get a smooth estimate of the density.
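
A sketch of this estimator for one-dimensional data (the function name and the default width are illustration choices):

```python
import numpy as np

def parzen_density(x, data, sigma=0.5):
    """Parzen-window estimate p(x) = (1/N) * sum_n K(x - x_n), Gaussian kernel, 1-D data."""
    # One Gaussian bump per data point, evaluated at x.
    bumps = np.exp(-(x - data) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)
    # Average the N bumps with weight 1/N each.
    return bumps.mean()
```

Evaluating this on a grid of x values and plotting the result gives a smoothed version of the histogram of the data.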

Example: Distribution of Grades

Overfitting • Fundamental issue with Parzen windowing: How do you choose the width of the kernel? • Too broad: you lose the details; too small: you see too much detail. • Imagine you left out part of the data-points when you produced your estimate for the distribution. A good fit is one that would give high probability to the left-out data-points. • So you can tune the kernel width on the left-out data! • This is called (cross-)validation. • This issue is fundamental to machine learning: how much fitting is just right.
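
A sketch of that idea, using the parzen_density function above and a single held-out split (the split and the candidate widths are illustration choices, not a prescribed recipe):

```python
import numpy as np

def choose_width(data, candidate_widths, held_out_fraction=0.2):
    """Pick the kernel width that gives the held-out points the highest log-likelihood."""
    n_out = int(len(data) * held_out_fraction)
    held_out, train = data[:n_out], data[n_out:]
    scores = []
    for sigma in candidate_widths:
        # Log-probability of each left-out point under the density fit on the rest.
        log_liks = [np.log(parzen_density(x, train, sigma)) for x in held_out]
        scores.append(np.sum(log_liks))
    return candidate_widths[int(np.argmax(scores))]
```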

Overfitting in a Picture (figure: curves of the error on hold-out data and the error on training data)

Generalization • Consider the following regression problem: • Predict the real value on the y-axis from the real value on the x-axis. • You are given 6 examples {Xi, Yi}. • What is the y-value for a new query point X*? (Figure: the six points and the query location X* on the x-axis.)
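
To make the question concrete, here is a sketch that fits polynomials of increasing degree to six made-up points and predicts at a query X* (all numbers are invented for illustration):

```python
import numpy as np

# Six made-up examples {Xi, Yi} and a query point.
X = np.array([0.0, 0.4, 1.1, 1.9, 2.6, 3.0])
Y = np.array([0.1, 0.5, 0.9, 0.7, 0.3, 0.4])
x_star = 2.2

for degree in (1, 3, 5):
    # Degree 5 passes through all six points exactly (zero training error),
    # yet that does not make its prediction at x_star the most trustworthy one.
    coeffs = np.polyfit(X, Y, degree)
    print(degree, np.polyval(coeffs, x_star))
```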

Generalization (figure-only slides showing candidate curves fit through the examples)

Generalization: which curve is best?

Generalization • Ockham’s razor: prefer the simplest hypothesis consistent with data.

Your Take-Home Message Learning is concerned with accurate prediction of future data, not accurate prediction of training data. (The single most important sentence you will see in the course)

Homework • Read chapter 4, pages 161-165, 168-17. • Download the Netflix data and plot (a possible starting point is sketched below): – ratings as a function of time. – variance in ratings as a function of the mean of movie ratings.
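
A possible starting point for these plots, assuming the ratings have already been exported to a flat file with columns movie_id, rating and date (the file name and column names are assumptions about your local copy, not part of the assignment):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical flat file with one rating per row.
ratings = pd.read_csv("netflix_ratings.csv", parse_dates=["date"])

# Ratings as a function of time (daily mean).
ratings.groupby("date")["rating"].mean().plot()
plt.xlabel("date"); plt.ylabel("mean rating"); plt.show()

# Variance in ratings as a function of the mean rating, per movie.
per_movie = ratings.groupby("movie_id")["rating"].agg(["mean", "var"])
plt.scatter(per_movie["mean"], per_movie["var"], s=5)
plt.xlabel("mean movie rating"); plt.ylabel("variance of ratings"); plt.show()
```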