Machine Learning Lecture 4 Greedy Local Search Hill

Local search algorithms • We’ve discussed ways to select a hypothesis h that performs

Example: n-queens • Put n queens on an n × n board with no

Hill-climbing search • "Like climbing Everest in thick fog with amnesia“ h = initial.

Hill-climbing search • Problem: depending on initial state, can get stuck in local maxima

Underfitting • Overfitting: Performance on test examples is much lower than on training examples

Hill-climbing search: 8 -queens problem v =17 • v = number of pairs of

Hill-climbing search: 8 -queens problem • A local minimum with v = 1 Adapted

Simulated annealing search • Idea: escape local maxima by allowing some "bad" moves but

Properties of simulated annealing • One can prove: If T decreases slowly enough, then

Local beam search • Keep track of k states rather than just one •

Gradient Descent • Hill Climbing and Simulated Annealing are “generate and test” algorithms –

Adapted by Doug Downey from Bryan Pardo Fall 2007 Machine Learning EECS 349

About distance…. • Clustering requires distance measures. • Local methods require a measure of

Euclidean Distance Dimension 2: y • What people intuitively think of as “distance” Dimension

Generalized Euclidean Distance Adapted by Doug Downey from Bryan Pardo Fall 2007 Machine Learning

Weighting Dimensions • Apparent clusters at one scaling of X are not so apparent

Weighted Euclidean Distance • You can, of course compensate by weighting your dimensions…. Adapted

More Generalization: Minkowsky metric • My three favorites are special cases of this: Adapted

What is a “metric”? • A metric has these four qualities. • …otherwise, call

Metric, or not? • Driving distance with 1 -way streets • Categorical Stuff :

What about categorical variables? • Consider feature vectors for genre & vocals – Genre:

Binary Features + Hamming distance Blues Jazz 0 s 1 = {rock, yes} s

Hamming Distance Adapted by Doug Downey from Bryan Pardo Fall 2007 Machine Learning EECS

Other approaches… • Define your own distance: f(a, b) Quote Frequency Beethoven Beatles Liz

Missing data • What if, for some category, on some examples, there is no

Dealing with missing data Adapted by Doug Downey from Bryan Pardo Fall 2007 Machine

Slides: 27

Download presentation

Machine Learning Lecture 4: Greedy Local Search (Hill Climbing) Adapted by Doug Downey from Bryan Pardo Fall 2007 Machine Learning EECS 349

Local search algorithms • We’ve discussed ways to select a hypothesis h that performs well on training examples, e. g. – Candidate-Elimination – Decision Trees • Another technique that is quite general: – Start with some (perhaps random) hypothesis h – Incrementally improve h • Known as local search Adapted by Doug Downey from Bryan Pardo Fall 2007 Machine Learning EECS 349

Example: n-queens • Put n queens on an n × n board with no two queens on the same row, column, or diagonal Adapted by Doug Downey from Bryan Pardo Fall 2007 Machine Learning EECS 349

Hill-climbing search • "Like climbing Everest in thick fog with amnesia“ h = initial. State loop: h’ = highest valued Successor(h) if Value(h) >= Value(h’) return h else h = h’ Adapted by Doug Downey from Bryan Pardo Fall 2007 Machine Learning EECS 349

Hill-climbing search • Problem: depending on initial state, can get stuck in local maxima Adapted by Doug Downey from Bryan Pardo Fall 2007 Machine Learning EECS 349

Underfitting • Overfitting: Performance on test examples is much lower than on training examples • Underfitting: Performance on training examples is low Two leading causes: – Hypothesis space is too small/simple – Training algorithm (i. e. , hypothesis search algorithm) stuck in local maxima Adapted by Doug Downey from Bryan Pardo Fall 2007 Machine Learning EECS 349

Hill-climbing search: 8 -queens problem v =17 • v = number of pairs of queens that are attacking each other, either directly or indirectly Adapted by Doug Downey from Bryan Pardo Fall 2007 Machine Learning EECS 349

Hill-climbing search: 8 -queens problem • A local minimum with v = 1 Adapted by Doug Downey from Bryan Pardo Fall 2007 Machine Learning EECS 349

Simulated annealing search • Idea: escape local maxima by allowing some "bad" moves but gradually decrease their frequency h = initial. State T = initial. Temperature loop: h’ = random Successor(h) if ( V = Value(h’)-Value(h)) > 0 h = h’ else h = h’ with probability e V/T decrease T; if T==0, return h Adapted by Doug Downey from Bryan Pardo Fall 2007 Machine Learning EECS 349

Properties of simulated annealing • One can prove: If T decreases slowly enough, then simulated annealing search will find a global optimum with probability approaching 1 • Widely used in VLSI layout, airline scheduling, etc Adapted by Doug Downey from Bryan Pardo Fall 2007 Machine Learning EECS 349

Local beam search • Keep track of k states rather than just one • Start with k randomly generated states • At each iteration, all the successors of all k states are generated • If any one is a goal state, stop; else select the k best successors from the complete list and repeat. Adapted by Doug Downey from Bryan Pardo Fall 2007 Machine Learning EECS 349

Gradient Descent • Hill Climbing and Simulated Annealing are “generate and test” algorithms – Successor function generates candidates, Value function helps select • In some cases, we can do much better: – Define: Error(training data D, hypothesis h) – If h is represented by parameters w 1, …wn and d. Error/dwi is known, we can compute the error gradient, and descend in the direction that is (locally) steepest Adapted by Doug Downey from Bryan Pardo Fall 2007 Machine Learning EECS 349

Adapted by Doug Downey from Bryan Pardo Fall 2007 Machine Learning EECS 349

About distance…. • Clustering requires distance measures. • Local methods require a measure of “locality” • Search engines require a measure of similarity • So…. when are two things close to each other? Adapted by Doug Downey from Bryan Pardo Fall 2007 Machine Learning EECS 349

Euclidean Distance Dimension 2: y • What people intuitively think of as “distance” Dimension 1: x Adapted by Doug Downey from Bryan Pardo Fall 2007 Machine Learning EECS 349

Generalized Euclidean Distance Adapted by Doug Downey from Bryan Pardo Fall 2007 Machine Learning EECS 349

Weighting Dimensions • Apparent clusters at one scaling of X are not so apparent at another scaling Adapted by Doug Downey from Bryan Pardo Fall 2007 Machine Learning EECS 349

Weighted Euclidean Distance • You can, of course compensate by weighting your dimensions…. Adapted by Doug Downey from Bryan Pardo Fall 2007 Machine Learning EECS 349

More Generalization: Minkowsky metric • My three favorites are special cases of this: Adapted by Doug Downey from Bryan Pardo Fall 2007 Machine Learning EECS 349

What is a “metric”? • A metric has these four qualities. • …otherwise, call it a “measure” Adapted by Doug Downey from Bryan Pardo Fall 2007 Machine Learning EECS 349

Metric, or not? • Driving distance with 1 -way streets • Categorical Stuff : – Is distance Jazz -> Blues -> Rock no less than distance Jazz -> Rock? Adapted by Doug Downey from Bryan Pardo Fall 2007 Machine Learning EECS 349

What about categorical variables? • Consider feature vectors for genre & vocals – Genre: {Blues, Jazz, Rock, Zydeco} – Vocals: {vocals, no vocals} s 1 = {rock, vocals} s 2 = {jazz, no vocals} s 3 = { rock, no vocals} • Which two songs are more similar? Adapted by Doug Downey from Bryan Pardo Fall 2007 Machine Learning EECS 349

Binary Features + Hamming distance Blues Jazz 0 s 1 = {rock, yes} s 2 = {jazz, no} 0 s 3 = { rock, no vocals} 0 Rock Zydeco Vocals 0 1 1 0 0 Hamming Distance = number of bits different between binary vectors Adapted by Doug Downey from Bryan Pardo Fall 2007 Machine Learning EECS 349

Hamming Distance Adapted by Doug Downey from Bryan Pardo Fall 2007 Machine Learning EECS 349

Other approaches… • Define your own distance: f(a, b) Quote Frequency Beethoven Beatles Liz Phair Beethoven 7 0 0 Beatles 4 5 0 Liz Phair ? 1 2 Adapted by Doug Downey from Bryan Pardo Fall 2007 Machine Learning EECS 349

Missing data • What if, for some category, on some examples, there is no value given? • Approaches: – Discard all examples missing the category – Fill in the blanks with the mean value – Only use a category in the distance measure if both examples give a value Adapted by Doug Downey from Bryan Pardo Fall 2007 Machine Learning EECS 349

Dealing with missing data Adapted by Doug Downey from Bryan Pardo Fall 2007 Machine Learning EECS 349