Gradient Descent Slides adapted from David Kauchak Michael

  • Slides: 35
Download presentation
Gradient Descent Slides adapted from David Kauchak, Michael T. Brannick, Ethem Alpaydin, and Yaser

Gradient Descent Slides adapted from David Kauchak, Michael T. Brannick, Ethem Alpaydin, and Yaser Abu-Mostafa

Linear models A strong high-bias assumption is linear separability: • in 2 dimensions, can

Linear models A strong high-bias assumption is linear separability: • in 2 dimensions, can separate classes by a line • in higher dimensions, need hyperplanes A linear model assumes the data is linearly separable 2

Linear models A linear model in n-dimensional space (i. e. n features) is defined

Linear models A linear model in n-dimensional space (i. e. n features) is defined by n+1 weights: In two dimensions, a line: bias In three dimensions, a plane: Generally, in m-dimensions, a hyperplane: 3

Perceptron Learning Algorithm (PLA) 4

Perceptron Learning Algorithm (PLA) 4

A closer look at PLA Current weights: (0, 0, 1) Current example: (1, -1)

A closer look at PLA Current weights: (0, 0, 1) Current example: (1, -1) + bias We’d like this value to be positive since it’s a positive example increase decrease contributed in the wrong direction decrease 0 -> 1 0 -> -1 1 -> 0 didn’t contribute, but could have 5

Which line will it find? 6

Which line will it find? 6

Which line will it find? Only guaranteed to find some line that separates the

Which line will it find? Only guaranteed to find some line that separates the data 7

Linear models • Perceptron algorithm is one example of a linear classifier • Many,

Linear models • Perceptron algorithm is one example of a linear classifier • Many, many other algorithms that learn a line (i. e. a setting of a linear combination of weights) • Goals: • Explore a number of linear training algorithms • Understand why these algorithms work 8

Model-based machine learning 1. pick a model - e. g. a hyperplane, a decision

Model-based machine learning 1. pick a model - e. g. a hyperplane, a decision tree, … A model is defined by a collection of parameters 2. pick a criterion to optimize (aka objective function) - e. g. training error 3. develop a learning algorithm - the algorithm should try and minimize the criteria sometimes in a heuristic way (i. e. non-optimally), sometimes exactly 9

Linear Models 1. pick a model These are the parameters we want to learn

Linear Models 1. pick a model These are the parameters we want to learn 2. pick a criteria to optimize (aka objective function) Indicator Function 3. develop a learning algorithm Find w and b that minimize the 0/1 loss (i. e. training error) 10

0/1 loss function whether or not the prediction and label agree, true if they

0/1 loss function whether or not the prediction and label agree, true if they don’t - distance from hyperplane - sign is prediction total number of mistakes, aka 0/1 loss 11

Linear Models 1. pick a model These are the parameters we want to learn

Linear Models 1. pick a model These are the parameters we want to learn 2. pick a criteria to optimize (aka objective function) 3. develop a learning algorithm Find w and b that minimize the 0/1 loss (i. e. training error) 12

Minimizing 0/1 loss Find w and b that minimize the 0/1 loss 13

Minimizing 0/1 loss Find w and b that minimize the 0/1 loss 13

Minimizing 0/1 in one dimension loss Why is it hard to minimize this function?

Minimizing 0/1 in one dimension loss Why is it hard to minimize this function? w Each time we change w such that the example is right/wrong the loss will increase/decrease 14

Minimizing 0/1 over all w loss Each feature (i. e. weight) represents another dimension

Minimizing 0/1 over all w loss Each feature (i. e. weight) represents another dimension to this space! 15

Minimizing 0/1 loss Find w and b that minimize the 0/1 loss This turns

Minimizing 0/1 loss Find w and b that minimize the 0/1 loss This turns out to be hard (in fact, NP-HARD ) Challenges: - small changes in some w can have large effect on loss - large changes in some w can have no effect on loss - there can be many, many local minima - at any given point, not much information to lead to any minima 16

More manageable loss functions loss w What property/properties do we want from our loss

More manageable loss functions loss w What property/properties do we want from our loss function? 17

More manageable loss functions loss w - Ideally, continuous (i. e. differentiable) so we

More manageable loss functions loss w - Ideally, continuous (i. e. differentiable) so we get an indication of direction of minimization - Only one minima 18

Convex functions • Smooth, continuous • Local = global minimum • Line segment between

Convex functions • Smooth, continuous • Local = global minimum • Line segment between any two points on the function • is above the function • Stays “inside” function Convex Non-convex 19

Surrogate loss functions • For many applications, we really would like to minimize the

Surrogate loss functions • For many applications, we really would like to minimize the 0/1 loss • A surrogate loss function is a loss function that provides an upper bound on the actual loss function (in this case, 0/1) • convex surrogate loss functions should be easier to minimize • Key to a loss function: how it scores the difference between the actual label y and the predicted label y’ 20

Surrogate loss functions 0/1 loss: Hinge: Exponential: Squared loss: 21

Surrogate loss functions 0/1 loss: Hinge: Exponential: Squared loss: 21

Surrogate loss functions Hinge: 0/1 (classification) loss: Squared (Quadratic) loss: Exponential: 22

Surrogate loss functions Hinge: 0/1 (classification) loss: Squared (Quadratic) loss: Exponential: 22

Linear Model Learning w/ Surrogate Loss 1. pick a model 2. pick a criteria

Linear Model Learning w/ Surrogate Loss 1. pick a model 2. pick a criteria to optimize (aka objective function) use a convex surrogate loss function (e. g. , exponential) 3. develop a learning algorithm Find w and b that minimize surrogate loss 23

Minimizing convex loss function Find w and b that minimize the surrogate loss w

Minimizing convex loss function Find w and b that minimize the surrogate loss w How do we do this now? 24

One approach: gradient descent Partial derivatives give us the slope (i. e. direction to

One approach: gradient descent Partial derivatives give us the slope (i. e. direction to move) in that dimension loss w 25

One approach: gradient descent Partial derivatives give us the slope (i. e. direction to

One approach: gradient descent Partial derivatives give us the slope (i. e. direction to move) in that dimension loss Approach: • pick a starting point (w) • repeat: • pick a dimension • move a small amount in that dimension towards decreasing loss (using the derivative) w 26

Gradient descent • pick a starting point (w) • repeat until loss doesn’t decrease

Gradient descent • pick a starting point (w) • repeat until loss doesn’t decrease in any dimension: • pick a dimension • move a small amount in that dimension towards decreasing loss (using the derivative) learning rate (how much we want to move in the error direction, often this will change over time) 27

Gradient Descent w/ Exponential Loss 28

Gradient Descent w/ Exponential Loss 28

Gradient descent • pick a starting point (w) • repeat until loss doesn’t decrease

Gradient descent • pick a starting point (w) • repeat until loss doesn’t decrease in any dimension: • pick a dimension • move a small amount in that dimension towards decreasing loss (using the derivative) 29

Exponential update rule for each example xi: Notice the weight update. Does this look

Exponential update rule for each example xi: Notice the weight update. Does this look familiar? 30

Perceptron learning algorithm! or where 31

Perceptron learning algorithm! or where 31

The “new” constant learning rate label prediction When is this large/small? 32

The “new” constant learning rate label prediction When is this large/small? 32

The “new” constant label prediction If prediction is correct (same sign), as the prediction

The “new” constant label prediction If prediction is correct (same sign), as the prediction gets larger, the update gets smaller If the prediction is incorrect (different sign), the more different they are, the bigger the update 33

Perceptron learning algorithm! Note: for gradient descent, we always update or where 34

Perceptron learning algorithm! Note: for gradient descent, we always update or where 34

Recap Model-based machine learning: - define a model, objective function (i. e. loss function),

Recap Model-based machine learning: - define a model, objective function (i. e. loss function), minimization algorithm Gradient descent minimization algorithm - requires that our loss function is convex make small updates towards lower losses Perceptron learning algorithm: - gradient descent exponential loss function (modulo a learning rate) 35