Learning: Neural Networks Artificial Intelligence CMSC 25000 February 3, 2005
Roadmap • Neural Networks – Motivation: Overcoming perceptron limitations – Motivation: ALVINN – Heuristic Training • Backpropagation; Gradient descent • Avoiding overfitting • Avoiding local minima – Conclusion: Teaching a Net to talk
Perceptron Summary • Motivated by neuron activation • Simple training procedure • Guaranteed to converge – IF linearly separable
Neural Nets
• Multi-layer perceptrons
– Inputs: real-valued
– Intermediate “hidden” nodes
– Output(s): one (or more) discrete-valued
[Figure: inputs X1-X4 feed a layer of hidden nodes, which feed outputs Y1 and Y2]
Neural Nets • Pro: More general than perceptrons – Not restricted to linear discriminants – Multiple outputs: one classification each • Con: No simple, guaranteed training procedure – Use greedy, hill-climbing procedure to train – “Gradient descent”, “Backpropagation”
Solving the XOR Problem
• Network topology: 2 hidden nodes (o1, o2), 1 output (y); each node also takes a fixed -1 bias input
• Desired behavior (x1, x2 -> o1, o2, y): (0,0) -> 0,0,0; (0,1) -> 0,1,1; (1,0) -> 0,1,1; (1,1) -> 1,1,0
• Weights: w11 = w12 = 1; w21 = w22 = 1; w01 = 3/2; w02 = 1/2; w03 = 1/2; w13 = -1; w23 = 1
• With these weights, o1 computes AND(x1, x2), o2 computes OR(x1, x2), and y = o2 AND NOT o1 = XOR(x1, x2) (see the sketch below)
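A minimal sketch of this network in Python, using the weights above with a hard step threshold (the function and variable names are mine, not from the slides; the -1 bias inputs appear as subtracted thresholds):

def step(z):
    # Hard threshold: a node fires when its weighted sum exceeds 0
    return 1 if z > 0 else 0

def xor_net(x1, x2):
    o1 = step(x1 + x2 - 3/2)        # w11 = w21 = 1, bias weight w01 = 3/2: AND
    o2 = step(x1 + x2 - 1/2)        # w12 = w22 = 1, bias weight w02 = 1/2: OR
    return step(-o1 + o2 - 1/2)     # w13 = -1, w23 = 1, w03 = 1/2: OR and not AND

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", xor_net(a, b))   # prints 0, 1, 1, 0: XOR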
Neural Net Applications • Speech recognition • Handwriting recognition • NETtalk: Letter-to-sound rules • ALVINN: Autonomous driving
ALVINN • Driving as a neural network • Inputs: – Image pixel intensities, i.e., lane lines • 5 hidden nodes • Outputs: – Steering actions, e.g., turn left/right and how far • Training: – Observe human behavior: sample images, steering
Backpropagation • Greedy, Hill-climbing procedure – Weights are parameters to change – Original hill-climb changes one parameter/step • Slow – If smooth function, change all parameters/step • Gradient descent – Backpropagation: Computes current output, works backward to correct error
Producing a Smooth Function • Key problem: – Pure step threshold is discontinuous • Not differentiable • Solution: – Sigmoid (squashed ‘s’ function): the logistic function, s(z) = 1 / (1 + e^-z)
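A small sketch of the logistic function and its derivative in Python (names mine); the identity s'(z) = s(z)(1 - s(z)) is what keeps the gradient computations below cheap:

import math

def sigmoid(z):
    # Logistic function: a smooth, differentiable stand-in for the step threshold
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_deriv(z):
    # Derivative via the identity s'(z) = s(z) * (1 - s(z))
    s = sigmoid(z)
    return s * (1.0 - s)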
Neural Net Training
• Goal:
– Determine how to change weights to get correct output
– Large change in weight should produce large reduction in error
• Approach:
– Compute actual output: o
– Compare to desired output: d
– Determine effect of each weight w on error = d - o
– Adjust weights
Neural Net Example
• Notation: x_i: ith sample input vector; w: weight vector; y_i*: desired output for ith sample
• Sum-of-squares error over training samples
• Full expression of output in terms of input and weights
[Figure: example network with inputs x1, x2; hidden nodes z1, z2 (outputs y1, y2); output node z3 (output y3); -1 bias inputs with weights w01, w02, w03]
From MIT 6.034 notes, Lozano-Pérez
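The error formula itself was a figure on the slide; here is a hedged sketch of the usual sum-of-squares form, where predict stands in for the network's forward computation (the 1/2 scale factor is a common convention that simplifies the derivative and may differ from the slide):

def sum_squares_error(samples, predict):
    # E = 1/2 * sum over samples (x, y*) of (y* - y)^2, with y = predict(x)
    return 0.5 * sum((y_star - predict(x)) ** 2 for x, y_star in samples)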
Gradient Descent • Error: Sum of squares error of inputs with current weights • Compute rate of change of error w.r.t. each weight – Which weights have greatest effect on error? – Effectively, partial derivatives of error w.r.t. weights • In turn, these depend on other weights => use the chain rule
Gradient Descent
• E = G(w) – Error as a function of the weights
• Find rate of change of error: dG/dw
– Follow steepest rate of change
– Change weights s.t. error is minimized
[Figure: error curve G(w) vs. w, showing a step from initial weight w0 downhill toward w1, with local minima elsewhere on the curve]
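A one-line sketch of the resulting update rule, w <- w - r * dG/dw for each weight (names mine; r is the rate parameter discussed below):

def gradient_descent_step(weights, gradients, r):
    # Move each weight a small step against its error gradient
    return [w - r * g for w, g in zip(weights, gradients)]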
Gradient of Error
• Note: derivative of the sigmoid: ds(z1)/dz1 = s(z1)(1 - s(z1))
[Figure: the example network (inputs x1, x2; hidden nodes z1, z2; output node z3/y3; -1 bias inputs) annotated with the chain-rule expansion of the error gradient]
From MIT 6.034 notes, Lozano-Pérez (MIT AI lecture notes, 2000)
From Effect to Update • Gradient computation: – How each weight contributes to performance • To train: – Need to determine how to CHANGE a weight based on its contribution to performance – Need to determine how MUCH change to make per iteration • Rate parameter ‘r’ – Large enough to learn quickly – Small enough to reach but not overshoot target values
Backpropagation Procedure
• Pick rate parameter ‘r’
• Until performance is good enough:
– Do forward computation to calculate the output o of every node
– Compute Beta in the output node: beta_z = d - o_z
– Compute Beta in all other nodes: beta_j = sum over successors k of w_{j->k} * o_k * (1 - o_k) * beta_k
– Compute the change for all weights: delta w_{i->j} = r * o_i * o_j * (1 - o_j) * beta_j
(Beta recurrences as in the standard Winston/6.034 formulation; a sketch of the whole procedure follows below.)
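A runnable sketch of this procedure in Python for the two-hidden-node, one-output network used in these slides, trained here on XOR. It follows the Beta rules above; function and variable names are mine, and the fixed bias input is +1 with its sign folded into the weight (the slides draw it as -1):

import math, random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_xor(samples, r=0.5, epochs=20000, tol=0.1):
    rnd = lambda: random.uniform(-0.5, 0.5)           # small random initial weights
    w_h = [[rnd(), rnd(), rnd()] for _ in range(2)]   # hidden: [from x1, from x2, bias]
    w_o = [rnd(), rnd(), rnd()]                       # output: [from h1, from h2, bias]
    for _ in range(epochs):
        worst = 0.0
        for (x1, x2), d in samples:
            # Forward computation
            o_h = [sigmoid(w[0]*x1 + w[1]*x2 + w[2]) for w in w_h]
            y = sigmoid(w_o[0]*o_h[0] + w_o[1]*o_h[1] + w_o[2])
            # Beta in the output node: d - o
            beta_o = d - y
            # Beta in hidden nodes: successor's beta, weighted and scaled
            beta_h = [w_o[j] * y * (1 - y) * beta_o for j in range(2)]
            # Weight changes: delta_w = r * o_i * o_j * (1 - o_j) * beta_j
            for j in range(2):
                g = r * o_h[j] * (1 - o_h[j]) * beta_h[j]
                w_h[j][0] += g * x1
                w_h[j][1] += g * x2
                w_h[j][2] += g                        # bias input fixed at 1
            g = r * y * (1 - y) * beta_o
            w_o = [w_o[0] + g*o_h[0], w_o[1] + g*o_h[1], w_o[2] + g]
            worst = max(worst, abs(d - y))
        if worst < tol:   # "good enough": every output within 0.1 of its target
            break
    return w_h, w_o

xor = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
train_xor(xor)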
Backprop Example
• Forward prop: compute z_i and y_i given x_k, w_l
[Figure: the example network (inputs x1, x2; hidden nodes z1, z2 with outputs y1, y2; output node z3 with output y3; -1 bias inputs; weights w01, w02, w03, w11, w12, w13, w21, w22, w23)]
Backpropagation Observations • Procedure is (relatively) efficient – All computations are local • Use inputs and outputs of current node • What is “good enough”? – Rarely reach target (0 or 1) outputs • Typically, train until within 0.1 of target
Neural Net Summary • Training: – Backpropagation procedure • Gradient descent strategy (usual problems) • Prediction: – Compute outputs based on input vector & weights • Pros: Very general, fast prediction • Cons: Training can be VERY slow (1000s of epochs), overfitting
Training Strategies • Online training: – Update weights after each sample • Offline (batch) training: – Compute error over all samples, then update weights • Online training is “noisy” – Sensitive to individual instances – However, may escape local minima (contrast sketched below)
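A hedged sketch of the two regimes in Python; grad is an assumed callable returning the gradient of the sum-of-squares error with respect to each weight on the given samples (all names mine):

def online_epoch(samples, weights, r, grad):
    # Online: one update per sample; noisy, but the noise may escape local minima
    for sample in samples:
        g = grad(weights, [sample])
        weights = [w - r * gi for w, gi in zip(weights, g)]
    return weights

def batch_epoch(samples, weights, r, grad):
    # Offline/batch: a single update from the error over all samples
    g = grad(weights, samples)
    return [w - r * gi for w, gi in zip(weights, g)]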
Training Strategy • To avoid overfitting: – Split data into training, validation, & test sets • Also, avoid excess weights (fewer weights than samples) • Initialize with small random weights – Small changes have noticeable effect • Use offline training – Until the validation-set error reaches its minimum (sketched below) • Evaluate on the test set – No more weight changes
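A sketch of that stopping rule, assuming callables train_epoch (one batch update) and error_on (error of the weights on a data split); the patience window for declaring the minimum is my addition, not from the slides:

def train_until_validation_min(train_epoch, error_on, train, valid, w, patience=20):
    best_w, best_err, since_best = w, error_on(valid, w), 0
    while since_best < patience:
        w = train_epoch(train, w)
        err = error_on(valid, w)
        if err < best_err:
            best_w, best_err, since_best = w, err, 0   # new validation minimum
        else:
            since_best += 1
    return best_w   # then evaluate once on the test set; no more weight changes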
Classification • Neural networks are best for classification tasks – Single output -> binary classifier – Multiple outputs -> multiway classification • Applied successfully to learning pronunciation • Sigmoid pushes outputs toward binary decisions – Not good for regression
Neural Net Example
• NETtalk: letter-to-sound rules learned by a neural net
• Inputs:
– Need context to pronounce a letter
– 7-letter window: predict the sound of the middle letter
– 29 possible characters (alphabet + space + comma + period)
– 7 * 29 = 203 inputs (encoding sketched below)
• 80 hidden nodes
• Output: generate 60 phones
– Nodes map to 26 units: 21 articulatory, 5 stress/syllable
– Vector quantization of acoustic space
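A sketch of the windowed input encoding described above: one unit per character per window position, 7 * 29 = 203 units in all (names and the exact character set are assumptions matching the slide, not NETtalk's precise feature scheme):

ALPHABET = "abcdefghijklmnopqrstuvwxyz ,."   # 26 letters + space + comma + period = 29

def encode_window(text, center):
    # One-of-29 encoding for each position in a 7-letter window around text[center];
    # positions past the ends of the word are padded with spaces
    window = [text[i] if 0 <= i < len(text) else " "
              for i in range(center - 3, center + 4)]
    units = []
    for ch in window:
        one_hot = [0] * len(ALPHABET)
        one_hot[ALPHABET.index(ch)] = 1
        units.extend(one_hot)
    return units   # 7 * 29 = 203 input values

assert len(encode_window("network", 3)) == 203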
Neural Net Example: NETtalk • Learning to talk: – 5 iterations over 1024 training words: word boundaries/stress – 10 iterations: intelligible – 400 new test words: 80% correct • Not as good as DECtalk, but automatic
Neural Net Conclusions • Simulation based on neurons in the brain • Perceptrons (single neuron) – Guaranteed to find a linear discriminant IF one exists -> problem: XOR • Neural nets (multi-layer perceptrons) – Very general – Backpropagation training procedure • Gradient descent: local minima and overfitting issues