CS 189
Brian Chu
brian.c@berkeley.edu
Slides at: brianchu.com/ml/
Office Hours: Cory 246, 6-7 pm Mon. (hackerspace lounge)
twitter: @brrrianchu
Agenda
• NEURAL NETS WOOOHOOO
Terminology
• Unit: each "neuron"
• 2-layer neural network: a neural network with one hidden layer (what you're building)
• Epoch: one pass through the entire training data
  – For SGD, this is N iterations
  – For mini-batch gradient descent (batch size B), this is N/B iterations
First off…
• Many of you will struggle to even finish.
• In which case you can ignore my bells and whistles.
• My 2.6 GHz quad-core 16 GB RAM MacBook takes ~1.5 hours to train to ~96-97%.
First off…
• Add a signal handler + snapshotting
• E.g. implement functionality where, if you press Ctrl-C (on Unix systems, this sends the interrupt signal), your code saves a snapshot of the training state (current iteration, decayed learning rate, momentum, current weights, anything else), then exits, as sketched below.
  – Look into the Python "signal" and "pickle" libraries.
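A minimal sketch of that idea; the `state` dict layout and the `snapshot.pkl` filename are illustrative, not part of the assignment:

```python
import pickle
import signal

# Training state that your loop keeps up to date (illustrative fields).
state = {"iteration": 0, "lr": 0.01, "momentum": None, "weights": None}

def save_snapshot(signum, frame):
    # Serialize everything needed to resume training later, then exit.
    with open("snapshot.pkl", "wb") as f:
        pickle.dump(state, f)
    raise SystemExit(0)

# Ctrl-C (SIGINT on Unix) now snapshots instead of killing the run outright.
signal.signal(signal.SIGINT, save_snapshot)
```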
Art of tuning
• Training neural nets is an art, not a science.
• Cross-validation? Pfffft.
• "I used to tune that parameter but I'm too lazy and I don't bother any more" – a grad student, on the weight-decay hyperparameter.
• There are way too many hyperparameters for you to tune.
• Training is too slow for you to bother using cross-validation.
• Many hyperparameters: just use what is standard and spend your time elsewhere.
Knobs
• Learning: SGD / mini-batch / full batch; momentum, RMSprop, Adagrad, NAG, etc.
  – How to decay the learning rate?
• Activations: ReLU, tanh, sigmoid
• Loss: MSE or cross-entropy (with softmax)
• Regularization: L1, L2, max-norm, Dropout, DropConnect
• Convolutional layers
• Initialization: Xavier, Gaussian, etc.
• When to stop? Early stop? Stopping rule? Or just run forever?
I recommend
(* = what everyone in the literature, in practice, uses)
• Cross-entropy loss with softmax *
• Only decay the learning rate per epoch (or even less often than once per epoch) *
  – (e.g. don't just divide by the iteration count)
  – Epoch = one training pass through the entire data
  – Only decay after a round of seeing every data point. Note: if your mini-batch size is 10 and N = 20, then one epoch is 2 iterations.
• Momentum (coefficient 0.7-0.9?) *
  – Maybe RMSProp?
• Mini-batch (size somewhere between 20-100) *
• No regularization
• Gaussian initialization (mean 0, std. dev. 0.01) *
• Run forever, take a snapshot when you feel like stopping (seriously!)
• (these defaults are combined in the sketch below)
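A hedged sketch putting these defaults together; `grad` and `sample_batch` are placeholders for your own backprop and batch sampler, and the shapes, epoch count, and decay factor are made up:

```python
import numpy as np

N, B = 60000, 50                              # training set size, mini-batch size
W = np.random.normal(0.0, 0.01, (784, 200))   # Gaussian init: mean 0, std. dev. 0.01
V = np.zeros_like(W)                          # momentum velocity
lr, mu = 0.1, 0.9                             # learning rate, momentum coefficient

for epoch in range(100):                      # "run forever" in spirit
    for _ in range(N // B):                   # one epoch = N/B iterations
        g = grad(W, sample_batch(B))          # gradient of cross-entropy/softmax loss
        V = mu * V - lr * g                   # momentum update
        W += V
    lr *= 0.95                                # decay once per epoch, not per iteration
```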
Activation functions
• tanh >>> sigmoid
  – (tanh is just a scaled, shifted sigmoid anyway: tanh(x) = 2σ(2x) - 1; quick check below)
• ReLU = stacked sigmoid
• ReLU is basically standard in computer vision
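To see the tanh/sigmoid relationship concretely, a two-line numerical check:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5.0, 5.0, 101)
# tanh is a rescaled, shifted sigmoid: tanh(x) = 2*sigmoid(2x) - 1
print(np.allclose(np.tanh(x), 2.0 * sigmoid(2.0 * x) - 1.0))  # True
```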
Almost certainly will improve accuracy, but total overkill
• Considered "standard" today:
  – Convolutional layers (with max-pooling)
  – Dropout (DropConnect?) (sketch below)
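If you do try dropout, a minimal sketch of the forward pass; this is the "inverted" dropout variant so nothing changes at test time, and `p` (the drop probability) is a value you'd tune:

```python
import numpy as np

def dropout_forward(h, p=0.5, train=True):
    if not train:
        return h  # inverted dropout: no rescaling needed at test time
    # Zero each hidden unit with probability p, rescale the survivors.
    mask = (np.random.rand(*h.shape) >= p) / (1.0 - p)
    return h * mask  # cache mask and apply it to the gradient in backprop
```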
If using numpy
• Not a single for-loop should be in your code.
• Avoid unnecessary memory allocation:
• Use the "out=" keyword argument to re-use numpy arrays (example below)
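For example (array names and shapes are illustrative):

```python
import numpy as np

A = np.random.rand(1000, 784)   # a mini-batch of inputs
W = np.random.rand(784, 200)    # first-layer weights
H = np.empty((1000, 200))       # preallocated once, reused every iteration

np.dot(A, W, out=H)             # writes into H instead of allocating a new array
np.maximum(H, 0.0, out=H)       # ReLU, in place
```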
May want to consider
• A faster implementation than Python w/ numpy:
• Cython, Java, Go, Julia, etc.
Honestly, if you want to win…
• (if you have a compatible graphics card) Write a CUDA or OpenCL implementation and train for many days.
  – (you might consider adding regularization in this case)
• I didn't do this: I used other generic tricks that you can read about in the literature.
Debugging
• Check your dimensions
• Check your numpy dtypes
• Check your derivatives
  – comment all your backprop steps
• Numerical gradient calculator:
  – https://github.com/pbrod/numdifftools
  – (or roll your own; sketch below)
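A central-difference gradient check you could roll yourself; `f` is assumed to be your loss as a function of one weight array (this is slow, so check only a few weights or a tiny network):

```python
import numpy as np

def numerical_grad(f, W, eps=1e-5):
    # Perturb one weight at a time; compare the result to your backprop gradient.
    g = np.zeros_like(W)
    it = np.nditer(W, flags=["multi_index"])
    while not it.finished:
        idx = it.multi_index
        orig = W[idx]
        W[idx] = orig + eps
        loss_plus = f(W)
        W[idx] = orig - eps
        loss_minus = f(W)
        W[idx] = orig                       # restore the weight
        g[idx] = (loss_plus - loss_minus) / (2.0 * eps)
        it.iternext()
    return g
```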
Connection with SVMs / linear classifiers with kernels
• A kernel SVM can be thought of as (sketch below):
  – 1st layer: |units| = |support vectors|; value of each unit i = K(query, train(i))
  – 2nd layer: linear combination of the first layer
• Simplest "training" for the 1st layer: store all training points as templates.
http://www.kdnuggets.com/2014/02/exclusive-yann-lecun-deep-learning-facebook-ai-lab.html
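A hedged sketch of that view, using an RBF kernel; `sv` (the stored training points), `alpha` (the learned coefficients), and `gamma` are made-up names:

```python
import numpy as np

def kernel_svm_score(query, sv, alpha, b=0.0, gamma=0.05):
    # "Layer 1": one unit per support vector, unit i = K(query, sv[i]).
    hidden = np.exp(-gamma * np.sum((sv - query) ** 2, axis=1))
    # "Layer 2": a linear combination of the hidden units.
    return hidden @ alpha + b
```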