Introduction to Neural Networks (cont.) Dr David Wong (With thanks to Dr Gari Clifford, G.I.T.)

The Multi-Layer Perceptron
• A single layer can only deal with linearly separable data
• Composed of many connected neurons
• Three general layers: input (i), hidden (j) and output (k)
• Signals are presented to each input ‘neuron’ or node
• Each signal is multiplied by a learned weighting factor (specific to each connection between each layer) …
• … and passed through a global activation function
• This is repeated in the output layer to map the hidden node values to the output
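
A minimal sketch of the forward pass described above, assuming a logistic activation and illustrative layer sizes (6 inputs, 2 hidden nodes, 1 output; the variable names are not from the slides):

    import numpy as np

    def logistic(a):
        # logistic (sigmoid) activation function
        return 1.0 / (1.0 + np.exp(-a))

    rng = np.random.default_rng(0)
    W_ij = rng.normal(size=(2, 6))    # learned weights, input layer (i) -> hidden layer (j)
    W_jk = rng.normal(size=(1, 2))    # learned weights, hidden layer (j) -> output layer (k)

    x = rng.normal(size=6)            # signals presented to the input nodes
    hidden = logistic(W_ij @ x)       # weighted sum at each hidden node, then activation
    output = logistic(W_jk @ hidden)  # repeated in the output layer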

Calculating Weights in an MLP
• Cannot take the ‘simple’ approach, as we need to take into account the ‘knock-on’ effect of multiple layers.
[Diagram: inputs x1,1-x6,1 feed two logistic hidden units (x1,2, x2,2), which feed the output Y]

Weight update as gradient descent
[Diagram: the same network - inputs x1,1-x6,1, logistic hidden units x1,2 and x2,2, output Y]

Weight update as gradient descent
[Diagram: the same network with its weights labelled - w1-w6 on the input-to-hidden connections, w7 and w8 on the hidden-to-output connections]
Worked example with numbers here: https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/

Weight update as gradient descent
h2 = w7·x1,2 + w8·x2,2 - we’ve already worked this out for the single layer
h1 = w1·x1,1 + w2·x2,1 + w3·x3,1
Worked example with numbers here: https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/

Weight update as gradient descent
h2 = w7·x1,2 + w8·x2,2 - we’ve already worked this out for the single layer
Most important thing to note: we have used the answer from the output layer to help us work out the weights for the next layer. Hence: backpropagation.
Worked example with numbers here: https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/
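
A rough sketch of one backpropagation step for a small logistic network of this shape (a 6-2-1 layout with squared-error loss is assumed here; see the mattmazur link above for a fully worked numerical example):

    import numpy as np

    def logistic(a):
        return 1.0 / (1.0 + np.exp(-a))

    rng = np.random.default_rng(0)
    W1 = rng.normal(size=(2, 6))   # input -> hidden weights (the w1..w6 of the diagram)
    W2 = rng.normal(size=(1, 2))   # hidden -> output weights (w7, w8)
    x, y, lr = rng.normal(size=6), 1.0, 0.5

    # forward pass
    hidden = logistic(W1 @ x)
    out = logistic(W2 @ hidden)

    # backward pass: error term at the output layer first...
    delta_out = (out - y) * out * (1 - out)
    grad_W2 = np.outer(delta_out, hidden)

    # ...then reuse that answer to get the hidden-layer error terms (backpropagation)
    delta_hidden = (W2.T @ delta_out) * hidden * (1 - hidden)
    grad_W1 = np.outer(delta_hidden, x)

    # gradient-descent weight updates
    W2 -= lr * grad_W2
    W1 -= lr * grad_W1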

MLP example - MNIST
• We will be using an MLP to predict digits in the MNIST dataset
• An MLP can achieve approximately 98% accuracy
http://scienceai.github.io/neocortex/mnist_mlp/
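
A rough equivalent of that demo using scikit-learn (the hyperparameters here are illustrative rather than tuned; accuracy in the high 90s % is typical):

    from sklearn.datasets import fetch_openml
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier

    # 70,000 28x28 digit images, flattened to 784 pixel values each
    X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
    X = X / 255.0
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=10000, random_state=0)

    mlp = MLPClassifier(hidden_layer_sizes=(128, 64), max_iter=20)
    mlp.fit(X_tr, y_tr)
    print("test accuracy:", mlp.score(X_te, y_te))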

Multilayer Perceptrons for clinical data
Neural network prediction of relapse in breast cancer patients, Tarassenko et al., 1996
• Goal: to predict relapse within 3 years
• Features: age, tumour size, no. of nodes, log ER, log EGFR
• Data: 350 patients
• Architecture: 5-N-1 (1 < N < 7) - i.e. 1 hidden layer
• Results: 72% classification accuracy
https://link.springer.com/content/pdf/10.1007/BF01413746.pdf
[Figure legend: marked points = relapse cases]

How many hidden layers?
• In theory, a neural network needs only one hidden layer to learn anything - i.e. to approximate any continuous function (Cybenko, 1989)
• In practice, networks with more layers do better
• Example: https://playground.tensorflow.org

Deep Learning
A Fast Learning Algorithm for Deep Belief Nets, Hinton, 2006
• Proportion of machine learning papers that contain the term ‘neural networks’:
  - Popular in the 90s
  - Dip in the 2000s
• Problem: backpropagation is tricky in highly connected neural nets
http://www.cs.toronto.edu/~fritz/absps/ncfast.pdf

Deep Learning and ImageNet
• The re-emergence of neural networks
• ImageNet - a large visual database for visual object recognition
• Classify 150K images into 1,000 categories (e.g. Egyptian cat, gazelle, wok, photocopier)
• 5 guesses allowed per picture
• In 2012, AlexNet, a ‘deep’ neural network, won by a huge margin (12% error)
• Current best is around 3% error
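
The ‘5 guesses allowed per picture’ rule is the usual top-5 error metric; a small sketch of how it can be computed (random scores and labels stand in for a real classifier):

    import numpy as np

    def top5_error(scores, labels):
        # scores: (n_images, n_classes); labels: (n_images,) true class indices
        top5 = np.argsort(scores, axis=1)[:, -5:]          # 5 highest-scoring classes per image
        hit = (top5 == labels[:, None]).any(axis=1)        # true class among the 5 guesses?
        return 1.0 - hit.mean()

    scores = np.random.rand(10, 1000)                      # 10 images, 1,000 categories
    labels = np.random.randint(0, 1000, size=10)
    print(top5_error(scores, labels))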

Deep Learning vs ‘shallow’ learning
• Deep learning uses the same building blocks as normal neural networks
• But many more layers!
[Diagram: example networks of increasing depth]

Why Deep Learning?
• If any classification function can be learned with 1 hidden layer, why do we need deep learning?
• No need to create features
  - E.g. in your assignment, the images get summarised as 30 pertinent numbers; in deep learning, we simply present the whole image (the array of pixel values) to the neural network
• It works better
  - Cybenko showed that 1 hidden layer was sufficient, but did not show (i) how many units are required or (ii) whether such a network can be trained
• (Potentially) simulates vision in a more human-like way: early layers correspond to primitive features (e.g. straight lines), late layers correspond to higher-level features (e.g. things that look like eyes)

AlexNet
• Uses ReLU (Rectified Linear Unit) rather than logistic units
• Heuristic dropout to selectively ignore neurons
• Overlapping max pooling
• Graphics Processing Units (GPUs)

Convolutional Neural Networks (simplified version of AlexNet)
• ReLU vs sigmoid:
  - Encourages sparsity: sigmoids tend towards, but never quite reach, zero
  - For high values of a = Wx + b, the sigmoid gradient diminishes towards zero (the so-called vanishing gradient); the ReLU gradient is constant
  - N.B. it is possible for too many units to go to zero, prohibiting learning
  - Quicker to compute: max(0, a)
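
A quick numerical illustration of the vanishing-gradient point (the input values are chosen arbitrarily):

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def relu(a):
        return np.maximum(0.0, a)            # quick to compute: max(0, a)

    a = np.array([-5.0, 0.0, 5.0, 50.0])
    s = sigmoid(a)
    print(s * (1 - s))                       # sigmoid gradient: shrinks towards zero for large |a|
    print((a > 0).astype(float))             # ReLU gradient: constant (1) wherever a > 0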

Convolutional Neural Networks
• Max-pooling:
  - A method of downsampling: takes the max value in a local neighbourhood
  - Overlapping max-pooling means that the neighbourhoods overlap (e.g. the selected area in red)
  - The effect is to ‘blur’ the image while keeping the pertinent structure - this makes computation faster
  - Each successive layer looks at a ‘bigger’ picture
[Figure: overlapping 3x3 max-pooling example]
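
A small sketch of overlapping max-pooling on a toy image (window and stride values are illustrative; ‘overlapping’ simply means the stride is smaller than the window):

    import numpy as np

    def max_pool(img, size=3, stride=2):
        h, w = img.shape
        rows = range(0, h - size + 1, stride)
        cols = range(0, w - size + 1, stride)
        # keep only the largest value in each (overlapping) size x size neighbourhood
        return np.array([[img[i:i+size, j:j+size].max() for j in cols] for i in rows])

    img = np.random.randint(0, 128, size=(7, 7))
    print(max_pool(img))                     # downsampled 3x3 map of the 7x7 input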

Convolutional Neural Networks
• Convolution
  - Formally, in 2D: (f * g)(x, y) = Σ_m Σ_n f(m, n) · g(x - m, y - n)
  - Broadly equivalent to applying an image filter
  - In practice, treat the convolution mask as another set of parameters to be learned and bundle it into the back-propagation; the NN learns a ‘good’ mask
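
A minimal example of the ‘image filter’ view, using SciPy’s 2-D convolution (the mask below is a hand-designed edge filter; in a CNN its entries would instead be learned weights):

    import numpy as np
    from scipy.signal import convolve2d

    image = np.random.rand(8, 8)
    mask = np.array([[ 1,  0, -1],
                     [ 2,  0, -2],
                     [ 1,  0, -1]])          # vertical-edge filter, for illustration only
    feature_map = convolve2d(image, mask, mode="valid")
    print(feature_map.shape)                 # (6, 6): one response per mask position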

Convolution

CNN architecture
1. Try multiple convolution masks to generate features (initialise these randomly)
2. Use max pooling to ‘shrink’ the image
3. Repeat - this has the effect of creating hierarchical features
4. Candidate features are now put into a ‘normal’ feed-forward network
5. Softmax is used for multi-class classification
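
A rough sketch of steps 1-5 in Keras (the layer counts and sizes are illustrative assumptions, sized for 28x28 single-channel images, not AlexNet’s actual configuration):

    from tensorflow import keras
    from tensorflow.keras import layers

    model = keras.Sequential([
        layers.Input(shape=(28, 28, 1)),
        layers.Conv2D(32, (3, 3), activation="relu"),   # 1) learnable convolution masks
        layers.MaxPooling2D((2, 2)),                    # 2) max pooling 'shrinks' the image
        layers.Conv2D(64, (3, 3), activation="relu"),   # 3) repeat -> hierarchical features
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),           # 4) 'normal' feed-forward layer
        layers.Dense(10, activation="softmax"),         # 5) softmax for multi-class output
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])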

Convolutional Neural Networks
• Softmax: an extension of the logistic model
• In a standard logistic unit:
  - Compute the logit (ax + b)
  - Apply the logistic function
  - Threshold to class 1 or 0
• For softmax:
  - Compute the logit of each class
  - Softmax computes the relative probability of each class
• Used for multi-class classification
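
A minimal softmax in NumPy (the three logits below are made up for illustration):

    import numpy as np

    def softmax(logits):
        z = logits - logits.max()            # subtract the max for numerical stability
        e = np.exp(z)
        return e / e.sum()                   # relative probability of each class; sums to 1

    logits = np.array([2.0, 1.0, 0.1])       # one logit (ax + b) per class
    print(softmax(logits))
    print(softmax(logits).argmax())          # predicted class = the most probable one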

Example – identifying hands for Parkinson’s diagnosis
• Finger-tapping test for diagnosing slow movement (bradykinesia) in Parkinson’s patients
• Get patients to tap their fingers as wide and as fast as possible
• A CNN is used to separate the hand from the background

Example – identifying hands for Parkinson’s diagnosis
• Then convert each frame of the video into a single number (representing how fast the hand is moving)
• Generate features -> reduce dimensions -> classify
• Figure shows normal (blue) vs abnormal (red) patients

Generative Adversarial Networks
• Basically, two coupled neural networks:
  - Generator: initially generates a random signal (or image), which feeds into…
  - Discriminator: a pre-trained network (e.g. one that recognises cats)
• The generator’s weights are updated based on how well/poorly it fools the discriminator
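
A bare-bones sketch of the two coupled networks in PyTorch (the usual formulation trains both networks together rather than fixing a pre-trained discriminator; the sizes, learning rates and toy 1-D ‘real’ data here are illustrative assumptions):

    import torch
    import torch.nn as nn

    G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))                 # generator
    D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())   # discriminator
    opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
    opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
    bce = nn.BCELoss()

    for step in range(1000):
        real = torch.randn(64, 1) * 0.5 + 3.0        # 'real' samples from a toy distribution
        fake = G(torch.randn(64, 8))                 # generator output from random noise

        # discriminator update: label real samples 1, generated samples 0
        d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
        opt_d.zero_grad(); d_loss.backward(); opt_d.step()

        # generator update: based on how well/poorly it fools the discriminator
        g_loss = bce(D(fake), torch.ones(64, 1))
        opt_g.zero_grad(); g_loss.backward(); opt_g.step()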

Example: https://thispersondoesnotexist.com/
• Neural network generator -> produces new samples from scratch
• Neural network discriminator -> classifies each face as ‘real’ or ‘fake’
• If you want to see whether you can do better than the GAN: http://www.whichfaceisreal.com/