Network Structure Hung-yi Lee 李宏毅

Three Steps for Deep Learning
Step 1: Neural Network. A neural network is a function composed of simple functions (neurons). Usually we design the network structure and let the machine find the parameters from data.
Step 2: Cost Function. The cost function evaluates how good a set of parameters is; we design it based on the task.
Step 3: Optimization. Find the best function from the set (e.g. by gradient descent).
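As a rough illustration of the three steps, here is a minimal numpy sketch (the toy data, the two-layer structure, the squared-error cost, and the learning rate are all illustrative choices, not from the slides):

```python
import numpy as np

# Step 1: network structure = one fully connected hidden layer + linear output.
# Step 2: cost function = mean squared error.
# Step 3: optimization = plain gradient descent.

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))              # toy inputs
Y = np.sin(X.sum(axis=1, keepdims=True))   # toy targets

W1, b1 = rng.normal(size=(3, 8)) * 0.1, np.zeros(8)
W2, b2 = rng.normal(size=(8, 1)) * 0.1, np.zeros(1)
lr = 0.1

for step in range(1000):
    # Step 1: forward pass through the chosen structure
    z1 = X @ W1 + b1
    a1 = np.tanh(z1)
    y_hat = a1 @ W2 + b2
    # Step 2: the cost evaluates the current parameters
    cost = np.mean((y_hat - Y) ** 2)
    # Step 3: gradient descent on all parameters
    d_y = 2 * (y_hat - Y) / len(X)
    dW2, db2 = a1.T @ d_y, d_y.sum(axis=0)
    d_a1 = d_y @ W2.T
    d_z1 = d_a1 * (1 - a1 ** 2)
    dW1, db1 = X.T @ d_z1, d_z1.sum(axis=0)
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2
```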

Outline
• Basic structure (3/03)
  • Fully Connected Layer
  • Recurrent Structure
  • Convolutional/Pooling Layer
• Special Structure (3/17)
  • Spatial Transformation Layer
  • Highway Network / Grid LSTM
  • Recursive Structure
  • External Memory
  • Batch Normalization
• Sequence-to-sequence / Attention (3/24)

Prerequisite
• Brief Introduction of Deep Learning
  https://youtu.be/Dr-WRlEFefw?list=PLJV_el3uVTsPy9oCRY30oBPNLCo89yu49
• Convolutional Neural Network
  https://youtu.be/FrKWiRv254g?list=PLJV_el3uVTsPy9oCRY30oBPNLCo89yu49
• Recurrent Neural Network (Part I)
  https://youtu.be/xCGidAeyS4M?list=PLJV_el3uVTsPy9oCRY30oBPNLCo89yu49
• Recurrent Neural Network (Part II)
  https://www.youtube.com/watch?v=rTqmWlnwz_0&list=PLJV_el3uVTsPy9oCRY30oBPNLCo89yu49&index=25

Basic Structure: Fully Connected Layer

Fully Connected Layer
Output of a neuron: $a_i^l$, the output of neuron $i$ at layer $l$ (layer $l$ has $N_l$ nodes).
Output of one layer: a vector $a^l = [a_1^l, a_2^l, \dots, a_{N_l}^l]^T$.

Fully Connected Layer
$w_{ij}^l$: the weight from neuron $j$ (layer $l-1$) to neuron $i$ (layer $l$).
$W^l$: the matrix of all weights from layer $l-1$ to layer $l$ (an $N_l \times N_{l-1}$ matrix).

Fully Connected Layer
$b_i^l$: bias for neuron $i$ at layer $l$.
$b^l$: the vector of biases for all neurons in layer $l$.

Fully Connected Layer
$z_i^l$: input of the activation function for neuron $i$ at layer $l$.
$z^l$: inputs of the activation function for all the neurons in layer $l$, collected as a vector.

Relations between Layer Outputs
$z_i^l = \sum_j w_{ij}^l a_j^{l-1} + b_i^l$, or in matrix form $z^l = W^l a^{l-1} + b^l$.
$a_i^l = \sigma(z_i^l)$, or in vector form $a^l = \sigma(z^l)$.
Putting these together: $a^l = \sigma(W^l a^{l-1} + b^l)$.
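A minimal numpy sketch of these relations (the function name and the choice of tanh as the activation $\sigma$ are illustrative):

```python
import numpy as np

def fully_connected(a_prev, W, b, sigma=np.tanh):
    """One fully connected layer: a^l = sigma(W^l a^{l-1} + b^l).

    a_prev: output of layer l-1, shape (N_{l-1},)
    W:      weight matrix of layer l, shape (N_l, N_{l-1})
    b:      bias of layer l, shape (N_l,)
    """
    z = W @ a_prev + b   # input of the activation function, z^l
    return sigma(z)      # output of layer l, a^l

# Example: layer l-1 has 4 nodes, layer l has 3 nodes
rng = np.random.default_rng(0)
a_prev = rng.normal(size=4)
W, b = rng.normal(size=(3, 4)), np.zeros(3)
a_l = fully_connected(a_prev, W, b)
```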

Basic Structure: Recurrent Structure Simplify the network by using the same function again and again

Reference
K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, J. Schmidhuber, "LSTM: A Search Space Odyssey," IEEE Transactions on Neural Networks and Learning Systems, 2016
Rafal Józefowicz, Wojciech Zaremba, Ilya Sutskever, "An Empirical Exploration of Recurrent Network Architectures," ICML, 2015
https://www.cs.toronto.edu/~graves/preprint.pdf

Recurrent Neural Network
A single function $f$ maps $(h, x)$ to $(h', y)$, where $h$ and $h'$ are vectors with the same dimension:
$(h^1, y^1) = f(h^0, x^1)$, $(h^2, y^2) = f(h^1, x^2)$, $(h^3, y^3) = f(h^2, x^3)$, ......
No matter how long the input/output sequence is, we only need one function $f$.
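A sketch of this idea: the same step function is reused at every time step, whatever the sequence length (the toy update and output below are placeholders, not a real RNN cell):

```python
import numpy as np

def f(h, x):
    """Any recurrent step (h', y) = f(h, x); h' and h have the same dimension."""
    h_new = np.tanh(h + x)   # toy update; real cells (naive RNN / LSTM / GRU) come later
    y = h_new.sum()          # toy output
    return h_new, y

h = np.zeros(4)                          # h0
xs = [np.ones(4) * t for t in range(3)]  # x1, x2, x3 (any length works)
ys = []
for x in xs:                             # the same f is applied again and again
    h, y = f(h, x)
    ys.append(y)
```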

Deep RNN
Stack several recurrent functions: the first layer computes $(h^t, y^t) = f_1(h^{t-1}, x^t)$, and the second layer takes $y^t$ as its input, $(b^t, c^t) = f_2(b^{t-1}, y^t)$, and so on. Each layer has its own function, applied at every time step.

Bidirectional RNN
$f_1$ reads the input sequence forward ($x^1, x^2, x^3, \dots$) and produces $a^1, a^2, a^3, \dots$; $f_2$ reads it backward and produces $c^1, c^2, c^3, \dots$; $f_3$ combines $a^t$ and $c^t$ into the output $y^t$, so each $y^t$ can depend on the whole sequence.
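A sketch of the bidirectional structure (the function names f1, f2, f3 follow the slide; the cell internals here are toy placeholders):

```python
import numpy as np

def f1(h, x): h = np.tanh(h + x); return h, h        # forward step,  output a^t
def f2(b, x): b = np.tanh(b - x); return b, b        # backward step, output c^t
def f3(a, c): return np.concatenate([a, c])          # combine into y^t

xs = [np.ones(4) * t for t in range(3)]              # x1, x2, x3

h, a_list = np.zeros(4), []
for x in xs:                                          # f1 reads the sequence forward
    h, a = f1(h, x)
    a_list.append(a)

b, c_list = np.zeros(4), []
for x in reversed(xs):                                # f2 reads the sequence backward
    b, c = f2(b, x)
    c_list.append(c)
c_list.reverse()                                      # align c^t with time step t

ys = [f3(a, c) for a, c in zip(a_list, c_list)]       # y^t depends on the whole sequence
```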

Pyramidal RNN
• Reducing the number of time steps
W. Chan, N. Jaitly, Q. Le and O. Vinyals, "Listen, Attend and Spell: A Neural Network for Large Vocabulary Conversational Speech Recognition," ICASSP, 2016
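One common way to reduce the number of time steps between layers is to merge adjacent hidden states, e.g. by concatenation; a sketch of that idea (the merging rule and padding choice here are assumptions, implementations differ):

```python
import numpy as np

def reduce_time(states):
    """states: list of T vectors -> list of about T/2 vectors, with adjacent pairs concatenated."""
    merged = []
    for t in range(0, len(states) - 1, 2):
        merged.append(np.concatenate([states[t], states[t + 1]]))
    if len(states) % 2 == 1:                       # leftover frame if T is odd
        merged.append(np.concatenate([states[-1], np.zeros_like(states[-1])]))
    return merged

hidden = [np.ones(4) * t for t in range(8)]        # 8 time steps from the lower layer
print(len(reduce_time(hidden)))                    # 4 time steps go into the upper layer
```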

Naïve RNN
$h' = \sigma(W_h h + W_i x)$
$y = \mathrm{softmax}(W_o h')$
(Ignore bias here)
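A direct numpy transcription of these two equations (choosing tanh as $\sigma$ is an assumption; the slide leaves the activation unspecified):

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def naive_rnn_step(h, x, Wh, Wi, Wo):
    """Naive RNN step: h' = sigma(Wh h + Wi x), y = softmax(Wo h'); bias ignored as in the slide."""
    h_new = np.tanh(Wh @ h + Wi @ x)   # sigma chosen as tanh here
    y = softmax(Wo @ h_new)
    return h_new, y
```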

LSTM vs. Naive RNN
A naive RNN cell passes only $h$ between time steps; an LSTM cell passes both $c$ and $h$.
c changes slowly: $c^t$ is $c^{t-1}$ added by something.
h changes faster: $h^t$ and $h^{t-1}$ can be very different.

LSTM
From $x^t$ and $h^{t-1}$ (concatenated), compute the block input $z$ and the gates $z^i$, $z^f$, $z^o$ with weight matrices $W$, $W^i$, $W^f$, $W^o$ respectively.

LSTM with "peephole" connections
$c^{t-1}$ is also used as input when computing $z$, $z^i$, $z^f$, $z^o$ (the weights applied to $c^{t-1}$ are diagonal); apart from the extended input, $z^i$, $z^f$, $z^o$ are obtained in the same way.

LSTM cell update
$c^t = z^f \odot c^{t-1} + z^i \odot z$
$h^t = z^o \odot \tanh(c^t)$
$y^t$ is then computed from $h^t$.

LSTM (unrolled over time)
The same cell is applied at consecutive time steps: both $c^t$ and $h^t$ are passed, together with $x^{t+1}$, to the next step, producing $y^t$, $y^{t+1}$, ......
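Collecting the pieces above, a minimal numpy sketch of one LSTM step (peephole connections omitted, biases ignored as in the slides; the output weight Wy is an assumption):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def lstm_step(c_prev, h_prev, x, W, Wi, Wf, Wo, Wy):
    """One LSTM step: gate names mirror the slides."""
    xh = np.concatenate([x, h_prev])      # the cell sees x^t and h^{t-1}
    z  = np.tanh(W @ xh)                  # block input
    zi = sigmoid(Wi @ xh)                 # input gate
    zf = sigmoid(Wf @ xh)                 # forget gate
    zo = sigmoid(Wo @ xh)                 # output gate
    c = zf * c_prev + zi * z              # c changes slowly: old c plus something
    h = zo * np.tanh(c)                   # h can change a lot between steps
    y = Wy @ h
    return c, h, y

# Usage: the same cell at every time step
rng = np.random.default_rng(0)
dx, dh, dy = 3, 4, 2
W, Wi, Wf, Wo = (rng.normal(size=(dh, dx + dh)) for _ in range(4))
Wy = rng.normal(size=(dy, dh))
c, h = np.zeros(dh), np.zeros(dh)
for x in [rng.normal(size=dx) for _ in range(3)]:
    c, h, y = lstm_step(c, h, x, W, Wi, Wf, Wo, Wy)
```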

GRU
From $x^t$ and $h^{t-1}$, compute a reset gate $r$ and an update gate $z$. A candidate state $h'$ is computed from $x^t$ and $r \odot h^{t-1}$, and the new state $h^t$ interpolates between $h^{t-1}$ and $h'$ with weights $z$ and $1-z$; $y^t$ is computed from $h^t$.
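A sketch of one GRU step (a common formulation; whether the update gate weights the old state or the candidate differs between write-ups, and the weight names and tanh here are assumptions):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gru_step(h_prev, x, Wr, Wz, Wh, Wy):
    """One GRU step; biases ignored."""
    xh = np.concatenate([x, h_prev])
    r = sigmoid(Wr @ xh)                                     # reset gate
    z = sigmoid(Wz @ xh)                                     # update gate
    h_cand = np.tanh(Wh @ np.concatenate([x, r * h_prev]))   # candidate h'
    h = z * h_prev + (1 - z) * h_cand                        # interpolate old state and candidate
    y = Wy @ h
    return h, y
```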

Example Task
• (Simplified) Speech Recognition: frame classification on TIMIT
Each utterance is a sequence of acoustic frames $x^1, x^2, x^3, x^4, \dots$, and the network outputs a phone label $y^1, y^2, y^3, y^4, \dots$ for every frame.

Target Delay
• Only for unidirectional RNN
Delay the true labels by a few steps (e.g. 3 steps), so the network has already seen a few future frames before it has to emit each label.

LSTM > RNN > feedforward; bi-directional > uni-directional

(Figure: outputs in the forward direction vs. the reverse direction)

Training: LSTM learns faster than RNN

LSTM: A Search Space Odyssey
• Standard LSTM works well
• Simplified LSTM variants (coupling the input and forget gates, removing peephole connections) perform comparably
• The forget gate is critical for performance
• The output gate activation function is critical

An Empirical Exploration of Recurrent Network Architectures
• LSTM-f/i/o: removing the forget/input/output gate
• LSTM-b: adding a large bias to the forget gate
• Importance: forget > input > output
• A large forget-gate bias is helpful

Basic Structure: Convolutional / Pooling Layer
Simplify the neural network (based on prior knowledge about the task)

Convolutional Layer
Sparse Connectivity: each neuron only connects to part of the output of the previous layer (its receptive field). Different neurons have different, but overlapping, receptive fields.

Convolutional Layer
Sparse Connectivity: each neuron only connects to part of the output of the previous layer.
Parameter Sharing: neurons with different receptive fields can use the same set of parameters.
Fewer parameters than a fully connected layer.

Convolutional Layer
Consider neurons 1 and 3 as "filter 1" (kernel 1) and neurons 2 and 4 as "filter 2" (kernel 2).
Filter (kernel) size: the size of the receptive field of a neuron. Stride = 2 in this example.
Kernel size, number of filters, and stride are all designed by the developers.
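A sketch of one 1-D filter applied with a stride (the function name and the toy signal are illustrative):

```python
import numpy as np

def conv1d(x, kernel, stride=2):
    """x: 1-D input signal; kernel: shared weights of one filter; one output per position."""
    k = len(kernel)
    outputs = []
    for start in range(0, len(x) - k + 1, stride):
        receptive_field = x[start:start + k]             # this neuron only sees part of the input
        outputs.append(np.dot(kernel, receptive_field))  # same parameters at every position
    return np.array(outputs)

x = np.arange(8, dtype=float)           # toy input of length 8
filter_1 = np.array([1.0, 0.0, -1.0])   # kernel size 3
print(conv1d(x, filter_1, stride=2))    # outputs of "filter 1" at stride 2
```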

Example – 1-D Signal + Single Channel: classification, predicting the future … (e.g. audio signal, stock value …)

Example – 1-D Signal + Multiple Channels: a document, where each word is a vector ("I like this movie very much …"). Does this kind of receptive field make sense?

Example – 2-D Signal + Single Channel: a 6 x 6 black & white image; the receptive field is 3 x 3 and the stride is 1 (only one filter is shown on the slide).

Example – 2-D Signal + Multiple Channels: a 6 x 6 color image has several channels, so the receptive field is 3 x 3 x 3; the stride is 1 (only one filter is shown on the slide).
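A sketch of a single 2-D filter over a multi-channel image (function name and toy data are illustrative):

```python
import numpy as np

def conv2d_single_filter(image, kernel, stride=1):
    """image: (H, W, C); kernel: (kh, kw, C); returns a 2-D map of filter responses."""
    H, W, C = image.shape
    kh, kw, _ = kernel.shape
    out = []
    for i in range(0, H - kh + 1, stride):
        row = []
        for j in range(0, W - kw + 1, stride):
            patch = image[i:i + kh, j:j + kw, :]      # receptive field spans all channels
            row.append(np.sum(patch * kernel))        # same kernel everywhere: parameter sharing
        out.append(row)
    return np.array(out)

image = np.random.default_rng(0).random((6, 6, 3))    # 6 x 6 color image
kernel = np.random.default_rng(1).random((3, 3, 3))   # 3 x 3 x 3 receptive field
print(conv2d_single_filter(image, kernel).shape)      # (4, 4) without zero padding
```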

Without Zero Padding, the output map is smaller than the input (e.g. a 6 x 6 input with a 3 x 3 kernel and stride 1 gives a 4 x 4 output).

Pooling Layer
A group of layer $l-1$ outputs is reduced to a single layer $l$ node:
Average pooling: the mean of the grouped outputs.
Max pooling: the maximum of the grouped outputs.
L2 pooling: the L2 norm of the grouped outputs.
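The three pooling operations on one group of values, as a short numpy sketch:

```python
import numpy as np

group = np.array([0.2, -0.5, 0.9, 0.1])   # outputs of the grouped layer l-1 neurons

avg_pool = group.mean()                   # average pooling
max_pool = group.max()                    # max pooling
l2_pool  = np.sqrt(np.sum(group ** 2))    # L2 pooling (the L2 norm of the group)
```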

Pooling Layer
After a convolutional layer, which outputs should be grouped together?
Subsampling: group the neurons corresponding to the same filter with nearby receptive fields.

Pooling Layer
After a convolutional layer, which outputs should be grouped together? How do you know which neurons detect the same pattern?
Maxout network: group the neurons with the same receptive field.

Combination of Different Basic Layers
Tara N. Sainath, Ron J. Weiss, Andrew Senior, Kevin W. Wilson, Oriol Vinyals, "Learning the Speech Front-End with Raw Waveform CLDNNs," INTERSPEECH 2015

Next Time
• 3/10: TAs will teach TensorFlow
  • TensorFlow for regression
  • TensorFlow for word vectors
    • word vector: https://www.youtube.com/watch?v=X7PH3NuYW0Q
  • TensorFlow for CNN
• If you want to learn Theano:
  • http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture/Theano%20DNN.ecm.mp4/index.html
  • http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture/Theano%20RNN.ecm.mp4/index.html