CS 4803 / 7643: Deep Learning. Topics: Recurrent Neural Networks (RNNs), BackProp Through Time (BPTT). Dhruv Batra, Georgia Tech

Administrivia • HW3 released – Due: 11/06, 11:55pm – Last HW; focus on projects after this – https://www.cc.gatech.edu/classes/AY2019/cs7643_fall/assets/hw3.pdf (C) Dhruv Batra 2

Plan for Today • Model – Recurrent Neural Networks (RNNs) • Learning – BackProp Through Time (BPTT) (C) Dhruv Batra 3

New Topic: RNNs (C) Dhruv Batra Image Credit: Andrej Karpathy 4

New Words • Recurrent Neural Networks (RNNs) • Recursive Neural Networks – General family; think graphs instead of chains • Types: – “Vanilla” RNNs (Elman networks) – Long Short-Term Memory (LSTMs) – Gated Recurrent Units (GRUs) – … • Algorithms: – BackProp Through Time (BPTT) – BackProp Through Structure (BPTS) (C) Dhruv Batra 5

What’s wrong with MLPs? • Problem 1: Can’t model sequences – Fixed-sized Inputs & Outputs – No temporal structure • Problem 2: Pure feed-forward processing – No “memory”, no feedback (C) Dhruv Batra Image Credit: Alex Graves, book 7

Why model sequences? Figure Credit: Carlos Guestrin

Why model sequences? (C) Dhruv Batra Image Credit: Alex Graves 9

Sequences are everywhere… (C) Dhruv Batra Image Credit: Alex Graves and Kevin Gimpel 10

Even where you might not expect a sequence… (C) Dhruv Batra Image Credit: Vinyals et al. 11

Even where you might not expect a sequence… Classify images by taking a series of “glimpses”. Ba, Mnih, and Kavukcuoglu, “Multiple Object Recognition with Visual Attention”, ICLR 2015. Gregor et al., “DRAW: A Recurrent Neural Network For Image Generation”, ICML 2015. Figure copyright Karol Gregor, Ivo Danihelka, Alex Graves, Danilo Jimenez Rezende, and Daan Wierstra, 2015. Reproduced with permission. Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS231n

Even where you might not expect a sequence… • Output ordering = sequence (C) Dhruv Batra Image Credit: Ba et al.; Gregor et al. 13

(C) Dhruv Batra Image Credit: [Pinheiro and Collobert, ICML 14] 14

Sequences in Input or Output? • It’s a spectrum… – Input: no sequence, Output: no sequence. Example: “standard” classification / regression problems – Input: no sequence, Output: sequence. Example: Im2Caption – Input: sequence, Output: no sequence. Example: sentence classification, multiple-choice question answering – Input: sequence, Output: sequence. Example: machine translation, video classification, video captioning, open-ended question answering (C) Dhruv Batra Image Credit: Andrej Karpathy 15-18

2 Key Ideas • Parameter Sharing – in computation graphs = adding gradients (C) Dhruv Batra 19

Computational Graph (C) Dhruv Batra Slide Credit: Marc'Aurelio Ranzato 20

Gradients add at branches. Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS231n
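To make the “gradients add at branches” rule concrete, here is a minimal sketch (ours, not from the slides): a single variable w feeds two branches of a scalar computation, so during backprop the gradient flowing into w is the sum of the gradients from both paths. The function is made up for illustration and checked against a numerical gradient.

```python
import numpy as np

# y = (2*w) + (w**2): the node w branches into two downstream paths.
def forward(w):
    a = 2.0 * w        # branch 1
    b = w ** 2         # branch 2
    return a + b

def backward(w):
    da = 1.0                      # dL/da (upstream gradient is 1)
    db = 1.0                      # dL/db
    dw_from_a = da * 2.0          # gradient along branch 1
    dw_from_b = db * 2.0 * w      # gradient along branch 2
    return dw_from_a + dw_from_b  # gradients from the two branches ADD at w

w = 3.0
analytic = backward(w)                                    # 2 + 2*w = 8
eps = 1e-6
numeric = (forward(w + eps) - forward(w - eps)) / (2 * eps)
print(analytic, numeric)                                  # both approximately 8.0
```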

2 Key Ideas • Parameter Sharing – in computation graphs = adding gradients • “Unrolling” – in computation graphs with parameter sharing (C) Dhruv Batra 24

How do we model sequences? • No input (C) Dhruv Batra Image Credit: Bengio, Goodfellow, Courville 26

How do we model sequences? • With inputs (C) Dhruv Batra Image Credit: Bengio, Goodfellow, Courville 28

2 Key Ideas • Parameter Sharing – in computation graphs = adding gradients • “Unrolling” – in computation graphs with parameter sharing • Parameter sharing + Unrolling – Allows modeling arbitrary sequence lengths! – Keeps the number of parameters in check (C) Dhruv Batra 29
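A quick sanity check of that last point (a sketch of ours, not from the slides): a vanilla RNN's parameter count depends only on the input and hidden sizes, never on how many time steps we unroll. The sizes below are made up for illustration.

```python
import numpy as np

D, H = 10, 64                     # input size, hidden size (illustrative)
Wxh = np.random.randn(H, D)       # input-to-hidden weights
Whh = np.random.randn(H, H)       # hidden-to-hidden weights (shared across time)
b   = np.zeros(H)

n_params = Wxh.size + Whh.size + b.size
print(n_params)                   # same count whether we unroll 5 steps or 5000

for T in (5, 5000):               # unrolling length changes compute, not parameters
    h = np.zeros(H)
    for t in range(T):
        x = np.random.randn(D)
        h = np.tanh(Wxh @ x + Whh @ h + b)
```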

Recurrent Neural Network: an input x enters the RNN block, and we usually want to predict an output vector y at some time steps. Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS231n

Recurrent Neural Network: We can process a sequence of vectors x by applying a recurrence formula at every time step: h_t = f_W(h_{t-1}, x_t), where h_t is the new state, h_{t-1} is the old state, x_t is the input vector at some time step, and f_W is some function with parameters W. Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS231n

Recurrent Neural Network: We can process a sequence of vectors x by applying a recurrence formula at every time step. Notice: the same function and the same set of parameters are used at every time step. Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS231n

(Vanilla) Recurrent Neural Network: The state consists of a single “hidden” vector h. Sometimes called a “Vanilla RNN” or an “Elman RNN” after Prof. Jeffrey Elman. Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS231n
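The slide's figure carries the actual update; as a hedged sketch, the standard vanilla/Elman choice of f_W (the one used in the CS231n material this slide is drawn from) is h_t = tanh(W_hh h_{t-1} + W_xh x_t), with an output y_t = W_hy h_t. In numpy, with illustrative sizes and initialization of our own:

```python
import numpy as np

D, H, V = 4, 3, 4                        # input dim, hidden dim, output dim (illustrative)
rng = np.random.default_rng(0)
Wxh = rng.standard_normal((H, D)) * 0.01
Whh = rng.standard_normal((H, H)) * 0.01
Why = rng.standard_normal((V, H)) * 0.01

def rnn_step(h_prev, x):
    """One application of the recurrence h_t = f_W(h_{t-1}, x_t) for a vanilla RNN."""
    h = np.tanh(Whh @ h_prev + Wxh @ x)  # new hidden state
    y = Why @ h                          # output (e.g., unnormalized scores)
    return h, y

h = np.zeros(H)                          # initial state h_0
x = rng.standard_normal(D)               # one input vector
h, y = rnn_step(h, x)
```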

RNN: Computational Graph: unroll the recurrence over time, re-using the same weight matrix W at every time-step: h_0 → f_W(h_0, x_1) = h_1 → f_W(h_1, x_2) = h_2 → f_W(h_2, x_3) = h_3 → … → h_T. Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS231n
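A minimal unrolled forward pass matching this graph (a sketch of ours: the update inlines the same vanilla f_W as the previous snippet, and all sizes are made up):

```python
import numpy as np

D, H, T = 4, 3, 5                          # input dim, hidden dim, sequence length (illustrative)
rng = np.random.default_rng(0)
Wxh = rng.standard_normal((H, D)) * 0.01   # the SAME weights are reused at every step
Whh = rng.standard_normal((H, H)) * 0.01

xs = [rng.standard_normal(D) for _ in range(T)]   # x_1 ... x_T
hs = [np.zeros(H)]                                # h_0
for x in xs:                                      # unrolled graph: h_t = f_W(h_{t-1}, x_t)
    hs.append(np.tanh(Whh @ hs[-1] + Wxh @ x))
h_T = hs[-1]
```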

RNN: Computational Graph: Many to Many: an output y_t (and a per-step loss L_t) at every time step; the total loss is L = L_1 + L_2 + L_3 + … + L_T. Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS231n
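A sketch of the many-to-many case (names and sizes are ours, and softmax cross-entropy is one common choice of per-step loss, not something the slide prescribes): compute y_t from every h_t, score it against a target, and sum the per-step losses into L.

```python
import numpy as np

D, H, V, T = 4, 3, 4, 5                     # dims and sequence length (illustrative)
rng = np.random.default_rng(0)
Wxh = rng.standard_normal((H, D)) * 0.01
Whh = rng.standard_normal((H, H)) * 0.01
Why = rng.standard_normal((V, H)) * 0.01

xs = [rng.standard_normal(D) for _ in range(T)]
targets = rng.integers(0, V, size=T)        # one class label per time step

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

h, total_loss = np.zeros(H), 0.0
for x, t in zip(xs, targets):
    h = np.tanh(Whh @ h + Wxh @ x)          # shared weights at every step
    p = softmax(Why @ h)                    # y_t turned into probabilities
    total_loss += -np.log(p[t])             # L_t; total loss L = sum over t of L_t
print(total_loss)
```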

RNN: Computational Graph: Many to One: a single output y, produced from the final hidden state h_T after the whole input sequence x_1 … x_T has been read. Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS231n

RNN: Computational Graph: One to Many: a single input x, and an output y_t at every step of the unrolled graph h_1 … h_T. Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS231n

Sequence to Sequence: Many-to-one + one-to-many. Many to one: encode the input sequence x_1 … x_T in a single vector (an encoder RNN with weights W_1). One to many: produce the output sequence from that single input vector (a decoder RNN with weights W_2). Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS231n
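A compact encoder/decoder sketch of this idea (illustrative only: vanilla RNN cells, made-up sizes, and a fixed number of decoding steps; real models add input/output embeddings and a stopping criterion):

```python
import numpy as np

D, H, V = 4, 3, 4                              # input dim, hidden dim, output dim (illustrative)
rng = np.random.default_rng(0)

# Encoder weights W_1 and decoder weights W_2 (separate parameter sets, as on the slide)
W1_xh = rng.standard_normal((H, D)) * 0.01
W1_hh = rng.standard_normal((H, H)) * 0.01
W2_hh = rng.standard_normal((H, H)) * 0.01
W2_hy = rng.standard_normal((V, H)) * 0.01

def encode(xs):
    """Many to one: fold the whole input sequence into a single vector h_T."""
    h = np.zeros(H)
    for x in xs:
        h = np.tanh(W1_hh @ h + W1_xh @ x)
    return h

def decode(h, steps):
    """One to many: produce an output sequence from the single encoded vector."""
    ys = []
    for _ in range(steps):
        h = np.tanh(W2_hh @ h)                 # no per-step input in this minimal sketch
        ys.append(W2_hy @ h)
    return ys

xs = [rng.standard_normal(D) for _ in range(6)]
outputs = decode(encode(xs), steps=3)
```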

Example: Character-level Language Model. Vocabulary: [h, e, l, o]. Example training sequence: “hello”. Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS231n
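The slide's figure one-hot encodes each character before feeding it to the RNN; here is a small sketch of that encoding for the training sequence (the helper name is ours, not the slide's):

```python
import numpy as np

vocab = ['h', 'e', 'l', 'o']
char_to_ix = {ch: i for i, ch in enumerate(vocab)}

def one_hot(ch):
    """Encode a character as a 4-dim one-hot vector over the vocabulary [h, e, l, o]."""
    v = np.zeros(len(vocab))
    v[char_to_ix[ch]] = 1.0
    return v

# Training sequence "hello": inputs are h, e, l, l and targets are e, l, l, o
inputs  = [one_hot(c) for c in "hell"]
targets = [char_to_ix[c] for c in "ello"]
```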

Distributed Representations Toy Example • Local vs Distributed (C) Dhruv Batra Slide Credit: Moontae Lee 48

Distributed Representations Toy Example • Can we interpret each dimension? (C) Dhruv Batra Slide Credit: Moontae Lee 49

Power of distributed representations! (Local vs Distributed) (C) Dhruv Batra Slide Credit: Moontae Lee 50
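To illustrate the local vs distributed contrast in a toy setting (our own illustration, assuming the usual counting argument rather than the figure's exact encoding): with n units, a local (one-hot) code can represent only n distinct items, while a distributed (binary-pattern) code over the same n units can represent up to 2^n items.

```python
import numpy as np

n = 4  # number of units (illustrative)

# Local code: one unit per concept -> at most n concepts
local_codes = np.eye(n)

# Distributed code: a pattern of activity across units -> up to 2**n concepts
distributed_codes = np.array(
    [[(i >> b) & 1 for b in range(n)] for i in range(2 ** n)], dtype=float
)

print(local_codes.shape[0], "concepts with a local code")              # 4
print(distributed_codes.shape[0], "concepts with a distributed code")  # 16
```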

Training Time: MLE / “Teacher Forcing”. Example: Character-level Language Model. Vocabulary: [h, e, l, o]. Example training sequence: “hello”. Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS231n
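A hedged sketch of teacher forcing on “hello” (maximum-likelihood training): at each step the ground-truth character, not the model's own prediction, is fed as the input, and the loss is the negative log-probability of the true next character. The sizes, initialization, and the absence of an actual gradient update are simplifications of ours.

```python
import numpy as np

vocab = ['h', 'e', 'l', 'o']
char_to_ix = {ch: i for i, ch in enumerate(vocab)}
V, H = len(vocab), 8
rng = np.random.default_rng(0)
Wxh = rng.standard_normal((H, V)) * 0.01
Whh = rng.standard_normal((H, H)) * 0.01
Why = rng.standard_normal((V, H)) * 0.01

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

seq = "hello"
h, loss = np.zeros(H), 0.0
for ch_in, ch_target in zip(seq[:-1], seq[1:]):
    x = np.zeros(V)
    x[char_to_ix[ch_in]] = 1.0                   # teacher forcing: feed the TRUE previous char
    h = np.tanh(Whh @ h + Wxh @ x)
    p = softmax(Why @ h)
    loss += -np.log(p[char_to_ix[ch_target]])    # MLE: maximize log-prob of the true next char
print(loss)   # gradients of this loss w.r.t. the weights would be computed with BPTT
```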

Test Time: Sample / Argmax / Beam Search. Example: Character-level Language Model. Vocabulary: [h, e, l, o]. At test-time, sample characters one at a time and feed each sample back into the model as the next input. Softmax outputs over [h, e, l, o] at each step: [.03, .13, .00, .84] → sample “e”; [.25, .20, .05, .50] → sample “l”; [.11, .17, .68, .03] → sample “l”; [.11, .02, .08, .79] → sample “o”. Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS231n
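A minimal sampling loop in the same spirit (a sketch with made-up, untrained weights; np.random.Generator.choice draws from the softmax distribution, and argmax or beam search would be drop-in alternatives at the marked line):

```python
import numpy as np

vocab = ['h', 'e', 'l', 'o']
char_to_ix = {ch: i for i, ch in enumerate(vocab)}
V, H = len(vocab), 8
rng = np.random.default_rng(0)
Wxh = rng.standard_normal((H, V)) * 0.01
Whh = rng.standard_normal((H, H)) * 0.01
Why = rng.standard_normal((V, H)) * 0.01

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

h = np.zeros(H)
ch = 'h'                                   # seed character
generated = [ch]
for _ in range(4):
    x = np.zeros(V)
    x[char_to_ix[ch]] = 1.0
    h = np.tanh(Whh @ h + Wxh @ x)
    p = softmax(Why @ h)
    ix = rng.choice(V, p=p)                # sample (argmax would be: ix = p.argmax())
    ch = vocab[ix]                         # feed the sampled character back in
    generated.append(ch)
print("".join(generated))
```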
