CSE 190 Neural Networks: The Neural Turing Machine


Gary Cottrell, Week 10, Lecture 2

Introduction

- Neural nets have been shown to be Turing-equivalent (Siegelmann & Sontag).
- But can they learn programs? Yes!

Introduction

- One can explicitly teach a neural net to implement a program, e.g., Tsung & Cottrell (1993): sequential addition.

Introduction

- One can explicitly teach a neural net to implement a program, e.g., Tsung & Cottrell (1993).
- That's nice, but: the network had to learn to remember things in its internal state space. When it had to remember one extra bit, it took 5,000 more epochs!
- We told the model exactly what action to take on every step.
- Can a network learn what to do just by seeing input-output examples?

The Neural Turing Machine

A Turing Machine has a finite-state controller and a tape that it can read from and write to.

The Neural Turing Machine

Can we build one of these out of a neural net? The idea seems like it might be similar to this one: a Turing Machine made from tinker toys that plays tic-tac-toe.

The Neural Turing Machine

It's a nice party trick, but totally impractical. Does an NTM have to be the same way? No!

The Neural Turing Machine

The main idea: add a structured memory to a neural controller that it can write to and read from.

The Neural Turing Machine

This is completely differentiable from end to end, so it can be trained by BPTT!

The Neural Turing Machine

The controller can be a feed-forward network or a recurrent one with LSTM units. The recurrent one works better.

The Neural Turing Machine

The read and write heads are reasonably simple; the addressing mechanism, not so much. The addressing mechanism allows both content-addressable and location-based addressing (the normal kind). It is a highly structured system, but differentiable.

The Neural Turing Machine

The memory is simply a matrix M of linear neurons. Think of each row as a "word" of memory; we will read and write a row at a time. In the examples below, M is 3 × 4: 3 words of length 4.

The Neural Turing Machine

The memory is simply a matrix of linear neurons, M. Indexing the rows as i = 1, 2, 3, we can read from it using

  r_t = Σ_i w_t(i) M_t(i)

where r_t is the vector read out from the memory at time t, and w_t is a softmax column vector of length 3, which is the address!

Read Example

Suppose we want the second row. If w = (0, 1, 0)^T, the read yields r_t = (5, 9, 9, 7). Again, note how w is the address.

Read Example

If w = (½, ½, 0)^T, the read yields (4, 5, 6.5, 4), the average of rows 1 and 2. So you can even do a simple computation with the addressing mechanism!

Read Example

Note: this expression is clearly differentiable with respect to M and w.
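A minimal NumPy sketch of the read operation. The slide's memory matrix wasn't reproduced in the text, so the first two rows below are reconstructed from the read examples above and the third row is a made-up placeholder.

```python
import numpy as np

# Toy 3 x 4 memory: rows 1 and 2 are chosen to reproduce the read examples
# above; row 3 is a made-up placeholder.
M = np.array([[3., 1., 4., 1.],
              [5., 9., 9., 7.],
              [8., 0., 0., 2.]])

def read(M, w):
    """r_t = sum_i w(i) * M(i, :), i.e. a w-weighted blend of the memory rows."""
    return w @ M

print(read(M, np.array([0.0, 1.0, 0.0])))   # -> [5. 9. 9. 7.]   (row 2)
print(read(M, np.array([0.5, 0.5, 0.0])))   # -> [4. 5. 6.5 4.]  (average of rows 1 and 2)
```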

Writing to the memory

- Inspired by the gating in LSTM networks, they use gates to write to memory.
- A write operation happens in two steps:
  - Erase memories
  - Add to them.

Erasing the memory

- A write operation happens in two steps: erase memories, then add to them.
- Erasing memory element i:

  M̃_t(i) = M_{t−1}(i) ⊙ (1 − w_t(i) e_t)

- Again, w is the address: if w_t(i) is 1 at time t and e_t is an erase vector, then memory row i is modified (e_t is multiplied point-wise with the memory vector).

Erasing the memory

- Erasing memory element i: if w = (0, 1, 0) and e_t = (1, 1, 1, 1), then row 2 of M is erased (set to zeros) and the other rows are untouched.
- Obviously, more complex "erases" are possible.

Adding to the memory

- Adding to memory element i:

  M_t(i) = M̃_t(i) + w_t(i) a_t

- If w = (0, 1, 0) and a_t = (1, 2, 3, 4), then (after the erase above) row 2 of M becomes (1, 2, 3, 4).
- Question for you: why do we have to do a write in two steps?

Adding to the memory

- Question for you: why do we have to do a write in two steps?
- The memory is a bunch of linear neurons: there is no "overwrite" operation in arithmetic.
- We could use an add vector that is the new vector minus the old vector, but that requires three steps! (What are they?)
- We would have to do a read, compute the modification vector, and then add it. (A sketch of the two-step write follows below.)
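A minimal NumPy sketch of the two-step write (erase, then add), continuing the toy memory from the read sketch. The erase and add formulas are the element-wise ones given above.

```python
import numpy as np

def write(M, w, e, a):
    """Two-step NTM-style write, with one address weight w(i) per memory row.

    Erase: M~(i) = M(i) * (1 - w(i) * e)   (element-wise)
    Add:   M'(i) = M~(i) + w(i) * a
    """
    M_erased = M * (1.0 - np.outer(w, e))   # zero out (parts of) the addressed rows
    return M_erased + np.outer(w, a)        # then add the new content there

M = np.array([[3., 1., 4., 1.],
              [5., 9., 9., 7.],
              [8., 0., 0., 2.]])            # same toy memory as before (row 3 is a placeholder)

w = np.array([0., 1., 0.])                  # address row 2
e = np.array([1., 1., 1., 1.])              # erase everything at that address
a = np.array([1., 2., 3., 4.])              # then write (1, 2, 3, 4)
print(write(M, w, e, a))                    # row 2 becomes [1. 2. 3. 4.]; other rows unchanged
```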

Computing w, the address

This is a complicated mechanism, designed to allow content-addressability as well as location-based addressing.

Computing w, the address

These parameters (k_t, β_t, g_t, s_t, and γ_t, listed on the next slide) must be computed by the controller.

Computing w, the address

- k_t is a key vector for content-addressable memory: we want the memory element most similar to this vector (for our 3 × 4 memory, this would be a length-4 vector)
- β_t is a gain parameter on the content match (sharpening it)
- g_t is a switch (gate) between content- and location-based addressing
- s_t is a shift vector that can increment or decrement the address, or leave it alone
- γ_t is a gain parameter on the softmax address, making it more binary

One way the controller might produce these parameters is sketched below.
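The slides don't show how the controller's raw outputs are turned into these quantities; the sketch below is one common parameterization (an assumption, not taken from the slides): β_t ≥ 0, g_t in (0, 1), s_t a distribution over shifts, γ_t ≥ 1.

```python
import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))

def split_head_params(h, key_size=4, num_shifts=3):
    """Split one controller output vector h into (k, beta, g, s, gamma).

    The squashing functions are an assumed (common) choice, not specified in the slides.
    """
    k = h[:key_size]                                       # key vector (unconstrained)
    beta = softplus(h[key_size])                           # content-match gain, >= 0
    g = 1.0 / (1.0 + np.exp(-h[key_size + 1]))             # content/location gate, in (0, 1)
    s_raw = h[key_size + 2 : key_size + 2 + num_shifts]
    s = np.exp(s_raw) / np.exp(s_raw).sum()                # shift weights: a softmax distribution
    gamma = 1.0 + softplus(h[key_size + 2 + num_shifts])   # sharpening gain, >= 1
    return k, beta, g, s, gamma

k, beta, g, s, gamma = split_head_params(np.random.randn(4 + 1 + 1 + 3 + 1))
```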

Content-based addressing

- k_t is a key vector for content-addressable memory: we want the memory element most similar to this vector
- β_t is a gain parameter on the content match (sharpening it)
- So, we create an address vector where each entry is based on how similar that memory element is to the key:

  w^c_t(i) = exp(β_t K(k_t, M_t(i))) / Σ_j exp(β_t K(k_t, M_t(j)))

  where K is cosine similarity, so this is a softmax on the (scaled) similarity.

Content-based addressing

So, suppose k = (3, 1, 0, 0), and suppose the match scores (cosines) for the three rows are 0.8, 0.1, and 0.1. With β = 1 this operation produces an address vector of w^c = (0.5, 0.25, 0.25), but with β = 10, w^c ≈ (0.998, 0.001, 0.001). So a large β sharpens the address.
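A minimal sketch of content-based addressing (cosine similarity against each row, then a β-scaled softmax). To match the slide's example, the loop below plugs in the hypothetical similarity scores (0.8, 0.1, 0.1) directly rather than computing them from a particular memory.

```python
import numpy as np

def content_address(M, k, beta):
    """w_c(i) = softmax_i( beta * cosine(k, M(i, :)) )."""
    sims = (M @ k) / (np.linalg.norm(M, axis=1) * np.linalg.norm(k) + 1e-8)
    z = np.exp(beta * sims)
    return z / z.sum()

# The sharpening effect of beta, using the slide's hypothetical cosine scores:
sims = np.array([0.8, 0.1, 0.1])
for beta in (1.0, 10.0):
    z = np.exp(beta * sims)
    print(beta, z / z.sum())
# beta = 1  -> roughly [0.50  0.25  0.25 ]
# beta = 10 -> roughly [0.998 0.001 0.001]
```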

Location-based addressing

This part of the model takes in the content-based address and a gate variable g_t, which switches between the content-based address and the previous memory address vector (so that we can increment or decrement the previous address).

Location-based addressing

What the "Interpolation" box does:

  w^g_t = g_t w^c_t + (1 − g_t) w_{t−1}

I.e., if g_t = 1, we use the content-based address; if g_t = 0, we pass through the previous address, w_{t−1}. A mixture (e.g., g_t = 0.5) doesn't really make sense...
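A one-line NumPy sketch of the interpolation gate, with illustrative inputs.

```python
import numpy as np

def interpolate(w_content, w_prev, g):
    """w_g = g * w_content + (1 - g) * w_prev: g near 1 picks the content-based
    address, g near 0 keeps the previous address."""
    return g * w_content + (1.0 - g) * w_prev

w_c = np.array([0.998, 0.001, 0.001])   # content-based address
w_prev = np.array([0.0, 0.0, 1.0])      # previous address
print(interpolate(w_c, w_prev, g=1.0))  # -> the content-based address
print(interpolate(w_c, w_prev, g=0.0))  # -> the previous address
```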

Location-based addressing

The "Convolutional shift" box takes in the content-based or location-based address, w^g_t, and can then shift it by −1, 0, or +1, based on the s_t vector (larger shift ranges are possible).

Location-based addressing

The "Convolutional shift" box operation is a circular convolution:

  w̃_t(i) = Σ_j w^g_t(j) s_t(i − j)    (indices taken modulo N)

E.g., if s_t = (0, 0, 1), this increments the address by 1: (0, 1, 0) → (0, 0, 1).
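A minimal sketch of the circular shift for shifts in {−1, 0, +1}; the ordering of the entries of s_t is an assumption, chosen so that the slide's example works out.

```python
import numpy as np

def circular_shift(w, s):
    """Circularly convolve the address w with shift weights s = (s(-1), s(0), s(+1)):

        w_out(i) = sum_d s(d) * w((i - d) mod N)
    """
    out = np.zeros_like(w)
    for weight, d in zip(s, (-1, 0, +1)):
        out += weight * np.roll(w, d)   # np.roll(w, d) moves each entry d positions forward
    return out

print(circular_shift(np.array([0., 1., 0.]), s=np.array([0., 0., 1.])))  # -> [0. 0. 1.]
```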

Location-based addressing

The "Sharpening" box again applies a high-gain renormalization to the address, based on γ_t: each entry is raised to the power γ_t and the result is renormalized:

  w_t(i) = w̃_t(i)^γ_t / Σ_j w̃_t(j)^γ_t

E.g., if γ = 2, (0.9, 0.1, 0) → (0.99, 0.01, 0).
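A minimal sketch of the sharpening step in the raise-to-a-power-and-renormalize form above, which reproduces the slide's numbers.

```python
import numpy as np

def sharpen(w, gamma):
    """w_out(i) = w(i)**gamma / sum_j w(j)**gamma; larger gamma pushes w toward one-hot."""
    p = w ** gamma
    return p / p.sum()

print(sharpen(np.array([0.9, 0.1, 0.0]), gamma=2.0))  # -> roughly [0.99 0.01 0.]
```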

Computing the address:

1. Creating a vector address based on similarity to existing memories
2. Switching between content and location
3. Incrementing or decrementing the address
4. Sharpening the address

Computing the address:

All of these operations are differentiable with respect to all of their parameters, which means we can backprop through them to change the parameters!
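Putting the pieces together: a sketch of the full addressing pipeline, reusing the helper functions from the earlier snippets (content_address, interpolate, circular_shift, sharpen) on the same toy memory; the particular inputs are only illustrative.

```python
import numpy as np

def address(M, w_prev, k, beta, g, s, gamma):
    """Compute the new address w_t from the controller outputs (k, beta, g, s, gamma)."""
    w_c = content_address(M, k, beta)   # 1. content-based match against the memory rows
    w_g = interpolate(w_c, w_prev, g)   # 2. gate between content-based and previous address
    w_s = circular_shift(w_g, s)        # 3. optionally shift the address by -1 / 0 / +1
    return sharpen(w_s, gamma)          # 4. sharpen toward a (nearly) one-hot address

M = np.array([[3., 1., 4., 1.],
              [5., 9., 9., 7.],
              [8., 0., 0., 2.]])        # toy memory from the earlier sketches

w = address(M, w_prev=np.array([1., 0., 0.]),
            k=np.array([5., 9., 9., 7.]),   # key matching row 2
            beta=10.0, g=1.0,
            s=np.array([0., 1., 0.]),       # "no shift"
            gamma=2.0)
print(w)   # roughly [0.04 0.96 0.00]: nearly all the weight lands on row 2
```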

Ok, what can we do with this?

- Copy
- Repeated Copy
- Associative Recall
- Dynamic N-grams
- Priority Sort

They compare a feed-forward controller, an LSTM controller, and a vanilla LSTM net on these tasks.

Copy

- Input: some sequence, e.g., 1, 2, 3, 4
- Target: after the end of the sequence, 1, 2, 3, 4

(A sketch of how such training pairs can be generated follows below.)
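A minimal sketch of generating copy-task training pairs. The use of random binary vectors and a delimiter channel is an assumption in the spirit of the original NTM copy task, not something given on the slide.

```python
import numpy as np

def make_copy_example(seq_len=5, width=8, rng=np.random):
    """One copy-task example: the input presents a random binary sequence followed by a
    delimiter flag; the target is the same sequence, expected after the delimiter."""
    seq = (rng.rand(seq_len, width) > 0.5).astype(np.float32)
    inputs = np.zeros((2 * seq_len + 1, width + 1), dtype=np.float32)
    inputs[:seq_len, :width] = seq      # present the sequence
    inputs[seq_len, width] = 1.0        # extra channel marks "end of input, start copying"
    targets = np.zeros((2 * seq_len + 1, width), dtype=np.float32)
    targets[seq_len + 1:, :] = seq      # the network must reproduce the sequence from memory
    return inputs, targets

x, y = make_copy_example(seq_len=5)
print(x.shape, y.shape)   # (11, 9) (11, 8)
```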

Copy: Results (LSTM controller)

Note: it was only trained on length-20 sequences.

Copy: Results (LSTM network)

Note: it was only trained on length-20 sequences.

Copy: Results (LSTM NTM)

Copy: The program it learned

Priority Sort: the training

Priority Sort: the results

"We hypothesize that it uses the priorities to determine the relative location of each write. To test this hypothesis we fitted a linear function of the priority to the observed write locations. Figure 17 shows that the locations returned by the linear function closely match the observed write locations. It also shows that the network reads from the memory locations in increasing order, thereby traversing the sorted sequence."