Transformer

Outline
• Seq2Seq problem
• Quick review
  - Simple RNN
  - LSTM
• Seq2Seq model
  - Model
  - Attention-based model
• Transformer
• Application
  - BERT
  - GPT
• RNN vs Transformer

Seq2Seq Problem
• Example sequences: "genetic algorithm", "基因演算法" (Chinese for "genetic algorithm"), "Hello", "基因演算法", "A boy ride on a bike"

RNN
• Slot filling: assign each word to a slot (destination, place of departure, time, or other)
  - "arrive Taipei on November 27th": arrive → other, Taipei → destination, on → other, November 27th → time
  - "leave Taipei on November 27th": leave → other, Taipei → place of departure, on → other, November 27th → time
  - The same word "Taipei" fills a different slot depending on the preceding word, so the model must remember context
  - Each word is fed to the network as a vector
  - Source: https://www.youtube.com/watch?v=ZjfjPzXw6og&feature=youtu.be

RNN
• Given function f: h', y = f(h, x)
  - The same f is applied at every step: h0 → f → h1 → f → h2 → f → ……
  - Inputs x1, x2, x3: arrive, Taipei, on
  - Outputs y1, y2, y3: other, destination, other
  - Source: https://www.youtube.com/watch?v=ZjfjPzXw6og&feature=youtu.be
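The recurrence h', y = f(h, x) can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions: the tanh nonlinearity, the weight names (W_h, W_x, W_y), and the layer sizes are illustrative choices, not taken from the slides.

```python
import numpy as np

def make_rnn(input_dim, hidden_dim, output_dim, seed=0):
    """Create random weights for a toy RNN cell (hypothetical sizes)."""
    rng = np.random.default_rng(seed)
    return {
        "W_h": rng.normal(scale=0.1, size=(hidden_dim, hidden_dim)),
        "W_x": rng.normal(scale=0.1, size=(hidden_dim, input_dim)),
        "W_y": rng.normal(scale=0.1, size=(output_dim, hidden_dim)),
    }

def f(params, h, x):
    """One step of the recurrence: returns (h', y) from (h, x)."""
    h_next = np.tanh(params["W_h"] @ h + params["W_x"] @ x)
    y = params["W_y"] @ h_next  # unnormalized slot scores
    return h_next, y

def run_rnn(params, xs):
    """Apply the same f at every position, carrying h forward."""
    h = np.zeros(params["W_h"].shape[0])  # h0
    ys = []
    for x in xs:
        h, y = f(params, h, x)
        ys.append(y)
    return np.stack(ys), h

params = make_rnn(input_dim=8, hidden_dim=16, output_dim=4)
xs = np.random.default_rng(1).normal(size=(3, 8))  # stand-ins for x1, x2, x3
ys, h_last = run_rnn(params, xs)
```

Here `ys` holds one score vector per input word (four slots in this toy setup), and `h_last` is the state that would be passed on to the next word.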

LSTM
• LSTM architecture
  - Source: http://hemingwang.blogspot.com/2018/02/airnnlstmin-120-mins.html

Seq2Seq Model
• Sequence generation: the token produced at one step is the input to the next step
  - h0, <BOS> → f → "arrive"
  - h1, "arrive" → f → "Taipei"
  - h2, "Taipei" → f → "on"
  - h3, "on" → f → "November" ……
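The feed-back loop of sequence generation can be sketched in plain Python. This is only an illustration: `toy_step` is a hypothetical lookup table standing in for a trained f, and the `<EOS>` stop token is an assumption that does not appear on the slide.

```python
def generate(step, h0, bos="<BOS>", eos="<EOS>", max_len=10):
    """Greedy generation: feed each output token back as the next input."""
    h, token, out = h0, bos, []
    for _ in range(max_len):
        h, token = step(h, token)
        if token == eos:
            break
        out.append(token)
    return out

# Hypothetical stand-in for f: a lookup table instead of a trained network.
NEXT = {"<BOS>": "arrive", "arrive": "Taipei", "Taipei": "on",
        "on": "November", "November": "<EOS>"}

def toy_step(h, token):
    """Return (new state, next token); the state just counts steps here."""
    return h + 1, NEXT[token]

tokens = generate(toy_step, h0=0)
```

In a real model `step` would run the network and pick (or sample) the next token from its output distribution; the loop itself stays the same.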

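Seq2Seq Model
• Conditional sequence generation
  - Image caption: a CNN turns the image into a vector, and that vector is fed to f at every step
  - h0, vector, <BOS> → f → "A"
  - h1, vector, "A" → f → "boy"
  - h2, vector, "boy" → f → "ride" ……
  - Full caption: "A boy ride on a bike"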

Seq2Seq Model
• Conditional sequence generation
  - Translation: "基因演算法" → "genetic algorithm"
  - Encoder: f1 reads 基, 因, 演, 算, 法 one character at a time (states h0 to h4) and summarizes the sentence as a vector
  - Decoder: f2 receives that vector at each step
  - k0, vector, <BOS> → f2 → "genetic"
  - k1, vector, "genetic" → f2 → "algorithm"

Seq2Seq Model
• Attention-based model
  - Translation: "基因演算法" → "genetic algorithm"
  - Encoder: f1 reads 基, 因, 演, 算, 法 (states h0 to h4)
  - Decoder: f2 receives a different context vector at each step instead of one fixed vector
  - k0, c1, <BOS> → f2 → "genetic"
  - k1, c2, "genetic" → f2 → "algorithm"

Seq2Seq Model
• Result of Transformer

Seq2Seq Model
• Attention-based model
  - Translation: "基因演算法" → "genetic algorithm"
  - The encoder f1 reads 基, 因, 演, 算, 法 and produces states h1 to h5
  - z0 is compared with each of h1 to h5 to give scores s1 to s5
  - Attention weights: softmax over s1 to s5, here 0.5, 0.5, 0, 0, 0
  - Context vector: c1 = 0.5*h1 + 0.5*h2 + 0*h3 + 0*h4 + 0*h5

Seq2Seq Model
• Attention-based model

Seq2Seq Model
• Attention-based model
  - Translation: "基因演算法" → "genetic algorithm"
  - Attention weights: softmax over s1 to s5, here 0.5, 0.5, 0, 0, 0
  - Context vector: c1 = 0.5*h1 + 0.5*h2 + 0*h3 + 0*h4 + 0*h5
  - Figure annotation: CNN
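The score, softmax, and weighted-sum steps of the attention-based model can be sketched as below. This is a sketch under assumptions: the slides do not name the function that compares z0 with each h, so a plain dot product is used here, and the encoder states are toy numbers.

```python
import numpy as np

def softmax(s):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(s - np.max(s))
    return e / e.sum()

def attention_context(z, H):
    """Score each encoder state against z, normalize, and mix the states.

    z: decoder state, shape (d,);  H: encoder states h1..hn, shape (n, d).
    Dot-product scoring is an assumption; the slides do not name the match function.
    """
    scores = H @ z               # s1..sn
    weights = softmax(scores)    # attention weights, sum to 1
    context = weights @ H        # c = sum_i weights[i] * H[i]
    return context, weights

# With weights (0.5, 0.5, 0, 0, 0) the context is the mean of h1 and h2.
H = np.arange(10, dtype=float).reshape(5, 2)   # toy h1..h5
w = np.array([0.5, 0.5, 0.0, 0.0, 0.0])
c1 = w @ H

# The general path: weights come from the scores rather than being fixed.
context, weights = attention_context(np.ones(2), H)
```

The fixed weights reproduce the slide's c1 exactly; in a trained model the weights change at every decoder step, which is how c1 and c2 end up focusing on different source characters.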

Transformer
• Self-attention
  - q: query
  - k: key
  - v: value
  - Each input token (基, 因, 演, 算, 法) has its own q, k, and v: (q1, k1, v1) to (q5, k5, v5)

Transformer
• Self-attention
  - q1 is compared with every key k1 to k5 to give scores a1 to a5
  - softmax over a1 to a5 gives normalized weights a'1 to a'5
  - b1 is the weighted sum of the values: b1 = a'1*v1 + a'2*v2 + a'3*v3 + a'4*v4 + a'5*v5

Transformer
• Self-attention
  - The layer maps the inputs 基, 因, 演, 算, 法 to the outputs b1, b2, b3, b4, b5
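The whole self-attention layer can be written as a few matrix products. The sketch below follows the q, k, v, a, a', b naming of the slides; the projection matrices (W_q, W_k, W_v), the dimensions, and the random inputs are assumptions for illustration. The division by the square root of the key dimension is the scaling used in the original Transformer and is not shown on the slides.

```python
import numpy as np

def softmax_rows(S):
    """Row-wise softmax."""
    E = np.exp(S - S.max(axis=-1, keepdims=True))
    return E / E.sum(axis=-1, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head self-attention over token vectors X of shape (n, d_model)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v          # q_i, k_i, v_i for every token
    A = Q @ K.T / np.sqrt(K.shape[-1])           # a_ij: query i against key j (scaled)
    A_prime = softmax_rows(A)                    # a'_ij, each row sums to 1
    B = A_prime @ V                              # b_i = sum_j a'_ij * v_j
    return B, A_prime

rng = np.random.default_rng(0)
n, d_model, d_k = 5, 8, 4                        # 5 tokens: 基, 因, 演, 算, 法
X = rng.normal(size=(n, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
B, A_prime = self_attention(X, W_q, W_k, W_v)
```

Row i of `B` is b_i. All five rows are computed in one pass, with no step-by-step dependence between positions.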

Transformer

Transformer
• Multi-head attention
  - Each head has its own queries, keys, and values; token i in head j uses q_ij, k_ij, v_ij (three heads shown)
  - Head 1: scores a11 to a51 → softmax → a'11 to a'51 → weighted sum of v11 to v51 gives b11
  - Head 2: scores a12 to a52 → softmax → a'12 to a'52 → weighted sum of v12 to v52 gives b12
  - Head 3: scores a13 to a53 → softmax → a'13 to a'53 → weighted sum of v13 to v53 gives b13
  - The head outputs are stacked and multiplied by W: b1 = W x [b11; b12; b13]

Transformer
• Multi-head attention
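Multi-head attention runs several independent attention heads and mixes their outputs with one matrix. The sketch below assumes three heads, random projection matrices, and the name W_o for the mixing matrix (written W on the slides); none of the sizes come from the source.

```python
import numpy as np

def softmax_rows(S):
    """Row-wise softmax."""
    E = np.exp(S - S.max(axis=-1, keepdims=True))
    return E / E.sum(axis=-1, keepdims=True)

def multi_head_attention(X, heads, W_o):
    """heads: list of (W_q, W_k, W_v) triples, one per head; W_o mixes the heads."""
    outputs = []
    for W_q, W_k, W_v in heads:
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        A_prime = softmax_rows(Q @ K.T / np.sqrt(K.shape[-1]))
        outputs.append(A_prime @ V)              # b_i1, b_i2, ... for each token i
    concat = np.concatenate(outputs, axis=-1)    # [b_i1; b_i2; b_i3] per token
    return concat @ W_o                          # b_i = W applied to the concatenation

rng = np.random.default_rng(0)
n, d_model, d_k, n_heads = 5, 12, 4, 3
X = rng.normal(size=(n, d_model))
heads = [tuple(rng.normal(size=(d_model, d_k)) for _ in range(3))
         for _ in range(n_heads)]
W_o = rng.normal(size=(n_heads * d_k, d_model))
B = multi_head_attention(X, heads, W_o)
```

Because each head has its own projections, different heads are free to attend to different positions for the same token.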

Transformer
• Positional encoding
  - Each position has a unique positional vector
  - Without it, self-attention treats 基, 因, 演, 算, 法 the same way regardless of their order

Transformer
• Positional encoding
  - Example: the first position (基) is marked with the vector 1 0 0
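One way to read the "1 0 0" on the slide is as a one-hot position vector appended to the token vector; that reading is an assumption. Under it, appending the one-hot vector before a linear layer is the same as adding a learned per-position vector after it, which the sketch below checks numerically. The names W_I and W_P and all sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_positions, d_in, d_model = 5, 6, 4
W_I = rng.normal(size=(d_model, d_in))          # acts on the token vector x
W_P = rng.normal(size=(d_model, n_positions))   # acts on the one-hot position p

x = rng.normal(size=d_in)                       # token vector at position 1
p = np.zeros(n_positions)
p[0] = 1.0                                      # one-hot position: 1 0 0 0 0

# Appending p to x and applying one wide matrix ...
concat_result = np.hstack([W_I, W_P]) @ np.concatenate([x, p])
# ... equals embedding the token and adding a per-position vector e = W_P p.
added_result = W_I @ x + W_P @ p
```

Here `W_P @ p` simply selects the first column of `W_P`, so every position gets its own vector. The original Transformer fixes these vectors with sinusoids instead of learning them.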

Transformer

Application
• BERT
  - BERT is the encoder of the Transformer
  - It maps the inputs 基, 因, 演, 算, 法 to one output vector per token: v1, v2, v3, v4, v5

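Application
• Training of BERT
  - Masked LM: one input token is replaced by [MASK], e.g. 基 因 [MASK] 算 法
  - BERT outputs v1, v2, v3, v4, v5
  - A linear multi-class classifier, with one output per vocabulary entry (vocabulary size), predicts the masked token from the output vector at the [MASK] position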
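The masked LM head is a linear layer followed by a softmax over the vocabulary. The sketch below is illustrative only: the six-entry vocabulary, the random stand-ins for BERT's outputs, and the choice of 演 as the masked token are assumptions, and no real BERT model is involved.

```python
import numpy as np

vocab = ["[MASK]", "基", "因", "演", "算", "法"]   # toy vocabulary (assumption)
tokens = ["基", "因", "[MASK]", "算", "法"]
mask_pos = tokens.index("[MASK]")

rng = np.random.default_rng(0)
d_model = 8
V = rng.normal(size=(len(tokens), d_model))       # stand-in for BERT outputs v1..v5
W = rng.normal(size=(len(vocab), d_model))        # linear classifier, one row per word
b = np.zeros(len(vocab))

logits = W @ V[mask_pos] + b                      # one score per vocabulary entry
probs = np.exp(logits - logits.max())
probs /= probs.sum()
predicted = vocab[int(np.argmax(probs))]          # random weights: arbitrary guess
loss = -np.log(probs[vocab.index("演")])          # cross-entropy against the true token
```

Training would adjust both the classifier and BERT itself to lower `loss`, which forces the output at the masked position to encode its surrounding context.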

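Application
• Training of BERT
  - Next sentence prediction
  - [SEP]: the boundary of two sentences
  - [CLS]: the position that outputs classification results
  - Input: [CLS] 好久不見 [SEP] 最近好嗎 ("Long time no see" followed by "How have you been lately?")
  - A linear binary classifier on v1, the output at [CLS], decides whether the second sentence follows the first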
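Building the input for next sentence prediction amounts to joining the two sentences with the special tokens. The helper below is a hypothetical sketch: it splits sentences into single characters rather than using BERT's actual tokenizer, and the segment ids are an added assumption that the slide does not show.

```python
def build_nsp_input(sentence_a, sentence_b):
    """Return (tokens, segment_ids) for a sentence pair, character-level."""
    tokens = ["[CLS]"] + list(sentence_a) + ["[SEP]"] + list(sentence_b)
    # Segment ids (0 for the first sentence and its markers, 1 for the second)
    # are an assumption here; the slide only shows [CLS] and [SEP].
    split = len(sentence_a) + 2
    segment_ids = [0] * split + [1] * len(sentence_b)
    return tokens, segment_ids

tokens, segment_ids = build_nsp_input("好久不見", "最近好嗎")
```

The classifier then reads only the output vector at index 0, the [CLS] position, to produce its yes/no answer.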

Application
• GPT
  - GPT is the decoder of the Transformer
  - It maps the inputs 基, 因, 演, 算, 法 to the outputs y1, y2, y3, y4, y5

Application
• GPT: generates one token at a time; each token attends only to itself and the tokens before it
  - Inputs <BOS>, 基 with (q1, k1, v1), (q2, k2, v2): q2 against k1, k2 → a1, a2 → softmax → a'1, a'2 → weighted sum of v1, v2 → predicts 因
  - 因 is appended as the third input (q3, k3, v3): q3 against k1 to k3 → a'1 to a'3 → predicts 演
  - 演 is appended as the fourth input (q4, k4, v4): q4 against k1 to k4 → a'1 to a'4 → predicts 算
  - 算 is appended as the fifth input (q5, k5, v5): q5 against k1 to k5 → a'1 to a'5 → predicts 法
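The "attend only to earlier tokens" rule is usually implemented with a causal mask on the attention scores. The sketch below is a minimal illustration with random weights and assumed dimensions. It checks that the outputs for the first two tokens are the same whether or not later tokens are present, which is why GPT can extend a sequence one token at a time.

```python
import numpy as np

def causal_self_attention(X, W_q, W_k, W_v):
    """Self-attention where token i may only attend to tokens 1..i."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    S = Q @ K.T / np.sqrt(K.shape[-1])
    future = np.triu(np.ones_like(S, dtype=bool), k=1)   # True above the diagonal
    S = np.where(future, -np.inf, S)                     # block attention to later tokens
    E = np.exp(S - S.max(axis=-1, keepdims=True))
    A = E / E.sum(axis=-1, keepdims=True)
    return A @ V, A

rng = np.random.default_rng(0)
d_model, d_k = 8, 4
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
X = rng.normal(size=(5, d_model))        # stand-ins for <BOS>, 基, 因, 演, 算

out_full, A = causal_self_attention(X, W_q, W_k, W_v)
out_prefix, _ = causal_self_attention(X[:2], W_q, W_k, W_v)   # only <BOS>, 基
```

The weight matrix `A` is lower triangular: row 2 has nonzero weights only for the first two tokens, matching the a'1, a'2 step on the slide.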

RNN vs Transformer