Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty


Yoshimasa Tsuruoka, Jun'ichi Tsujii, and Sophia Ananiadou
University of Manchester

Log-linear models in NLP
• Maximum entropy models – text classification (Nigam et al., 1999), history-based approaches (Ratnaparkhi, 1998)
• Conditional random fields – part-of-speech tagging (Lafferty et al., 2001), chunking (Sha and Pereira, 2003), etc.
• Structured prediction – parsing (Clark and Curran, 2004), semantic role labeling (Toutanova et al., 2005), etc.

Log-linear models
• Log-linear (a.k.a. maximum entropy) model:

$$p(y \mid x) = \frac{1}{Z(x)} \exp\left( \sum_i w_i f_i(x, y) \right), \qquad Z(x) = \sum_{y'} \exp\left( \sum_i w_i f_i(x, y') \right)$$

where $w_i$ is a weight, $f_i$ a feature function, and $Z(x)$ the partition function.
• Training – maximize the conditional likelihood of the training data
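To make this concrete, here is a minimal sketch of computing $p(y \mid x)$ (my own illustration, not from the slides; the dict-based feature representation is an assumption):

```python
import math

def log_linear_prob(weights, features, x, y, labels):
    """p(y | x) under a log-linear model: exp(w . f(x, y)) / Z(x).

    weights  -- dict mapping feature names to their weights w_i
    features -- function (x, label) -> dict of feature name -> value f_i(x, label)
    labels   -- all candidate labels, needed for the partition function Z(x)
    """
    def score(label):
        return sum(weights.get(name, 0.0) * value
                   for name, value in features(x, label).items())
    # Z(x) normalizes the exponentiated scores over every candidate label.
    z = sum(math.exp(score(label)) for label in labels)
    return math.exp(score(y)) / z
```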

Regularization
• To avoid overfitting to the training data – penalize the weights of the features
• L1 regularization
– Most of the weights become zero
– Produces sparse (compact) models
– Saves memory and storage
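Concretely, training then maximizes the L1-regularized conditional log-likelihood (standard form, consistent with the model on the previous slide):

$$L(\mathbf{w}) = \sum_{j=1}^{N} \log p(y_j \mid x_j; \mathbf{w}) \;-\; C \sum_i |w_i|$$

where $N$ is the number of training samples and $C > 0$ controls the strength of the penalty; the kink of $|w_i|$ at zero is what drives weights to become exactly zero.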

Training log-linear models
• Numerical optimization methods
– Gradient descent (steepest descent or hill climbing)
– Quasi-Newton methods (e.g. BFGS, OWL-QN)
– Stochastic gradient descent (SGD)
– etc.
• Training can take several hours (or even days), depending on the complexity of the model, the size of the training data, etc.

Gradient Descent (Hill Climbing)
[Figure: the objective as a function of the weights, with each step following the full gradient]

Stochastic Gradient Descent (SGD)
• Compute an approximate gradient using one training sample
[Figure: the objective curve, with noisy single-sample steps]
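Stated symbolically (a standard formulation; the slide itself shows only the picture): writing the objective as a sum of per-sample terms $L(\mathbf{w}) = \sum_{j=1}^{N} \ell_j(\mathbf{w})$, SGD approximates

$$\nabla L(\mathbf{w}) = \sum_{j=1}^{N} \nabla \ell_j(\mathbf{w}) \;\approx\; N \, \nabla \ell_{j_k}(\mathbf{w})$$

for a single sample $j_k$ drawn at each update.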

Stochastic Gradient Descent (SGD)
• Weight update procedure – very simple (similar to the Perceptron algorithm):

$$\mathbf{w}^{(k+1)} = \mathbf{w}^{(k)} + \eta_k \, \nabla_{\mathbf{w}} \left( \log p(y_{j_k} \mid x_{j_k}; \mathbf{w}^{(k)}) - \frac{C}{N} \sum_i |w_i^{(k)}| \right)$$

where $\eta_k$ is the learning rate and $j_k$ is the sample drawn at step $k$. The catch: the L1 term is not differentiable at zero.

Using subgradients
• Weight update procedure: replace the gradient of the L1 term with a subgradient, e.g. $\mathrm{sign}(w_i)$ with $\mathrm{sign}(0) = 0$:

$$w_i^{(k+1)} = w_i^{(k)} + \eta_k \left( \frac{\partial \log p(y_{j_k} \mid x_{j_k}; \mathbf{w}^{(k)})}{\partial w_i} - \frac{C}{N}\, \mathrm{sign}\!\left(w_i^{(k)}\right) \right)$$
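A sketch of this naive update (my variable names; a dense weight list is assumed):

```python
def sgd_l1_subgradient_step(w, grad_loglik, eta, c, n):
    """One naive SGD step with the L1 subgradient penalty.

    w           -- list of all weights (the penalty must touch every one)
    grad_loglik -- gradient of log p(y|x) for the sampled instance,
                   same length as w (zero for unused features)
    eta         -- learning rate; c, n -- L1 strength and #samples
    """
    for i in range(len(w)):
        sign = (w[i] > 0) - (w[i] < 0)   # subgradient of |w_i|, taken as 0 at zero
        w[i] += eta * (grad_loglik[i] - (c / n) * sign)
```

Note that the loop must run over every weight, not just the features active in the current sample — exactly the first problem listed on the next slide.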

Using subgradients
• Problems
– The L1 penalty needs to be applied to all features, including the ones that are not used in the current sample.
– Few weights become zero as a result of training.

Clipping-at-zero approach
[Figure: a weight w being clipped to zero when the penalty would push it past zero]
• Carpenter (2008)
• Special case of the FOLOS algorithm (Duchi and Singer, 2008) and the truncated gradient method (Langford et al., 2009)
• Enables lazy update

Clipping-at-zero approach
Take the gradient step on the log-likelihood alone, then apply the L1 penalty, clipping the weight at zero so the penalty cannot flip its sign:

$$w_i^{(k+\frac{1}{2})} = w_i^{(k)} + \eta_k \frac{\partial \log p(y_{j_k} \mid x_{j_k}; \mathbf{w}^{(k)})}{\partial w_i}$$

$$w_i^{(k+1)} = \begin{cases} \max\!\left(0,\; w_i^{(k+\frac{1}{2})} - \eta_k \frac{C}{N}\right) & \text{if } w_i^{(k+\frac{1}{2})} > 0 \\[4pt] \min\!\left(0,\; w_i^{(k+\frac{1}{2})} + \eta_k \frac{C}{N}\right) & \text{if } w_i^{(k+\frac{1}{2})} < 0 \end{cases}$$
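A sketch of one such step under the same assumptions as before:

```python
def sgd_l1_clipping_step(w, grad_loglik, eta, c, n):
    """One SGD step with the clipping-at-zero L1 penalty.

    Take the gradient step on the log-likelihood alone, then shrink
    each weight toward zero, never letting the penalty flip its sign.
    """
    penalty = eta * c / n
    for i in range(len(w)):
        w[i] += eta * grad_loglik[i]         # half step: likelihood only
        if w[i] > 0:
            w[i] = max(0.0, w[i] - penalty)  # clip at zero from above
        elif w[i] < 0:
            w[i] = min(0.0, w[i] + penalty)  # clip at zero from below
```

Because an idle weight receives the same shrinkage at every step, the accumulated penalty for the steps a feature sat unused can be applied in one batch the next time it fires — this is what makes the lazy update possible.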

Number of non-zero features

• Text chunking
  Quasi-Newton              18,109
  SGD (Naive)              455,651
  SGD (Clipping-at-zero)    87,792

• Named entity recognition
  Quasi-Newton              30,710
  SGD (Naive)            1,032,962
  SGD (Clipping-at-zero)   279,886

• Part-of-speech tagging
  Quasi-Newton              50,870
  SGD (Naive)            2,142,130
  SGD (Clipping-at-zero)   323,199

Why it does not produce sparse models
• In SGD, weights are not updated smoothly: each step uses a noisy single-sample gradient, so a weight that the penalty pushes near zero is typically bounced away again by the next update.
[Figure: a weight's trajectory oscillating near zero — it fails to become exactly zero, and the L1 penalty is wasted away]

Cumulative L1 penalty
• $u_k$: the absolute value of the total L1 penalty that each weight should have received up to step $k$:

$$u_k = \frac{C}{N} \sum_{t=1}^{k} \eta_t$$

• $q_i^{(k)}$: the total L1 penalty that has actually been applied to weight $i$:

$$q_i^{(k)} = \sum_{t=1}^{k} \left( w_i^{(t+1)} - w_i^{(t+\frac{1}{2})} \right)$$

Applying L1 with cumulative penalty
• Penalize each weight according to the difference between $u_k$ and $q_i^{(k-1)}$:

$$w_i^{(k+\frac{1}{2})} = w_i^{(k)} + \eta_k \frac{\partial \log p(y_{j_k} \mid x_{j_k}; \mathbf{w}^{(k)})}{\partial w_i}$$

$$w_i^{(k+1)} = \begin{cases} \max\!\left(0,\; w_i^{(k+\frac{1}{2})} - \left(u_k + q_i^{(k-1)}\right)\right) & \text{if } w_i^{(k+\frac{1}{2})} > 0 \\[4pt] \min\!\left(0,\; w_i^{(k+\frac{1}{2})} + \left(u_k - q_i^{(k-1)}\right)\right) & \text{if } w_i^{(k+\frac{1}{2})} < 0 \end{cases}$$

Implementation
• The whole update fits in about 10 lines of code.
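The slide shows the authors' code; the following Python sketch reconstructs the update from the two preceding slides (the class structure and names are mine):

```python
class CumulativeL1SGD:
    """SGD with the cumulative L1 penalty (a sketch; names are mine).

    c -- L1 regularization strength, n -- number of training samples.
    """

    def __init__(self, c, n):
        self.c, self.n = c, n
        self.u = 0.0   # total penalty every weight *should* have received
        self.q = {}    # total penalty *actually* applied, per feature
        self.w = {}    # sparse weight vector

    def step(self, grad_loglik, eta):
        """grad_loglik: sparse dict, feature -> d(log p(y|x))/dw_i."""
        self.u += eta * self.c / self.n
        for i, g in grad_loglik.items():
            wi = self.w.get(i, 0.0) + eta * g   # half step: likelihood only
            z, qi = wi, self.q.get(i, 0.0)
            if wi > 0:
                # q[i] is signed (negative after past downward shrinks), so
                # u + q[i] is the penalty still outstanding for this weight.
                wi = max(0.0, wi - (self.u + qi))
            elif wi < 0:
                wi = min(0.0, wi + (self.u - qi))
            self.q[i] = qi + (wi - z)           # record what was applied
            self.w[i] = wi
```

Only the features active in the current sample are touched, so each update stays as cheap as in the clipping approach.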

Experiments
• Model: conditional random fields (CRFs)
• Baseline: OWL-QN (Andrew and Gao, 2007)
• Tasks
– Text chunking (shallow parsing): CoNLL 2000 shared task data; recognize base syntactic phrases (e.g. NP, VP, PP)
– Named entity recognition: NLPBA 2004 shared task data; recognize names of genes, proteins, etc.
– Part-of-speech (POS) tagging: WSJ corpus (sections 0-18 for training)

CoNLL 2000 chunking task: objective
[Figure: objective value vs. training passes for each method]

CoNLL 2000 chunking: non-zero features
[Figure: number of non-zero features vs. training passes for each method]

CoNLL 2000 chunking
• Performance of the produced model (ED = exponential decay of the learning rate; see Discussions)

                                Passes   Obj.     # Features   Time (sec)   F-score
  OWL-QN                        160      -1.583   18,109       598          93.62
  SGD (Naive)                   30       -1.671   455,651      1,117        93.64
  SGD (Clipping + Lazy Update)  30       -1.671   87,792       144          93.65
  SGD (Cumulative)              30       -1.653   28,189       149          93.68
  SGD (Cumulative + ED)         30       -1.622   23,584       148          93.66

• Training is 4 times faster than OWL-QN
• The model is 4 times smaller than with the clipping-at-zero approach
• The objective is also slightly better

NLPBA 2004 named entity recognition

                                Passes   Obj.     # Features   Time (sec)   F-score
  OWL-QN                        160      -2.448   30,710       2,253        71.76
  SGD (Naive)                   30       -2.537   1,032,962    4,528        71.20
  SGD (Clipping + Lazy Update)  30       -2.538   279,886      585          71.20
  SGD (Cumulative)              30       -2.479   31,986       631          71.40
  SGD (Cumulative + ED)         30       -2.443   25,965       631          71.63

Part-of-speech tagging on WSJ

                                Passes   Obj.     # Features   Time (sec)   Accuracy
  OWL-QN                        124      -1.941   50,870       5,623        97.16
  SGD (Naive)                   30       -2.013   2,142,130    18,471       97.18
  SGD (Clipping + Lazy Update)  30       -2.013   323,199      1,680        97.18
  SGD (Cumulative)              30       -1.987   62,043       1,777        97.19
  SGD (Cumulative + ED)         30       -1.954   51,857       1,774        97.17

Discussions
• Convergence
– Demonstrated empirically
– The penalties applied are not i.i.d., so standard convergence arguments for SGD do not carry over directly
• Learning rate
– The need for tuning can be annoying
– Rule of thumb: exponential decay (passes = 30, alpha = 0.85); see the sketch below
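A sketch of such a schedule (the per-pass decay factor alpha matches the rule of thumb above; eta0 = 1.0 is an assumed starting value):

```python
def eta(k, n, eta0=1.0, alpha=0.85):
    """Exponentially decayed learning rate for update number k.

    After each full pass over the n training samples, the rate shrinks
    by a factor of alpha.
    """
    return eta0 * alpha ** (k / n)
```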

Conclusions
• Stochastic gradient descent training for L1-regularized log-linear models
– Force each weight to receive the total L1 penalty that would have been applied if the true (noiseless) gradient were available
• 3 to 4 times faster than OWL-QN
• Extremely easy to implement