Decoupling Learning Rates Using Empirical Bayes Sareh Nabi

Decoupling Learning Rates Using Empirical Bayes Sareh Nabi University of Washington Work done while at Amazon Paper in collaboration with: Houssam Nassif, Joseph Hong, Hamed Mamani, Guido Imbens

Biking competition Stop biking pointlessly, join team Circle! • • • Join now! Year-round carousel rides! Unlimited laps! All you can eat Pi! No corners to hide! Hurry, don’t linger around! Check other teams 2

Multiple-slot template Stop biking pointlessly, join team Circle! Slot • • • Join now! Year-round carousel rides! Unlimited laps! All you can eat Pi! No corners to hide! Hurry, don’t linger around! Check other teams 3

Template combinatorics: 48 layouts Stop biking pointlessly, join team Circle! x 2 x 3 • Best way to bicycle! • We cycle and re-cycle! • No point in not joining! Join now! Check other teams x 2 x 2 4

Second-order features Stop biking pointlessly, join team Circle! • • • Join now! Year-round carousel rides! Unlimited laps! All you can eat Pi! No corners to hide! Hurry, don’t linger around! Check other teams 5

Featurization • First-order features, X 1: variant values • E. g. “image 1”, “title 2”, “color 1” • Second-order features, X 2: variant combinations/interactions • E. g. “image 1 AND title 2”, “image 1 AND color 1” • Concatenate for final feature vector representation: X = [X 1, X 2] 6

Decoupling learning rates • 7

Generalized Linear Models (GLM) • 8

Bayesian GLM • 9

Bayesian GLM weight updates • + data = wi, t prior wi, t+1 posterior 10

Empirical Bayes Approach 11

Impose Bayesian hierarchy • 12

Model assumptions • 13

Empirical prior estimation • 14

Variance intuition • Between-features variance Within-features variance Estimation noise 15

Empirical Bayes (EB) algorithm • 16

Experiments and Results 17

Experimental setup • 18

When to compute empirical prior • Longer reset time helps more • Empirical prior helps, even with one batch 19

Lasting effect on small batches 20

21

Live experiment 22

Discussion • 23

Takeaways • Empirical Bayes prior helps, mostly on low/medium traffic • Promising, as low traffic cases are hardest to optimize • Effectively decouples learning rates • Model converges faster • Works for any feature grouping • Applies to contextual and personalization use cases 24

Thank you! Questions? Comments? 25