Path Analysis & Sequence Modelling: Regularization of RNN through Bayesian Neural Networks
Vish Hawa, Principal Scientist, Vanguard. April 18, 2019

PROBLEM SETUP
• Marketing leads are exposed to numerous channels, so it is difficult to isolate how much credit should be attributed to each channel. This has a big impact on how much budget and effort should go into any particular channel.
• First Touch or Last Touch Attribution can over-weight the first or last touch channel and be sub-optimal in credit assignment.
• Are there sequences that are better at driving leads than others? Build a model, conduct path analysis, and produce channel-contribution weights for each path.
• RNNs have shown great promise on sequence modelling. Can they overcome memorization issues, especially when the data size is limited?

SEQUENCE COMBINATIONS: MARKETING CHANNELS
Example touch-point sequences:
A: ADV - EMAL - WEB - SR
B: MAIL - ADV - WEB - PHONE - SR
C: MAIL - ADV - WEB - PHONE - ADV
Touch Point Facts
• Touches range from 1 to 100+, with a median of 15 touches.
• Possible space: 7^15 ≈ 5 T paths (ordered sequences). Even at sequence length 10, 7^10 ≈ 300 M paths.
• Very sparse evidence.
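The path-space sizes quoted above can be checked directly; with 7 channels, an ordered sequence of length 15 already spans trillions of paths:

```python
# Sanity check of the combinatorics: 7 channels, ordered sequences.
channels = 7

paths_len_15 = channels ** 15   # ordered sequences of length 15
paths_len_10 = channels ** 10   # ordered sequences of length 10

print(f"{paths_len_15:,}")  # ~4.7 trillion paths
print(f"{paths_len_10:,}")  # ~282 million paths
```

This is why the evidence is so sparse: even a large marketing dataset observes only a vanishing fraction of the possible ordered paths.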

RNN AS SEQUENCE MODELER
RNN configuration:
• Stacking: 2 layers
• Seq. length: 15
• Dense: 2-layer with classifier
• Optimizer: RMSprop
• Loss: binary cross-entropy
• Input: 5-dim embedding of channels
• 240 units of GRU cells
• Dropout = 0.3
[Bar chart: RNN-predicted probability by path]
ADV: 40% | ADV - EM: 93% (increase of ~2 folds!) | ADV - EM - PHONE: drops to 20% | ADV - EM - PHONE - EM: 80% (up again)

RNN AS SEQUENCE MODELER: POINTS OF DISCONTINUITIES
[Bar chart: predicted probability (0-100%) across touch-point paths ADV, ADV-EM, ADV-EM-PHONE, ADV-EM-PHONE-EM, for Yes/No outcomes, showing sharp jumps at the extremities]
"Modeling of symptoms":
• The more skewed the strength of evidence, the higher the discontinuity.
• A higher dropout rate helps, but may not solve it.
• The true features (z) are latent and unobservable.
• The effect is prominent when the data size is small compared to the space of possibilities.
• Can we do better?

BAYESIAN NEURAL NETWORK AND LOSS FUNCTION
z: latent variable
x: new data point
p(k): predictive probability of class k
n: number of sampled points
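The variables defined on this slide combine into the Monte-Carlo predictive distribution commonly used for Bayesian neural networks; this is a sketch, assuming a variational posterior q(z | x) over the latent variable (the q is our assumption, not stated on the slide):

```latex
p(k \mid x) \;\approx\; \frac{1}{n} \sum_{i=1}^{n} p\left(k \mid x, z_i\right),
\qquad z_i \sim q(z \mid x)
```

Averaging the class probability over n sampled latent points both smooths the prediction and yields the per-path uncertainty estimates shown later in the deck.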

ARCHITECTURE: DISENTANGLED VARIATIONAL ENCODER
Information flow: Input layer -> Embedding layer -> Dense layers -> Reparametrized latent space (z, sampled) -> Stacked RNN -> Classifier
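The "reparametrized latent space (z, sampled)" step can be sketched in plain Python, assuming a diagonal Gaussian q(z | x) whose (mu, log_var) come from the dense layers; the function name and shapes are ours, not the talk's code:

```python
import math
import random

def reparameterize(mu, log_var, rng=random):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1).

    Writing the sample this way keeps it differentiable with respect
    to mu and log_var, which is what makes the encoder trainable.
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

mu = [0.1, -0.3, 0.0]
log_var = [0.0, 0.0, 0.0]      # sigma = 1 in every dimension
z = reparameterize(mu, log_var)
print(len(z))                  # one latent sample, same dimension as mu
```

At training time one such z is drawn per forward pass; at inference time, drawing n samples per path produces the mean/variance estimates reported later.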

BAYESIAN INFERENCE QUERIES
Inference on Bayesian networks consists of:
• Conditional probability queries (CPQ) given some evidence (marginals).
• Maximum a posteriori (MAP) queries: finding the configuration q* of variables/nodes that has the highest posterior probability.
[Network diagram: ADV, DM, Web, PH -> Client Y/N. Example path: ADV/DM - PH - PAS]
• Rescue: regularize through PGMs.
• Make use of conditional probabilities ~ Bayesian networks.
• Solution: for each path, calculate probabilistic estimates. CPQ example: P(Client = Y | Web = 0).
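The shape of a CPQ like P(Client = Y | Web = 0) can be shown with a toy hand-specified network ADV -> Web -> Client (Client depending on both ADV and Web) and full enumeration; all the CPT numbers below are made-up illustration values, not the talk's fitted network:

```python
P_ADV = {1: 0.3, 0: 0.7}                        # P(ADV)
P_WEB_GIVEN_ADV = {1: 0.6, 0: 0.2}              # P(Web=1 | ADV)
P_CLIENT = {(1, 1): 0.5, (1, 0): 0.2,           # P(Client=1 | ADV, Web)
            (0, 1): 0.3, (0, 0): 0.05}

def joint(adv, web, client):
    """Full joint probability P(ADV, Web, Client) by the chain rule."""
    p = P_ADV[adv]
    p *= P_WEB_GIVEN_ADV[adv] if web else 1 - P_WEB_GIVEN_ADV[adv]
    p *= P_CLIENT[(adv, web)] if client else 1 - P_CLIENT[(adv, web)]
    return p

def cpq(client, web):
    """Conditional probability query P(Client = client | Web = web)."""
    num = sum(joint(a, web, client) for a in (0, 1))          # sum out ADV
    den = sum(joint(a, web, c) for a in (0, 1) for c in (0, 1))
    return num / den

print(round(cpq(1, 0), 4))    # P(Client = 1 | Web = 0)
```

Brute-force enumeration is exponential in the number of nodes, so a real system would use variable elimination or a PGM library, but the query semantics are the same.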

REGULARIZED MODEL: RNN + REGULARIZATION
[Bar chart: RNN vs. RNN/BN predicted probability by path (ADV, ADV-EM, ADV-EM-PHONE, ADV-EM-PHONE-EM, Yes/No)]
ADV - EM: 40% | ADV - EM - PHONE: 25% | ADV - EM - PHONE - EM: 30%
Maintains a consistent range over probabilities.

UNCERTAINTY ESTIMATES
ADV: mean 0.25, var = 20%
ADV - EM: mean 0.4, var = 15%
ADV - EM - PHONE: mean 0.25, var = 10%
ADV - EM - PHONE - EM: mean 0.3, var = 5%
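These per-path figures arise by scoring each path with n stochastic forward passes (one sampled z each) and reporting the sample mean and variance of the scores; the sampled values below are illustrative stand-ins, not the talk's model outputs:

```python
import statistics

def summarize(samples):
    """Mean and (population) variance of sampled predictive scores."""
    return statistics.mean(samples), statistics.pvariance(samples)

# Hypothetical sampled p(Client=1) values for one path across 5 passes.
scores = [0.22, 0.28, 0.25, 0.31, 0.19]
mean, var = summarize(scores)
print(round(mean, 3), round(var, 4))
```

A wide variance (like the 20% on the single-touch ADV path) signals that the model's estimate rests on thin evidence, which is exactly where the Bayesian treatment pays off.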

LATENT SPACE VECTORS
[2-D plot of latent-space vectors for paths/channels such as WO, PH, EM-PH, PH-EM, EM, ADV, with clusters A and B]
Inferences drawn:
• 'Next Best Action': the most influential channel touch to provide.
• Optimal sequence: the sequence of touches with maximum effect towards the desired objective (may require an optimization procedure).

TAKEAWAYS

REFERENCES
• Bayesian Neural Nets: https://arxiv.org/abs/1505.05424
• InfoVAE: Information Maximizing Variational Autoencoders: https://arxiv.org/pdf/1706.02262.pdf
• Understanding disentangling in β-VAE: https://arxiv.org/abs/1804.03599
• Auto-Encoding Variational Bayes: https://arxiv.org/abs/1312.6114
• What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?: http://papers.nips.cc/paper/7141-what-uncertainties-do-we-need-in-bayesian-deep-learning-for-computer-vision.pdf
• ELBO Surgery: http://approximateinference.org/accepted/HoffmanJohnson2016.pdf

THANK YOU

APPENDIX

COMBINED: RNN + BAYES-NET
• Query the network for conditional probability queries (CPQ).
• BN: for each path/sequence, compute P(Client=1 | path/sequence).
• Compute the label as a dot product: Client * BN.
• RNN loss function: -Σᵢ (Client * BN) log(ŷᵢ)
• Truncated sequences for better generalization, k ∈ {5, 10}.

ID | CLIENT | N | SEQUENCE (k=5) | BN score | RNN score
1000007488 | 1 | 4 | EMAL Web Web | 0.41 | 0.66
1000038000 | 1 | 15 | EMAL ADV Web Web | 0.28 | 0.43
1000041267 | 1 | 11 | ADV EMAL Web Web | 0.11 | 0.41

Ordered path score: regularized by the BN and computed by the RNN, based on the evidence observed.
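The label construction and loss above can be sketched in plain Python; the BN and RNN scores below are the three example rows from this slide, while the helper name is ours:

```python
import math

def regularized_loss(clients, bn_scores, y_hat):
    """Soft-label cross-entropy: label = Client * BN,
    loss = -sum(label * log(y_hat)), as on the slide."""
    return -sum(c * bn * math.log(p)
                for c, bn, p in zip(clients, bn_scores, y_hat))

clients = [1, 1, 1]
bn = [0.41, 0.28, 0.11]        # P(Client=1 | path) from the Bayes net
y_hat = [0.66, 0.43, 0.41]     # RNN scores for the same sequences
print(round(regularized_loss(clients, bn, y_hat), 3))
```

Because the hard Client label is down-weighted by the Bayes-net path probability, the RNN is no longer rewarded for memorizing rare sequences: a path with weak BN evidence contributes little gradient even when its Client label is 1.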

FEATURE SELECTION
Stacked RNN (Embedding layer -> Dense layer, information flow) with Differential Evolution Optimization.
Gaussian Bayesian network over the channels ADV, BDG, EM, PAS, PH, WO, DM, Web.
Bayesian score -> dot product -> ordered score -> optimal path selection.

STACKED RNN/LSTM
• RNNs will memorize sequences.
• Worse when the data size is small compared with the possible space.
• Similar performance in out-of-sample cases.

POID | CLIENT | COUNTS | SEQUENCE (k=5) | Score
1000007488 | 0 | 4 | EMAL Web Web | 0.09
1000038000 | 0 | 15 | EMAL ADV Web Web | 0.43
1000041267 | 1 | 11 | ADV EMAL Web Web | 0.00

Approach: external regularization can perform better than tune-up.

• Stacking: 2 layers
• Seq. length: 5
• Dense: 2-layer with linear output
• Optimizer: RMSprop
• Loss: MAE
• Input: 5-dim embedding of channels
• 64 units of LSTM cells
• Dropout = 0.3

Bayes rule: Posterior ∝ Likelihood * Prior
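The Bayes rule above (posterior proportional to likelihood times prior) can be worked numerically; the class names and probabilities here are made-up illustration values:

```python
# Posterior over a binary outcome, given made-up priors and
# likelihoods of the observed path under each class.
priors = {"convert": 0.2, "not": 0.8}
likelihoods = {"convert": 0.7, "not": 0.1}   # P(observed path | class)

unnorm = {k: likelihoods[k] * priors[k] for k in priors}
total = sum(unnorm.values())                  # normalizing constant
posterior = {k: v / total for k, v in unnorm.items()}

print(round(posterior["convert"], 4))         # 0.14 / 0.22
```

The proportionality sign hides exactly this normalization step: the products likelihood * prior are rescaled so the posterior sums to one.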

Illustration: SEQUENCE OF TOUCHPOINTS
Example touch-point sequences:
A: ADV - EMAL - WEB - BDG
B: MAIL - ADV - WEB - OFFER - WEB - PHONE - BDG
C: MAIL - ADV - WEB - PHONE - ADV
[Chart annotations: Falling Interest Rates, Surging Investor Sentiment, Low Market Volatility]
PATH FUN FACTS
• Over 200 K path combinations observed.
• Touches range from 1 to 50+, mostly 15 touches or less.
• Possible space: 12^15 paths (the universe).
Clients can be sensitive to channel types, the macro-economic climate, and their own changing needs.

COMBINED ARCHITECTURE
Stacked RNN (Embedding layer -> Dense layer, information flow) combined with a Bayesian network (Gaussian net) over the channels ADV, BDG, EM, PH, WO, DM, Web -> Client Y/N.
Bayesian score -> dot product -> ordered score (optimal path selection).

SECTIONS