Generative Topic Models
Outline
• Naive Bayes model
• Mixture models
• PLSA
• LDA
Multinomial Naïve Bayes
• For each document d = 1, …, M
  • Generate C_d ~ Mult(· | π)
  • For each position n = 1, …, N_d
    • Generate w_n ~ Mult(· | β, C_d)
[Plate diagram: class node C with observed word nodes w_1, w_2, w_3, …, w_N, repeated over M documents]
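To make the generative story concrete, here is a minimal sampling sketch in Python/NumPy. The class prior pi and per-class word distributions beta are made-up toy values, not parameters from the slides.

```python
# Toy sketch of the multinomial Naive Bayes generative process.
import numpy as np

rng = np.random.default_rng(0)

V = 5                                # vocabulary size (toy assumption)
pi = np.array([0.6, 0.4])            # P(C): prior over 2 classes
beta = np.array([                    # P(w | C): one row per class
    [0.5, 0.2, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.2, 0.5],
])

def generate_doc(n_words):
    c = rng.choice(len(pi), p=pi)                    # C_d ~ Mult(. | pi)
    words = rng.choice(V, size=n_words, p=beta[c])   # w_n ~ Mult(. | beta, C_d)
    return c, words

c, words = generate_doc(10)
print(f"class={c}, words={words}")
```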
Multinomial Naïve Bayes
• Naïve Bayes model: compact representation
[Plate diagrams: the expanded model with individual word nodes w_1, w_2, w_3, …, w_N, next to the compact version with a single word node w inside a plate of size N, nested in a plate of size M]
Mixture Model
• Mixture model: unsupervised naïve Bayes model
• Joint probability of words and classes:
  P(w_1, …, w_N, c) = P(c) ∏_n P(w_n | c)
• But classes are not visible, so we marginalize over them:
  P(w_1, …, w_N) = Σ_c P(c) ∏_n P(w_n | c)
[Plate diagram: latent class node z with observed word node w, plates of size N and M]
PLSA Topic Models
• Probabilistic Latent Semantic Analysis model
• Select document d from a multinomial over the training documents
• For each position n = 1, …, N_d
  • Generate z_n ~ Mult(· | θ_d), where θ_d is the document's topic distribution
  • Generate w_n ~ Mult(· | φ_{z_n})
[Plate diagram: d → z → w, with plates of size N and M]
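A minimal sketch of PLSA's generative story, assuming toy values for the document prior, theta, and phi (in PLSA these are learned parameters, not sampled quantities):

```python
# Toy sketch of the PLSA generative process.
import numpy as np

rng = np.random.default_rng(0)

V, K, M = 5, 2, 3                            # vocab size, topics, training docs
p_d = np.full(M, 1.0 / M)                    # P(d): over *training* documents only
theta = rng.dirichlet(np.ones(K), size=M)    # theta_d: per-document topic mixture
phi = rng.dirichlet(np.ones(V), size=K)      # phi_z: per-topic word distribution

d = rng.choice(M, p=p_d)                     # select a training document
for n in range(8):
    z = rng.choice(K, p=theta[d])            # z_n ~ Mult(. | theta_d)
    w = rng.choice(V, p=phi[z])              # w_n ~ Mult(. | phi_{z_n})
    print(f"token {n}: topic={z}, word={w}")
```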
PLSA Topic Models
• Probabilistic Latent Semantic Analysis model
  – Learning using EM (a toy sketch follows below)
  – Not a complete generative model
    • It only defines a distribution over the training set of documents: no new document can be generated!
  – Nevertheless, more realistic than the mixture model
    • Documents can discuss multiple topics!
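The EM updates mentioned above can be written compactly. The following is an illustrative sketch on a toy term-document count matrix, not a production implementation: the E-step computes the topic posterior P(z | d, w) ∝ P(z | d) P(w | z), and the M-step re-estimates both distributions from the expected counts.

```python
# Toy sketch of EM for PLSA on a small term-document count matrix.
import numpy as np

rng = np.random.default_rng(0)

n_dw = np.array([[2, 1, 0, 0],       # toy counts n(d, w)
                 [0, 1, 2, 1],
                 [1, 0, 1, 2]], dtype=float)
M, V = n_dw.shape
K = 2                                # number of topics (toy assumption)

theta = rng.dirichlet(np.ones(K), size=M)   # P(z | d)
phi = rng.dirichlet(np.ones(V), size=K)     # P(w | z)

for it in range(100):
    # E-step: P(z | d, w) proportional to P(z | d) * P(w | z)
    post = theta[:, :, None] * phi[None, :, :]        # shape (M, K, V)
    post /= post.sum(axis=1, keepdims=True)
    # M-step: re-estimate theta and phi from expected counts n(d,w) * P(z|d,w)
    exp_counts = n_dw[:, None, :] * post
    theta = exp_counts.sum(axis=2)
    theta /= theta.sum(axis=1, keepdims=True)
    phi = exp_counts.sum(axis=0)
    phi /= phi.sum(axis=1, keepdims=True)

print("P(z|d):\n", theta.round(2))
```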
PLSA Topic Models
• PLSA topics (TDT-1 corpus)
[Figure: example topics extracted by PLSA from the TDT-1 corpus]
LDA Topic Models
LDA Topic Models
• Latent Dirichlet Allocation
• For each document d = 1, …, M
  • Generate θ_d ~ Dir(· | α)
  • For each position n = 1, …, N_d
    • Generate z_n ~ Mult(· | θ_d)
    • Generate w_n ~ Mult(· | φ_{z_n})
[Plate diagram: α → θ → z → w, with plates of size N and M]
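The LDA generative process, sketched in Python/NumPy with illustrative values for alpha and the topic-word distributions. Unlike PLSA, each document draws its own topic mixture from the Dirichlet prior, so new documents can be generated.

```python
# Toy sketch of the LDA generative process.
import numpy as np

rng = np.random.default_rng(0)

V, K, M = 5, 2, 3                          # vocab size, topics, documents
alpha = np.full(K, 0.1)                    # Dirichlet prior on topic mixtures
phi = rng.dirichlet(np.ones(V), size=K)    # per-topic word distributions

for d in range(M):
    theta_d = rng.dirichlet(alpha)              # theta_d ~ Dir(. | alpha)
    words = []
    for n in range(8):
        z = rng.choice(K, p=theta_d)            # z_n ~ Mult(. | theta_d)
        words.append(rng.choice(V, p=phi[z]))   # w_n ~ Mult(. | phi_{z_n})
    print(f"doc {d}: {words}")
```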
Introduction to Topic Models
• Latent Dirichlet Allocation
  – Overcomes the issues with PLSA
    • Can generate new, unseen documents
  – Parameter learning:
    • Variational EM
      – Numerical approximation using lower bounds
      – Results in biased solutions
      – Convergence has numerical guarantees
    • Gibbs sampling (sketched after this slide)
      – Stochastic simulation
      – Unbiased solutions
      – Stochastic convergence
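As a concrete example of the second learning method, here is a minimal collapsed Gibbs sampling sketch for LDA on a toy corpus; the corpus, K, alpha, and beta are illustrative choices. Each sweep removes a token's topic assignment, samples a new one from P(z_i = k | rest) ∝ (n_dk + α)(n_kw + β) / (n_k + Vβ), and adds it back.

```python
# Toy sketch of collapsed Gibbs sampling for LDA.
import numpy as np

rng = np.random.default_rng(0)

docs = [[0, 1, 1, 2], [2, 3, 3, 4], [0, 1, 4, 4]]   # toy word ids
V, K = 5, 2
alpha, beta = 0.1, 0.01

# Random initial topic assignments plus count tables.
z = [[rng.integers(K) for _ in doc] for doc in docs]
n_dk = np.zeros((len(docs), K))      # doc-topic counts
n_kw = np.zeros((K, V))              # topic-word counts
n_k = np.zeros(K)                    # topic totals
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        k = z[d][i]
        n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1

for it in range(200):                # Gibbs sweeps
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]              # remove the current assignment
            n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
            # P(z_i = k | rest) ~ (n_dk + alpha) * (n_kw + beta) / (n_k + V*beta)
            p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
            k = rng.choice(K, p=p / p.sum())
            z[d][i] = k              # add the new assignment back
            n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1

print("topic-word counts:\n", n_kw)
```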
LDA Topic Models
• LDA topics
[Figure: example topics learned by LDA]
LDA Topic Models
• LDA's view of a document
[Figure: example document illustrating LDA's view]
Introduction to Topic Models
• Perplexity comparison of various models (lower is better)
[Figure: held-out perplexity curves for the unigram model, mixture of unigrams, PLSA, and LDA]
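For reference, perplexity in such comparisons is the exponential of the negative average per-token log-likelihood on held-out text, which is why lower is better. A minimal sketch, where log_p_tokens is a hypothetical array of per-token log probabilities under some trained model:

```python
# Toy sketch of the perplexity computation.
import numpy as np

def perplexity(log_p_tokens):
    # log_p_tokens: log P(w_n | model) for every held-out token
    return float(np.exp(-np.mean(log_p_tokens)))

# A model assigning probability 1/1000 to each token has perplexity 1000.
print(perplexity(np.log(np.full(50, 1e-3))))  # -> 1000.0
```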