Generative Topic Models for Community Analysis Ramesh Nallapati

























Generative Topic Models for Community Analysis
Ramesh Nallapati
9/18/2007, 10-802: Guest Lecture

Objectives
• Provide an overview of topic models and their learning techniques
  – Mixture models, PLSA, LDA
  – EM, variational EM, Gibbs sampling
• Convince you that topic models are an attractive framework for community analysis
  – 5 definitive papers

Outline
• Part I: Introduction to Topic Models
  – Naive Bayes model
  – Mixture models
    • Expectation Maximization
  – PLSA
  – LDA
    • Variational EM
    • Gibbs sampling
• Part II: Topic Models for Community Analysis
  – Citation modeling with PLSA
  – Citation modeling with LDA
  – Author Topic Model
  – Author Topic Recipient
  – Modeling influence of citations
  – Mixed membership Stochastic Block Model

Introduction to Topic Models
• Multinomial Naïve Bayes
  – For each document d = 1, …, M
    • Generate C_d ~ Mult(· | π)
    • For each position n = 1, …, N_d
      – Generate w_n ~ Mult(· | β, C_d)
  – Graphical model: class node C with children W_1, W_2, W_3, …, W_N, inside a plate of size M

Introduction to Topic Models
• Naïve Bayes model: compact representation
  – The children W_1, W_2, W_3, …, W_N of C collapse into a single node W inside a plate of size N, nested in the plate of size M

Introduction to Topic Models
• Multinomial naïve Bayes: learning
  – Maximize the log-likelihood of the observed variables w.r.t. the parameters
    • Convex optimization problem: global optimum
    • Solution: closed-form, relative-frequency estimates of the parameters

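The closed-form solution for multinomial naïve Bayes amounts to counting. A minimal sketch, assuming tokenized documents with observed class labels; the function name `fit_naive_bayes`, the toy corpus, and the optional `smoothing` pseudo-count are illustrative assumptions, not part of the lecture:

```python
from collections import Counter, defaultdict

def fit_naive_bayes(docs, labels, smoothing=0.0):
    """Relative-frequency (maximum-likelihood) estimates for multinomial naive Bayes.

    docs: list of token lists; labels: one class label per document.
    smoothing=0.0 gives the plain MLE; a positive value adds pseudo-counts.
    """
    vocab = sorted({w for doc in docs for w in doc})
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, c in zip(docs, labels):
        word_counts[c].update(doc)

    # Class prior: fraction of documents carrying each label.
    prior = {c: n / len(docs) for c, n in class_counts.items()}
    # Word distribution per class: fraction of that class's tokens equal to each word.
    word_prob = {}
    for c in class_counts:
        total = sum(word_counts[c].values()) + smoothing * len(vocab)
        word_prob[c] = {w: (word_counts[c][w] + smoothing) / total for w in vocab}
    return prior, word_prob

# Hypothetical toy corpus.
docs = [["goal", "match", "goal"], ["vote", "party"], ["match", "team"]]
labels = ["sports", "politics", "sports"]
prior, word_prob = fit_naive_bayes(docs, labels)
```

With `smoothing=0.0` any word unseen in a class gets probability zero, which is why a small pseudo-count is often added in practice.
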
Introduction to Topic Models
• Mixture model: unsupervised naïve Bayes model
  – Joint probability of words and classes: P(w_1, …, w_N, z) = P(z) ∏_n P(w_n | z)
  – But classes are not visible, so they are summed out: P(w_1, …, w_N) = Σ_z P(z) ∏_n P(w_n | z)
  – Graphical model: the observed class C is replaced by a latent class Z

Introduction to Topic Models
• Mixture model: learning
  – Not a convex problem
    • No closed-form global optimum
  – Solution: Expectation Maximization
    • Iterative algorithm
    • Finds a local optimum
    • Guaranteed to maximize a lower bound on the log-likelihood of the observed data

Introduction to Topic Models
• Quick summary of EM:
  – Log is a concave function: log(0.5 x_1 + 0.5 x_2) ≥ 0.5 log(x_1) + 0.5 log(x_2)
  – Applying this inequality to the log-likelihood gives a lower bound that includes an entropy term H
  – Optimizing the lower bound is a convex problem in each variable!
  – Optimize this lower bound w.r.t. each variable instead

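The lower bound used by EM can be written out with Jensen's inequality. The notation here is an assumption, since the slide's own formula was not recovered: q(z) is any distribution over the hidden class and Θ stands for all model parameters.

```latex
\log p(w \mid \Theta)
  = \log \sum_{z} q(z)\,\frac{p(w, z \mid \Theta)}{q(z)}
  \;\ge\; \sum_{z} q(z)\,\log p(w, z \mid \Theta) \;+\; H(q),
\qquad
H(q) = -\sum_{z} q(z)\,\log q(z).
```

The bound is tight when q(z) equals the posterior p(z | w, Θ), which is what the E-step computes; the M-step then maximizes the right-hand side over Θ with q held fixed.
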
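Introduction to Topic Models
• Mixture model: EM solution
  – E-step: compute the posterior over each document's class under the current parameters
  – M-step: re-estimate the parameters from the posterior-weighted counts
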
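The two EM steps for the mixture model can be sketched as below. This is an illustrative implementation under stated assumptions (documents given as lists of integer word ids, random initialization, and a tiny pseudo-count `eps` to keep logarithms finite), not the lecture's own code:

```python
import math
import random

def em_multinomial_mixture(docs, vocab_size, num_classes, iters=50, seed=0):
    """EM for a mixture of multinomials (unsupervised naive Bayes).

    docs: list of documents, each a list of word ids in range(vocab_size).
    Returns (prior, word_prob, resp) where resp[d][k] = P(class k | document d).
    """
    rng = random.Random(seed)
    prior = [1.0 / num_classes] * num_classes
    # Random initialization breaks the symmetry between classes.
    word_prob = []
    for _ in range(num_classes):
        row = [rng.random() + 0.5 for _ in range(vocab_size)]
        total = sum(row)
        word_prob.append([v / total for v in row])

    resp = []
    for _ in range(iters):
        # E-step: posterior over classes for each document, computed in log space.
        resp = []
        for doc in docs:
            log_post = [math.log(prior[k]) + sum(math.log(word_prob[k][w]) for w in doc)
                        for k in range(num_classes)]
            m = max(log_post)
            unnorm = [math.exp(lp - m) for lp in log_post]
            z = sum(unnorm)
            resp.append([u / z for u in unnorm])

        # M-step: re-estimate parameters from posterior-weighted counts.
        eps = 1e-9  # tiny pseudo-count (an assumption) so no probability is exactly zero
        for k in range(num_classes):
            weight = sum(r[k] for r in resp)
            prior[k] = (weight + eps) / (len(docs) + eps * num_classes)
            counts = [eps] * vocab_size
            for doc, r in zip(docs, resp):
                for w in doc:
                    counts[w] += r[k]
            total = sum(counts)
            word_prob[k] = [c / total for c in counts]
    return prior, word_prob, resp

# Hypothetical toy corpus over a 4-word vocabulary.
docs = [[0, 0, 1], [0, 1, 1], [2, 3, 3], [3, 2, 2]]
prior, word_prob, resp = em_multinomial_mixture(docs, vocab_size=4, num_classes=2)
```

Because EM only finds a local optimum, different seeds can give different clusterings; running several restarts and keeping the best likelihood is a common remedy.
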
Introduction to Topic Models
• Probabilistic Latent Semantic Analysis (PLSA) model
  – Select document d ~ Mult(π)
  – For each position n = 1, …, N_d
    • Generate z_n ~ Mult(· | θ_d), where θ_d is the topic distribution of document d
    • Generate w_n ~ Mult(· | z_n)

Introduction to Topic Models
• Probabilistic Latent Semantic Analysis model
  – Learning using EM
  – Not a complete generative model
    • Has a distribution only over the training set of documents: no new document can be generated!
  – Nevertheless, more realistic than the mixture model
    • Documents can discuss multiple topics!

Introduction to Topic Models
• PLSA topics (TDT-1 corpus)

Introduction to Topic Models
• Latent Dirichlet Allocation (LDA)
  – For each document d = 1, …, M
    • Generate θ_d ~ Dir(· | α)
    • For each position n = 1, …, N_d
      – Generate z_n ~ Mult(· | θ_d)
      – Generate w_n ~ Mult(· | z_n)

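The LDA generative process can be simulated directly. A minimal sketch, assuming fixed-length documents and a hand-written topic-word table; the helper names and the toy `topic_word` values are hypothetical:

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from Dirichlet(alpha) by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [g / total for g in draws]

def sample_categorical(probs, rng):
    """Draw an index with the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_lda_corpus(num_docs, doc_len, alpha, topic_word, seed=0):
    """Follow the LDA generative story for documents of fixed length doc_len."""
    rng = random.Random(seed)
    corpus = []
    for _ in range(num_docs):
        theta = sample_dirichlet(alpha, rng)             # per-document topic mixture
        doc = []
        for _ in range(doc_len):
            z = sample_categorical(theta, rng)           # topic for this position
            w = sample_categorical(topic_word[z], rng)   # word drawn from that topic
            doc.append(w)
        corpus.append(doc)
    return corpus

topic_word = [[0.5, 0.5, 0.0, 0.0],   # hypothetical topic 0 over a 4-word vocabulary
              [0.0, 0.0, 0.5, 0.5]]   # hypothetical topic 1
corpus = generate_lda_corpus(num_docs=3, doc_len=8, alpha=[0.5, 0.5], topic_word=topic_word)
```

Unlike PLSA, the topic mixture is drawn fresh from the Dirichlet prior for every document, so the same code can generate documents that were never in a training set.
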
Introduction to Topic Models
• Latent Dirichlet Allocation
  – Overcomes the issues with PLSA
    • Can generate any random document
  – Parameter learning:
    • Variational EM
      – Numerical approximation using lower bounds
      – Results in biased solutions
      – Convergence has numerical guarantees
    • Gibbs sampling
      – Stochastic simulation
      – Unbiased solutions
      – Stochastic convergence

Introduction to Topic Models
• Variational EM for LDA
  – Approximate the posterior by a simpler distribution
    • A convex problem in each parameter!

Introduction to Topic Models
• Gibbs sampling
  – Applicable when the joint distribution is hard to evaluate but the conditional distributions are known
  – The sequence of samples comprises a Markov chain
  – The stationary distribution of the chain is the joint distribution

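For LDA, a widely used instance of this idea is the collapsed Gibbs sampler, which resamples one topic assignment at a time from its conditional distribution given all the others. The sketch below assumes symmetric Dirichlet priors and integer word ids; the slide does not specify a particular sampler, so the function name and default values are illustrative:

```python
import random

def lda_gibbs(docs, vocab_size, num_topics, alpha=0.1, beta=0.01, sweeps=200, seed=0):
    """Collapsed Gibbs sampling for LDA with symmetric Dirichlet priors.

    docs: list of documents, each a list of word ids.
    Returns (doc_topic, topic_word) count tables from the final sweep.
    """
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assign = []
    for d, doc in enumerate(docs):                 # random initial assignments
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(zs)

    for _ in range(sweeps):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                z = assign[d][n]                   # remove the current assignment
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Conditional P(z_n = k | other assignments, words), up to a constant.
                weights = [(doc_topic[d][k] + alpha)
                           * (topic_word[k][w] + beta)
                           / (topic_total[k] + beta * vocab_size)
                           for k in range(num_topics)]
                z = rng.choices(range(num_topics), weights=weights, k=1)[0]
                assign[d][n] = z                   # record the new assignment
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return doc_topic, topic_word

# Hypothetical toy corpus over a 4-word vocabulary.
docs = [[0, 1, 0, 1], [1, 0, 0], [2, 3, 3, 2], [3, 2, 3]]
doc_topic, topic_word = lda_gibbs(docs, vocab_size=4, num_topics=2)
```

Successive sweeps form the Markov chain mentioned above; in practice early sweeps are discarded as burn-in and estimates are averaged over later samples.
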
Introduction to Topic Models
• LDA topics

Introduction to Topic Models
• LDA’s view of a document

Introduction to Topic Models
• Perplexity comparison of various models: unigram, mixture model, PLSA, LDA (lower is better)

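Perplexity is the exponentiated negative average log-likelihood per held-out token. A small sketch, where the callback `word_prob(doc_index, word)` is an assumed interface standing in for whichever model is being evaluated:

```python
import math

def perplexity(docs, word_prob):
    """exp(-average log-likelihood per token); word_prob(doc_index, word) -> probability."""
    log_lik = 0.0
    num_tokens = 0
    for d, doc in enumerate(docs):
        for w in doc:
            log_lik += math.log(word_prob(d, w))
            num_tokens += 1
    return math.exp(-log_lik / num_tokens)

# Sanity check: a uniform model over V words has perplexity V (here V = 4).
docs = [[0, 1, 2], [3, 3]]
uniform = lambda d, w: 1.0 / 4
uniform_perplexity = perplexity(docs, uniform)
```

A model that concentrates probability on the words that actually occur scores below the uniform baseline, which is why lower is better.
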
Introduction to Topic Models
• Summary
  – Generative models for exchangeable data
  – Unsupervised models
  – Automatically discover topics
  – Well-developed approximate techniques available for inference and learning

Outline
• Part I: Introduction to Topic Models
  – Naive Bayes model
  – Mixture models
    • Expectation Maximization
  – PLSA
  – LDA
    • Variational EM
    • Gibbs sampling
• Part II: Topic Models for Community Analysis
  – Citation modeling with PLSA
  – Citation modeling with LDA
  – Author Topic Model
  – Author Topic Recipient
  – Modeling influence of citations
  – Mixed membership Stochastic Block Model

Hyperlink modeling using PLSA

Hyperlink modeling using PLSA [Cohn and Hofmann, NIPS, 2001]
• Select document d ~ Mult(π)
• For each position n = 1, …, N_d
  – Generate z_n ~ Mult(· | θ_d)
  – Generate w_n ~ Mult(· | z_n)
• For each citation j = 1, …, L_d
  – Generate z_j ~ Mult(· | θ_d)
  – Generate c_j ~ Mult(· | z_j)

Hyperlink modeling using PLSA [Cohn and Hofmann, NIPS, 2001]
• PLSA likelihood (words only)
• New likelihood (words and citations)
• Learning using EM

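The slide's own formulas were not recovered, but the generative process for words and citations implies likelihoods of the following form. This is a reconstruction under that assumption, writing P(z | d) for the document's topic mixture; it is not necessarily the exact expression shown in the lecture.

```latex
% Words only (PLSA):
P(\mathbf{w}_d \mid d) = \prod_{n=1}^{N_d} \sum_{z} P(z \mid d)\, P(w_n \mid z)

% Words and citations, sharing the same topic mixture P(z | d):
P(\mathbf{w}_d, \mathbf{c}_d \mid d) =
  \prod_{n=1}^{N_d} \sum_{z} P(z \mid d)\, P(w_n \mid z)
  \;\cdot\;
  \prod_{j=1}^{L_d} \sum_{z} P(z \mid d)\, P(c_j \mid z)
```

Because both products share P(z | d), a document's citations influence the topics inferred for its words and vice versa.
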
Hyperlink modeling using PLSA [Cohn and Hofmann, NIPS, 2001]
• Heuristic: weight the two parts of the log-likelihood by α and (1 − α), with 0 ≤ α ≤ 1
  – α determines the relative importance of content and hyperlinks

Hyperlink modeling using PLSA [Cohn and Hofmann, NIPS, 2001]
• Experiments: text classification
• Datasets:
  – WebKB
    • 6000 CS dept web pages with hyperlinks
    • 6 classes: faculty, course, student, staff, etc.
  – Cora
    • 2000 machine learning abstracts with citations
    • 7 classes: sub-areas of machine learning
• Methodology:
  – Learn the model on the complete data and obtain θ_d for each document
  – Test documents are classified into the label of the nearest neighbor in the training set
  – Distance measured as cosine similarity in the θ space
  – Measure the performance as a function of α

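The nearest-neighbor step of this methodology can be sketched as follows, assuming each document has already been reduced to a topic-mixture vector. The vectors, labels, and function names below are hypothetical:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_neighbor_label(test_theta, train_thetas, train_labels):
    """Return the label of the training document most similar to test_theta."""
    best = max(range(len(train_thetas)),
               key=lambda i: cosine_similarity(test_theta, train_thetas[i]))
    return train_labels[best]

# Hypothetical topic mixtures over 3 topics.
train_thetas = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]
train_labels = ["faculty", "course", "student"]
predicted = nearest_neighbor_label([0.2, 0.7, 0.1], train_thetas, train_labels)
```

Repeating this classification for models trained with different weights on content versus hyperlinks gives the performance curve described in the methodology.
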
Hyperlink modeling using PLSA [Cohn and Hofmann, NIPS, 2001]
• Classification performance (two plots, each with an axis running from “Hyperlink” to “content”)

Hyperlink modeling using LDA

Hyperlink modeling using LDA [Erosheva, Fienberg, Lafferty, PNAS, 2004]
• For each document d = 1, …, M
  – Generate θ_d ~ Dir(· | α)
  – For each position n = 1, …, N_d
    • Generate z_n ~ Mult(· | θ_d)
    • Generate w_n ~ Mult(· | z_n)
  – For each citation j = 1, …, L_d
    • Generate z_j ~ Mult(· | θ_d)
    • Generate c_j ~ Mult(· | z_j)
• Learning using variational EM

Author-Topic Model for Scientific Literature

Author-Topic Model for Scientific Literature [Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI, 2004]
• For each author a = 1, …, A
  – Generate θ_a ~ Dir(· | α)
• For each topic k = 1, …, K
  – Generate φ_k ~ Dir(· | β)
• For each document d = 1, …, M
  – For each position n = 1, …, N_d
    • Generate author x ~ Unif(· | a_d), where a_d is the set of authors of document d
    • Generate z_n ~ Mult(· | θ_x)
    • Generate w_n ~ Mult(· | φ_{z_n})

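The per-position part of the author-topic process (pick an author, then a topic from that author's mixture, then a word from that topic) can be sketched as below. For brevity the author-topic and topic-word tables are fixed by hand rather than drawn from Dirichlet priors; all names and numbers are hypothetical:

```python
import random

def generate_author_topic_doc(doc_authors, doc_len, author_topic, topic_word, rng):
    """Generate one document as a list of (author, topic, word) triples."""
    tokens = []
    for _ in range(doc_len):
        x = rng.choice(doc_authors)  # uniform over the document's authors
        z = rng.choices(range(len(author_topic[x])), weights=author_topic[x], k=1)[0]
        w = rng.choices(range(len(topic_word[z])), weights=topic_word[z], k=1)[0]
        tokens.append((x, z, w))
    return tokens

rng = random.Random(0)
author_topic = {"alice": [0.9, 0.1], "bob": [0.2, 0.8]}   # hypothetical per-author topic mixtures
topic_word = [[0.7, 0.3, 0.0], [0.0, 0.2, 0.8]]            # hypothetical per-topic word distributions
doc = generate_author_topic_doc(["alice", "bob"], 6, author_topic, topic_word, rng)
```

The key difference from LDA is that the topic mixture belongs to an author rather than to a document, so a multi-author paper blends its authors' interests word by word.
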
Author-Topic Model for Scientific Literature [Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI, 2004]
• Learning: Gibbs sampling

Author-Topic Model for Scientific Literature [Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI, 2004]
• Perplexity results

Author-Topic Model for Scientific Literature [Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI, 2004]
• Topic-Author visualization

Author-Topic Model for Scientific Literature [Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI, 2004]
• Application 1: Author similarity

Author-Topic Model for Scientific Literature [Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI, 2004]
• Application 2: Author entropy

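Assuming "author entropy" refers to the entropy of an author's topic distribution (the slide gives only the heading), it can be computed as in this short sketch. The two example distributions are hypothetical:

```python
import math

def entropy(probs):
    """Shannon entropy in bits; zero-probability entries contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

focused = [0.97, 0.01, 0.01, 0.01]   # hypothetical author who writes on one topic
broad = [0.25, 0.25, 0.25, 0.25]     # hypothetical author spread evenly over four topics
focused_entropy = entropy(focused)
broad_entropy = entropy(broad)
```

Under this reading, low entropy marks a specialist and high entropy marks an author whose work spans many topics.
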
Author-Topic-Recipient model for email data [McCallum, Corrada-Emmanuel, Wang, IJCAI’05]

Author-Topic-Recipient model for email data [McCallum, Corrada-Emmanuel, Wang, IJCAI’05]
• Gibbs sampling

Author-Topic-Recipient model for email data [McCallum, Corrada-Emmanuel, Wang, IJCAI’05]
• Datasets
  – Enron email data
    • 23,488 messages between 147 users
  – McCallum’s personal email
    • 23,488(?) messages with 128 authors

Author-Topic-Recipient model for email data [McCallum, Corrada-Emmanuel, Wang, IJCAI’05]
• Topic visualization: Enron set

Author-Topic-Recipient model for email data [McCallum, Corrada-Emmanuel, Wang, IJCAI’05]
• Topic visualization: McCallum’s data

Modeling Citation Influences

Modeling Citation Influences [Dietz, Bickel, Scheffer, ICML 2007]
• Copycat model

Modeling Citation Influences [Dietz, Bickel, Scheffer, ICML 2007]
• Citation influence model

Modeling Citation Influences [Dietz, Bickel, Scheffer, ICML 2007]
• Citation influence graph for LDA paper

Modeling Citation Influences [Dietz, Bickel, Scheffer, ICML 2007]
• Words in LDA paper assigned to citations

Modeling Citation Influences [Dietz, Bickel, Scheffer, ICML 2007]
• Performance evaluation
  – Data:
    • 22 seed papers and 132 cited papers
    • Users labeled citations on a scale of 1-4
  – Models considered:
    • Citation influence model
    • Copycat model
    • LDA-JS-divergence
      – Symmetric divergence in topic space
    • LDA-post
    • PageRank
    • TF-IDF
  – Evaluation measure:
    • Area under the ROC curve

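Two quantities named in this evaluation, the Jensen-Shannon divergence between topic distributions and the area under the ROC curve, can be sketched as below. The slide does not say how the 1-4 user labels were turned into positives and negatives, so the AUC function simply assumes binary labels; all example values are hypothetical:

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats, skipping terms where p is zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, and finite even when p or q has zeros."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def roc_auc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical values: two disjoint topic distributions, and four scored citations.
disjoint_js = js_divergence([1.0, 0.0], [0.0, 1.0])
auc = roc_auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```

AUC depends only on how a model ranks citations, which makes it suitable for comparing models whose scores live on different scales, such as divergences, PageRank values, and TF-IDF similarities.
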
Modeling Citation Influences [Dietz, Bickel, Scheffer, ICML 2007]
• Results

Mixed membership Stochastic Block models [Work In Progress]
• A complete generative model for text and citations
• Can model the topicality of citations
  – Topic-specific PageRank
• Can also predict citations between unseen documents

Summary
• Topic modeling is an interesting, new framework for community analysis
  – Sound theoretical basis
  – Completely unsupervised
  – Simultaneous modeling of multiple fields
  – Discovers “soft” communities and clusters in terms of “topic” membership
  – Can also be used for predictive purposes