Modeling Documents
Amruta Joshi
Department of Computer Science, Stanford University
6th June 2005
Research in Algorithms for the Internet

Outline
- Topic Models
  - Topic Extraction
  - Author Information
  - Modeling Topics
  - Modeling Authors
  - Author-Topic Model
  - Inference
- Integrating topics and syntax
  - Probabilistic Models
  - Composite Model
  - Inference

Motivation
- Identifying content of a document
- Identifying its latent structure
- More specifically
  - Given a collection of documents, we want to create a model to collect information about
    - Authors
    - Topics
    - Syntactic constructs

Topics & Authors
- Why model topics?
  - Observe topic trends
  - See how documents relate to one another
  - Tagging abstracts
- Why model authors' interests?
  - Identifying what an author writes about
  - Identifying authors with similar interests
  - Authorship attribution
  - Creating reviewer lists
  - Finding unusual work by an author

Topic Extraction: Overview
- Supervised learning techniques
  - Learn from labeled document collection
  - But: unlabeled documents, rapidly changing fields (Yang 1998)
[Figure: example text labeled "rivers": "In floods, the banks of a river overflow"]

Topic Extraction: Overview
- Dimensionality reduction
  - Represent documents in vector space of terms
  - Map to low dimensionality
    - Non-linear dimensionality reduction: WEBSOM (Lagus et al. 1999)
    - Linear projection: LSI (Berry, Dumais, O'Brien 1995)
  - Regions represent topics

Topic Extraction: Overview
- Cluster documents on semantic content
  - Typically, each cluster has just 1 topic
- Aspect Model
  - Topic modeled as distribution over words
  - Documents generated from multiple topics

Author Information: Overview
- Analyzing text using
  - Stylometry
    - Statistical analysis using literary style, frequency of word usage, etc.
  - Semantics
    - Content of document
Example text: "As doth the lion in the Capitol, A man no mightier than thyself or me …"

Author Information: Overview
- Graph-based models
  - Build interactive ReferralWeb using citations
    - Kautz, Selman, Shah 1997
  - Build co-author graphs
    - White & Smith
    - PageRank for analysis
[Figure: graph over documents D1, D2, D3, D4]

The Big Idea
- Topic Model
  - Model topics as distribution over words
- Author Model
  - Model author as distribution over words
- Author-Topic Model
  - Probabilistic model for both
  - Model topics as distribution over words
  - Model authors as distribution over topics

Bayesian Networks
- Nodes = random variables
- Edges = direct probabilistic influence
- Topology captures independence: XRay is conditionally independent of Pneumonia given Infiltrates
[Figure: network with nodes Pneumonia, Tuberculosis, Lung Infiltrates, XRay, Sputum Smear]
Slide Credit: Lise Getoor, UMD College Park

Bayesian Networks
[Figure: network with nodes Pneumonia (P), Tuberculosis (T), Lung Infiltrates (I), XRay, Sputum Smear]

Conditional probability table for Lung Infiltrates:

  P    T    P(i | P, T)   P(¬i | P, T)
  p    t    0.7           0.3
  p    ¬t   0.6           0.4
  ¬p   t    0.2           0.8
  ¬p   ¬t   0.01          0.99

- Associated with each node Xi there is a conditional probability distribution P(Xi | Pai): a distribution over Xi for each assignment to its parents
  - If variables are discrete, P is usually multinomial
  - P can be linear Gaussian, mixture of Gaussians, …
Slide Credit: Lise Getoor, UMD College Park
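A minimal sketch of how the conditional probability table above can be used, assuming its four rows correspond to the parent assignments (p, t), (p, ¬t), (¬p, t), (¬p, ¬t). The priors `P_PNEUMONIA` and `P_TUBERCULOSIS` are hypothetical values chosen only for illustration; the slides give no priors.

```python
from itertools import product

# P(Infiltrates = true | Pneumonia, Tuberculosis), values from the slide's table.
# Assumption: rows are ordered (p, t), (p, not t), (not p, t), (not p, not t).
CPT_INFILTRATES = {
    (True, True): 0.7,
    (True, False): 0.6,
    (False, True): 0.2,
    (False, False): 0.01,
}

# Hypothetical priors, NOT taken from the slides.
P_PNEUMONIA = 0.05
P_TUBERCULOSIS = 0.01


def prior(value, p_true):
    """Probability of a binary root variable taking the given value."""
    return p_true if value else 1.0 - p_true


def joint(p, t, i):
    """P(P=p, T=t, I=i) via the network factorization P(P) P(T) P(I | P, T)."""
    p_i = CPT_INFILTRATES[(p, t)]
    return prior(p, P_PNEUMONIA) * prior(t, P_TUBERCULOSIS) * (p_i if i else 1.0 - p_i)


def marginal_infiltrates():
    """P(I = true), summing the joint over both parents."""
    return sum(joint(p, t, True) for p, t in product([True, False], repeat=2))


def posterior_pneumonia_given_infiltrates():
    """P(P = true | I = true) by Bayes' rule."""
    numerator = sum(joint(True, t, True) for t in (True, False))
    return numerator / marginal_infiltrates()
```

With these assumed priors, observing infiltrates raises the probability of pneumonia well above its prior, which is the kind of query the network structure makes cheap to answer.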

BN Learning
- BN models can be learned from empirical data
  - Parameter estimation via numerical optimization
  - Structure learning via combinatorial search
[Figure: Data → Inducer → network over P, T, I, X, S]
Slide Credit: Lise Getoor, UMD College Park

Generative Model
- Probabilistic generative process: mixture components, mixture weights
- Statistical inference
- Bayesian approach: use priors
  - Mixture weights ~ Dirichlet(α)
  - Mixture components ~ Dirichlet(β)
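The generative process with Dirichlet priors can be sketched as follows. This is an illustrative sketch, not the presenter's code: the function names, the toy vocabulary, and the hyperparameter values are assumptions, and the Dirichlet draw uses the standard normalized-Gamma construction from the Python standard library.

```python
import random

# Hypothetical toy vocabulary, borrowed from the river/bank example later in the deck.
VOCAB = ["river", "stream", "bank", "money", "loan"]


def sample_dirichlet(concentration, size, rng):
    """Draw from a symmetric Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def generate_corpus(num_docs, doc_len, vocab, num_topics, alpha, beta, seed=0):
    """Generate documents from a topic model with Dirichlet priors."""
    rng = random.Random(seed)
    # Mixture components: one distribution over words per topic ~ Dirichlet(beta).
    phi = [sample_dirichlet(beta, len(vocab), rng) for _ in range(num_topics)]
    docs = []
    for _ in range(num_docs):
        # Mixture weights: this document's distribution over topics ~ Dirichlet(alpha).
        theta = sample_dirichlet(alpha, num_topics, rng)
        words = []
        for _ in range(doc_len):
            z = rng.choices(range(num_topics), weights=theta)[0]  # pick a topic
            words.append(rng.choices(vocab, weights=phi[z])[0])   # pick a word from it
        docs.append(words)
    return phi, docs
```

Statistical inference runs this process in reverse: given only the generated words, estimate the mixture weights and mixture components.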

Bayesian Network for modeling document generation
[Figure: document node Doc 1; topic nodes T1, T2, …, TT; word nodes w1, w2, …, wV]

Topic Model: Plate Notation
[Figure: plate diagram with the following labels]
- Document-specific distribution over topics
- Topic distribution over words
- Topic: z
- Word: w
- Plates: T (topics), Nd (words in document), D (documents)

Topic Model: Geometric Representation

Modeling Authors with words
[Figure: plate diagram with the following labels]
- ad: authors of the document
- Uniform distribution over authors of doc
- Author: x
- Distribution of authors over words
- Word: w
- Plates: A (authors), Nd (words in document), D (documents)

Author-Topic Model
[Figure: plate diagram with the following labels]
- ad: authors of the document
- Uniform distribution over the authors of the document
- Author: x
- Distribution of authors over topics
- Topic: z
- Topic distribution over words
- Word: w
- Plates: A (authors), T (topics), Nd (words in document), D (documents)
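Read as a generative process, the diagram says: for each word, pick an author uniformly from the document's authors, pick a topic from that author's distribution over topics, then pick a word from that topic's distribution over words. A minimal sketch follows; the author names and all probability values are hypothetical.

```python
import random

VOCAB = ["river", "stream", "bank", "money", "loan"]

# Hypothetical parameters for illustration only.
# PHI[z]: topic z's distribution over words (aligned with VOCAB).
PHI = [
    [0.4, 0.4, 0.2, 0.0, 0.0],  # topic 0: rivers
    [0.0, 0.0, 0.2, 0.4, 0.4],  # topic 1: finance
]
# THETA[a]: author a's distribution over topics.
THETA = {"alice": [1.0, 0.0], "bob": [0.0, 1.0]}


def generate_document(authors_of_doc, theta, phi, vocab, doc_len, rng):
    """Generate (author, topic, word) triples for one document."""
    tokens = []
    for _ in range(doc_len):
        x = rng.choice(authors_of_doc)                          # author, uniform over a_d
        z = rng.choices(range(len(phi)), weights=theta[x])[0]   # topic from author x
        w = rng.choices(vocab, weights=phi[z])[0]               # word from topic z
        tokens.append((x, z, w))
    return tokens
```

A call such as `generate_document(["alice", "bob"], THETA, PHI, VOCAB, 8, random.Random(0))` yields a document whose words mix both authors' interests.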

Inference
- Expectation Maximization
  - But poor results (local maxima)
- Gibbs Sampling
  - Parameters: θ, φ
  - Start with initial random assignment
  - Update parameter using other parameters
  - Converges after 'n' iterations
  - Burn-in time

Inference and Learning for Documents
[Equation: sampling equation with the following annotations]
- Left-hand side: probability that the i-th word token is assigned to topic j, keeping other topic assignments unchanged
- C^WT_mj: number of times word m is assigned to topic j
- C^DT_dj: number of times topic j has occurred in document d
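The equation itself did not survive on this slide. For reference, the standard collapsed Gibbs sampling update for the topic model, as given in Griffiths & Steyvers (2004), has the form below. Here α and β are the Dirichlet hyperparameters, V is the vocabulary size, T is the number of topics, and both count matrices exclude the current token i. Whether the slide used exactly this notation is an assumption.

```latex
P(z_i = j \mid \mathbf{z}_{-i},\, w_i = m,\, d_i = d) \;\propto\;
\frac{C^{WT}_{mj} + \beta}{\sum_{m'} C^{WT}_{m'j} + V\beta}
\cdot
\frac{C^{DT}_{dj} + \alpha}{\sum_{j'} C^{DT}_{dj'} + T\alpha}
```

The first factor favors topics that already account for word m; the second favors topics already prominent in document d.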

Matrix Factorization

Topic Model: Inference
[Figure: documents over the words River, Stream, Bank, Money, Loan]
- Can we recover the original topics and topic mixtures from this data?
Slide Credit: Padhraic Smyth, UC Irvine

Example of Gibbs Sampling
- Assign word tokens randomly to topics (two marker colors: topic 1 and topic 2)
[Figure: topic assignments for River, Stream, Bank, Money, Loan]
Slide Credit: Padhraic Smyth, UC Irvine

After 1 iteration
- Apply sampling equation to each word token
[Figure: topic assignments for River, Stream, Bank, Money, Loan]
Slide Credit: Padhraic Smyth, UC Irvine

After 4 iterations
[Figure: topic assignments for River, Stream, Bank, Money, Loan]
Slide Credit: Padhraic Smyth, UC Irvine

After 32 iterations
[Figure: topic assignments; one topic covers River, Stream, Bank and the other covers Money, Loan]
Slide Credit: Padhraic Smyth, UC Irvine
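The procedure walked through above (random initial assignment, then repeated resampling of each token's topic) can be sketched as a small collapsed Gibbs sampler. This is an illustrative implementation of the standard update, not the code behind the slides; the toy corpus `DOCS` and the hyperparameter values in the usage note are assumptions.

```python
import random

VOCAB = ["river", "stream", "bank", "money", "loan"]
# Hypothetical toy corpus: each document is a list of word ids into VOCAB.
DOCS = [
    [0, 1, 2, 0, 1, 2],
    [3, 4, 2, 3, 4, 2],
    [0, 1, 2, 3, 4, 2],
]


def gibbs_lda(docs, num_topics, vocab_size, alpha, beta, iterations, seed=0):
    """Collapsed Gibbs sampling for a topic model; returns assignments and counts."""
    rng = random.Random(seed)
    wt = [[0] * num_topics for _ in range(vocab_size)]  # word-topic counts
    dt = [[0] * num_topics for _ in docs]               # document-topic counts
    topic_totals = [0] * num_topics                     # tokens per topic

    def update(w, d, j, delta):
        wt[w][j] += delta
        dt[d][j] += delta
        topic_totals[j] += delta

    # Assign word tokens randomly to topics.
    z = []
    for d, doc in enumerate(docs):
        z.append([rng.randrange(num_topics) for _ in doc])
        for w, j in zip(doc, z[d]):
            update(w, d, j, +1)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                update(w, d, z[d][i], -1)  # remove the current token from the counts
                # Unnormalized sampling weights; the document-side denominator is
                # the same for every topic, so it is omitted.
                weights = [
                    (wt[w][k] + beta) / (topic_totals[k] + vocab_size * beta)
                    * (dt[d][k] + alpha)
                    for k in range(num_topics)
                ]
                z[d][i] = rng.choices(range(num_topics), weights=weights)[0]
                update(w, d, z[d][i], +1)
    return z, wt, dt
```

For example, `gibbs_lda(DOCS, 2, len(VOCAB), alpha=0.5, beta=0.1, iterations=32)` returns the final assignments together with the two count matrices. Because the sampler is stochastic, how cleanly the two topics separate depends on the seed and the number of iterations.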

Results
- Tested on scientific papers (V: vocabulary size, D: documents, K: authors)
  - NIPS dataset
    - V = 13,649
    - D = 1,740
    - K = 2,037
    - #Topics = 100
    - #tokens = 2,301,375
  - CiteSeer dataset
    - V = 30,799
    - D = 162,489
    - K = 85,465
    - #Topics = 300
    - #tokens = 11,685,514

Evaluating Predictive Power
- Perplexity
  - Indicates ability to predict words on new unseen documents
  - Lower is better
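Perplexity is the exponential of the negative average log-probability per word token. The sketch below computes it for a topic model, assuming the topic weights `theta` for each test document are already available; how they are estimated for unseen documents is not covered on the slide, and all names here are illustrative.

```python
import math


def perplexity(docs, theta, phi):
    """exp(-(total log-likelihood) / (number of tokens)) under a topic model.

    docs:  list of documents, each a list of word ids
    theta: theta[d][z] = weight of topic z in document d
    phi:   phi[z][w]   = probability of word w under topic z
    """
    log_likelihood = 0.0
    n_tokens = 0
    for d, doc in enumerate(docs):
        for w in doc:
            # Word probability is a mixture over topics.
            p_w = sum(theta[d][z] * phi[z][w] for z in range(len(phi)))
            log_likelihood += math.log(p_w)
            n_tokens += 1
    return math.exp(-log_likelihood / n_tokens)
```

As a sanity check, a model that spreads probability uniformly over a 5-word vocabulary has perplexity 5: it is as uncertain as a fair 5-way choice at every token.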

Results: Perplexity

Recap
- First
  - Author Model
  - Topic Model
- Then
  - Author-Topic Model
- Next…
  - Integrating Topics & Syntax

Integrating topics & syntax
- Probabilistic Models
  - Short-range dependencies
    - Syntactic constraints
    - Represented as distinct syntactic classes
    - HMM, probabilistic CFGs
  - Long-range dependencies
    - Semantic constraints
    - Represented as probabilistic distribution
    - Bayes model, topic model
- New idea! Use both

How to integrate these?
- Mixture of Models
  - Each word exhibits either short-range or long-range dependencies
- Product of Models
  - Each word exhibits both short-range and long-range dependencies
- Composite Model
  - Asymmetric
  - All words exhibit short-range dependencies
  - Subset of words exhibit long-range dependencies

The Composite Model 1
- Capturing asymmetry
  - Replace probability distribution over words with semantic model
  - Syntactic model chooses when to emit content word
  - Semantic model chooses which word to emit
- Methods
  - Syntactic component is HMM
  - Semantic component is topic model

Generating phrases
[Figure: composite model with transition probabilities between classes]
- Syntactic classes (word lists): "in with for on ...", "used trained obtained described ..."
- Topics (word lists): "network neural output networks ...", "images objects ...", "kernel support svm vector ..."
- Example generated phrases:
  - network used for images
  - image obtained with kernel
  - output described with objects
  - neural network trained with svm images

The Composite Model 2 (Graphical)
[Figure: graphical model with the following labels]
- Doc's distribution over topics
- Topics: z1, z2, z3, z4
- Words: w1, w2, w3, w4
- Classes: c1, c2, c3, c4

The Composite Model 3
- θ(d): document's distribution over topics
- Transitions between classes c(i-1) and c(i) follow distribution π(c(i-1))
- A document is generated as:
  - For each word w(i) in document d
    - Draw z(i) from θ(d)
    - Draw c(i) from π(c(i-1))
    - If c(i) = 1, then draw w(i) from φ(z(i)),
    - else draw w(i) from φ(c(i))
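The generation steps above can be sketched directly in code. Class 1 is the semantic class, as on the slide; the start class 0, the transition matrix `PI`, the topic weights `THETA_D`, and the uniform word distributions are all hypothetical values chosen for illustration, with word lists borrowed from the "Generating phrases" slide.

```python
import random


def uniform(words):
    """Helper: uniform distribution over a list of words (illustrative only)."""
    return {w: 1.0 / len(words) for w in words}


# Topic word distributions, used only when the class is 1 (semantic).
PHI_TOPICS = [
    uniform(["network", "neural", "output", "networks"]),
    uniform(["images", "objects"]),
    uniform(["kernel", "support", "svm", "vector"]),
]
# Word distributions for the purely syntactic classes.
PHI_CLASSES = {
    2: uniform(["in", "with", "for", "on"]),
    3: uniform(["used", "trained", "obtained", "described"]),
}
# PI[c_prev][c]: hypothetical class transition probabilities; class 0 is a start state.
PI = [
    [0.0, 0.8, 0.1, 0.1],
    [0.0, 0.3, 0.2, 0.5],
    [0.0, 0.9, 0.0, 0.1],
    [0.0, 0.3, 0.7, 0.0],
]
THETA_D = [0.5, 0.4, 0.1]  # hypothetical topic weights for one document


def generate_composite(doc_len, theta_d, pi, phi_topics, phi_classes, rng):
    """Generate (class, topic, word) triples following the slide's three steps."""
    tokens = []
    c_prev = 0
    for _ in range(doc_len):
        z = rng.choices(range(len(theta_d)), weights=theta_d)[0]       # z_i ~ theta(d)
        c = rng.choices(range(len(pi[c_prev])), weights=pi[c_prev])[0]  # c_i ~ pi(c_{i-1})
        dist = phi_topics[z] if c == 1 else phi_classes[c]              # semantic or syntactic
        w = rng.choices(list(dist), weights=list(dist.values()))[0]
        tokens.append((c, z, w))
        c_prev = c
    return tokens
```

Note that a topic is drawn for every position but only affects the output when the HMM selects the semantic class, which is the asymmetry the composite model is designed to capture.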

Results
- Tested on
  - Brown corpus (tagged with word types)
  - Concatenated Brown & TASA corpus
- HMM & Topic Model
  - 20 classes
    - Start/end markers class + 19 classes
  - T = 200

Results
- Identifying syntactic classes & semantic topics
  - Clean separation observed
- Identifying function words & content words
  - "control": plain verb (syntax) or semantic word
- Part-of-speech tagging
  - Identifying syntactic class
- Document classification
  - Brown corpus: 500 docs => 15 groups
  - Results similar to plain topic model

Extensions to Topic Model
- Integrating link information (Cohn, Hofmann 2001)
- Learning topic hierarchies
- Integrating syntax & topics
- Integrate authorship info with content (author-topic model)
- Grade-of-membership models
- Random sentence generation

Conclusion
- Identifying a document's latent structure
- Document content is modeled for
  - Semantic associations: topic model
  - Authorship: author-topic model
  - Syntactic constructs: HMM

Acknowledgements
- Prof. Rajeev Motwani
  - Advice and guidance regarding topic selection
- T. K. Satish Kumar
  - Help on probabilistic models

Thank you!

References
- Primary
  - Steyvers, M., Smyth, P., Rosen-Zvi, M., & Griffiths, T. (2004). Probabilistic Author-Topic Models for Information Discovery. The Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Seattle, Washington.
  - Steyvers, M., & Griffiths, T. Probabilistic topic models. (http://psiexp.ss.uci.edu/research/papers/SteyversGriffithsLSABookFormatted.pdf)
  - Rosen-Zvi, M., Griffiths, T., Steyvers, M., & Smyth, P. (2004). The Author-Topic Model for Authors and Documents. In 20th Conference on Uncertainty in Artificial Intelligence. Banff, Canada.
  - Griffiths, T. L., Steyvers, M., Blei, D. M., & Tenenbaum, J. B. (in press). Integrating Topics and Syntax. In: Advances in Neural Information Processing Systems, 17.
  - Griffiths, T., & Steyvers, M. (2004). Finding Scientific Topics. Proceedings of the National Academy of Sciences, 101 (suppl. 1), 5228-5235.