iTopicModel: Information Network Integrated Topic Modeling
Yizhou Sun, Jiawei Han, Jing Gao and Yintao Yu
Department of Computer Science
University of Illinois at Urbana-Champaign
12/9/2009
Outline
- Background and motivation
- Related work
- Modeling
  - iTopicModel building and parameter estimation
  - Discussions of MRF modeling on network structure
- Practical issues
  - Decide topic number
  - Build topic hierarchies
  - Correlations of text and network
- Experiments
- Conclusion
Background
- Topic modeling
  - Documents are assumed to be independent of each other
Background (Cont.)
- Document networks in the real world: documents integrated with information networks
  - Papers are linked via citations
  - Webpages are linked via hyperlinks
  - Blog users are linked via friendship relations
  - ...
Motivation
- Goal: use document networks to improve the quality of topic models
- Why and how can links in a document network help build better topic models?
  - Text information from neighbors is utilized
    - Extends the co-occurrences among words
    - Extremely useful for short text documents
  - Derives topic models consistent with the current document network
    - Neighbors should have similar topic distributions
  - Determines the number of topics
    - The network reveals the structure
How Do Existing Topic Models Deal with Links?
- Traditional topic models: PLSA and LDA
  - Do not consider links between documents
- Author-Topic Model
  - Considers links between authors and papers; the model depends on that particular domain
- NetPLSA
  - Considers links between documents and treats the network as a regularization constraint
  - Only works for undirected networks
- Relational Topic Model (RTM)
  - Models how each link is generated based on the topic distributions
  - Only works for binary networks; tries to predict links based purely on topic information
Why Not Pure Network Clustering?
- Network clustering or graph partitioning algorithms do not use text information at all
  - The resulting clusters are difficult to interpret
  - Cluster quality is not as good as with topic models, since less information is used
  - The network itself may be disconnected and tends to yield clusters formed by such outliers
    - E.g., a co-author network
Our Method
- Builds a unified generative model for links (structure) and text (content)
- Works for directed/undirected, weighted/unweighted document networks
Model Set-Up
- Graphical model for iTopicModel
  - θ_i = (θ_i1, θ_i2, ..., θ_iT): topic distribution for document x_i
  - Structural layer: follows the same topology as the document network
  - Text layer: follows PLSA, i.e., for each word, pick a topic z ~ Multi(θ_i), then pick a word w ~ Multi(β_z)
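As a concrete illustration of the text layer, here is a minimal generative sketch (the function name, array shapes, and the use of NumPy are our own; this is not the authors' code):

```python
import numpy as np

def generate_document(theta_i, beta, doc_length, rng=None):
    """Text layer only: for each word position, draw a topic z ~ Multi(theta_i),
    then a word w ~ Multi(beta[z])."""
    rng = np.random.default_rng() if rng is None else rng
    words = []
    for _ in range(doc_length):
        z = rng.choice(len(theta_i), p=theta_i)   # topic from the document's topic distribution
        w = rng.choice(beta.shape[1], p=beta[z])  # word from that topic's word distribution
        words.append(w)
    return words
```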
Objective Function
- Objective function: the joint probability of the text and the topic configuration given the network
  - Structure part: P(θ | G)
  - Text part: P(X | θ, β)
  - The two parts can be modeled separately!
- X: observed text information; G: document network
- Parameters
  - θ: topic distributions; β: word distributions
  - θ is the most critical: it needs to be consistent with the text as well as with the network structure
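Spelled out in this notation, the objective as we read the two-part factorization above is:

```latex
\log P(X, \theta \mid G, \beta)
  \;=\; \underbrace{\log P(\theta \mid G)}_{\text{structure part}}
  \;+\; \underbrace{\log P(X \mid \theta, \beta)}_{\text{text part}}
```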
I. How to Model the Structure Part?
- Joint distribution P(θ | G): what is the probability of a topic configuration θ given the current network G?
  - Needs a global definition
- The dilemma of global vs. local definitions
  - Computation-wise, a global definition must be given
  - Semantics-wise, heuristics can only give local definitions P(θ_i | θ_N(i))
    - If we know the configurations of a node's neighbors, it is natural to specify the probability P(θ_i | θ_N(i))
The Bridge: MRF
- A Markov Random Field (MRF) can connect the two definitions!
- What is an MRF?
  - Given a graph G with each node i associated with a random variable F_i, F is said to be an MRF if
    - p(f) > 0
    - p(f_i | f_-i) = p(f_i | f_N(i))  (local Markovianity property)
  - An MRF can be factorized into the global form p(f) = (1/Z) exp{-Σ_c V_c(f)}
    - Z: partition function, which can be viewed as a normalization constant
    - c: a clique in the graph
    - V_c(f): potential function for clique c
Local Probability Definition: Heuristics
- In our case, we build a multivariate MRF over θ
- Heuristics for the local definition
  - A document's topic distribution θ_i should be closely related to those of its neighbors, especially its out-neighbors
  - The expected value of θ_i should be close to the weighted mean of its neighbors
  - The stronger the links from a document to its neighbors, the more we can trust the neighbors (their mean), i.e., the higher the probability mass placed around that mean
Local Probability Definition: Formula
- Model the heuristics using the Dirichlet distribution
  - Use the neighbors to construct a Dirichlet parameter for each document x_i
  - Define the local probability P(θ_i | θ_N(i)) as a Dirichlet distribution with that parameter
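The slide's exact parameterization is not reproduced in this text. A minimal sketch of one construction consistent with the three heuristics (the pseudo-count form below is an assumption, not the paper's formula):

```python
import numpy as np

def dirichlet_param(theta, out_neighbors, weights, i):
    """Hypothetical Dirichlet parameter for document i built from its out-neighbors.

    theta         : (N, T) current topic distributions
    out_neighbors : dict mapping i -> list of out-neighbor indices
    weights       : dict mapping (i, j) -> link strength w_ij
    """
    T = theta.shape[1]
    alpha = np.ones(T)  # all-ones base: a uniform prior when a document has no neighbors
    for j in out_neighbors.get(i, []):
        # E[theta_i] = alpha / alpha.sum() moves toward the weighted mean of the neighbors,
        # and the precision alpha.sum() grows with the total link strength.
        alpha += weights[(i, j)] * theta[j]
    return alpha
```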
Check Heuristics (1)
- A document's topic distribution θ_i should be closely related to those of its neighbors, especially its out-neighbors
  - Done: satisfied by the construction above
Check Heuristics (2)
- The expected value of θ_i should be close to the weighted mean of its neighbors
  - If all entries are set to 1, a uniform prior is added
Check Heuristics (3)
- The stronger the links from a document to its neighbors, the more we can trust the neighbors (their mean), i.e., the higher the probability mass placed around that mean
  - The precision of the Dirichlet distribution tells how concentrated a configuration is around its mean; it is the sum of the Dirichlet parameter entries
Example of Precision
- Beta(2, 2) vs. Beta(50, 50)
  - The Beta distribution is a two-dimensional Dirichlet distribution: p = (p1, p2) ~ Beta(α, β), with p1 + p2 = 1
  - [Figure: the two densities f(p1, p2) plotted over p1]
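A quick numeric check of the same comparison (illustrative only; uses SciPy):

```python
from scipy.stats import beta

for a, b in [(2, 2), (50, 50)]:
    d = beta(a, b)
    # Both have mean 0.5, but the precision a + b controls how tightly
    # the density concentrates around that mean.
    print(f"Beta({a},{b}): mean={d.mean():.2f}, std={d.std():.3f}, precision={a + b}")
```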
Global Probability Definition
- Give the global definition corresponding to the local definition
  - Cliques use only single nodes and links
  - Potential functions for larger cliques are set to 0
- Potential function and joint distribution then follow the MRF factorization p(θ | G) = (1/Z) exp{-Σ_c V_c(θ)}
Local and Global Definition Equivalence
- For a local structure, the global definition recovers the local Dirichlet conditional P(θ_i | θ_N(i)) defined above
II. How to Model the Text Part?
- P(X | θ, β)
  - Each document is conditionally independent of the others given the current structure
  - Each document is modeled as in PLSA
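Written out in the standard PLSA form implied above (our notation; c(i, w) is the count of word w in document x_i):

```latex
P(X \mid \theta, \beta)
  \;=\; \prod_{i} \prod_{w} \Big( \sum_{z=1}^{T} \theta_{iz}\, \beta_{zw} \Big)^{c(i,w)}
```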
Parameter Estimation
- Objective: find the parameters θ, β that maximize the log-likelihood of the joint probability
- Approximate inference using EM
  - Structure part
  - Text part
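A minimal sketch of what one EM iteration could look like if the network enters only through Dirichlet pseudo-counts on θ (the MAP-style θ update, the variable names, and the dense-array layout are all assumptions, not the paper's derivation):

```python
import numpy as np

def em_step(counts, theta, beta_, alpha):
    """One EM iteration for a PLSA-style text layer with a network-derived
    Dirichlet prior on theta (a MAP-style sketch, not the paper's exact updates).

    counts : (N, V) word counts per document
    theta  : (N, T) current topic distributions
    beta_  : (T, V) current topic-word distributions
    alpha  : (N, T) Dirichlet pseudo-counts built from each document's neighbors
    """
    # E-step: responsibility of topic z for word w in document i,
    # p(z | i, w) proportional to theta[i, z] * beta_[z, w]
    resp = theta[:, :, None] * beta_[None, :, :]            # (N, T, V)
    resp /= resp.sum(axis=1, keepdims=True) + 1e-12

    # Expected topic-word counts
    nz = counts[:, None, :] * resp                          # (N, T, V)

    # M-step for beta: standard PLSA update
    new_beta = nz.sum(axis=0)                               # (T, V)
    new_beta /= new_beta.sum(axis=1, keepdims=True)

    # M-step for theta: text counts plus (alpha - 1) pseudo-counts from the network
    new_theta = np.clip(nz.sum(axis=2) + alpha - 1.0, 1e-12, None)   # (N, T)
    new_theta /= new_theta.sum(axis=1, keepdims=True)

    return new_theta, new_beta
```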
Discussions of MRF Modeling on Network Structure
- Can we define other MRFs on the network structure?
  - Yes; NetPLSA is a special case
    - Local definition
    - Global definition
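For intuition, one pairwise potential consistent with NetPLSA's graph-regularization view is shown below; this is our reading rather than the slide's missing formulas, and λ is a regularization weight introduced here for illustration:

```latex
V_{\{i,j\}}(\theta) \;=\; \lambda\, w_{ij} \sum_{z=1}^{T} \big(\theta_{iz} - \theta_{jz}\big)^{2}
```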
Decide Topic Number
- Q-function
  - Evaluates the modularity of a network partition
- Best topic number T
  - Maximize the Q-function by varying T
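A generic sketch of this selection loop, using Newman-style modularity as the Q-function (the slide's exact Q-function may differ; fit_model is a hypothetical placeholder for training the topic model with T topics):

```python
import numpy as np

def modularity(adj, labels):
    """Newman-style modularity Q for an undirected weighted graph and a hard
    cluster assignment (a generic sketch, not necessarily the slide's exact Q)."""
    m = adj.sum() / 2.0                      # total edge weight
    degrees = adj.sum(axis=1)
    n = adj.shape[0]
    q = 0.0
    for i in range(n):
        for j in range(n):
            if labels[i] == labels[j]:
                q += adj[i, j] - degrees[i] * degrees[j] / (2.0 * m)
    return q / (2.0 * m)

def pick_topic_number(adj, fit_model, candidate_Ts):
    """Fit the topic model for each candidate T, hard-assign each document to its
    most probable topic, and keep the T whose assignment maximizes Q."""
    best_T, best_q = None, -np.inf
    for T in candidate_Ts:
        theta = fit_model(T)                 # (N, T) topic distributions, model-specific
        labels = theta.argmax(axis=1)        # hard assignment for computing modularity
        q = modularity(adj, labels)
        if q > best_q:
            best_T, best_q = T, q
    return best_T
```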
Build Topic Hierarchies
- Different document networks have inherent granularities of topics
  - E.g., compare a conference network, a co-author network, and a co-citation network
- Use the Q-function to decide the number of branches
Correlations of Text and Network
- Consider two extreme cases of network and text
  - The links among documents are formed randomly: in this case the network structure will not help topic modeling and may even deteriorate the results
  - The links among documents are built exactly from the text information: in this case the network structure will not improve topic modeling performance much
- Correlation
Datasets
- DBLP
  - Conference network, author network
- Cora
  - Paper network via citations, co-authorship, and text
Case Study: Topic Hierarchy Building
- First-level topic number
Performance Study
- Document clustering accuracy
  - Improvement is largest for short text documents
Correlation Study
- If the network is built from the text itself, performance is not improved much
Conclusions
- iTopicModel: a unified model for document networks, with an efficient approximate inference method
- Studied several practical issues in applying iTopicModel
- Experiments show that iTopicModel achieves good performance
- Future work
  - How to combine different networks
  - Whether other priors provide better results
  - Topic prediction based on links
Q&A
Thanks!