UTOPIAN: User-Driven Topic Modeling Based on Interactive Nonnegative Matrix Factorization
Jaegul Choo 1*, Changhyun Lee 1, Chandan K. Reddy 2, and Haesun Park 1
1 Georgia Institute of Technology, 2 Wayne State University
*e-mail: jaegul.choo@cc.gatech.edu
Intro: Topic Modeling
• Documents 1 to 4; Topics 1 to 3
• Document: a distribution over topics
• Topic: a distribution over keywords
• Example keywords: brain, evolve, dna, genetic, gene, nerve, neuron, life, organism
Latent Dirichlet Allocation (LDA) in Visual Analytics
• LDA has been widely used in visual analytics.
• TIARA [Wei et al., KDD 10], iVisClustering [Lee et al., EuroVis 12], ParallelTopics [Dou et al., VAST 12], TopicViz [Eisenstein et al., CHI-WIP 12], …
*Images courtesy of the original papers.
Overview of Our Work
• Proposes nonnegative matrix factorization (NMF) for topic modeling.
• Highlights advantages of NMF over LDA in visual analytics.
• Presents UTOPIAN, an NMF-based interactive topic modeling system.
Figure panels: topic merging, topic splitting, keyword-induced topic creation, doc-induced topic creation
What is Nonnegative Matrix Factorization?
Nonnegative Matrix Factorization (NMF)
• Lower-rank approximation with nonnegativity constraints: $A \approx WH$
→ $\min_{W \ge 0,\, H \ge 0} \|A - WH\|_F$
• Why nonnegativity? Easy interpretation and semantically meaningful output
• Algorithm: alternating nonnegativity-constrained least squares [Kim et al., 2008]
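A minimal sketch of the factorization above, for illustration only. It does not implement the alternating nonnegativity-constrained least squares algorithm cited on the slide; it swaps in the simpler multiplicative update rules (Lee and Seung), which decrease the same Frobenius-norm objective while keeping W and H nonnegative. The function name, iteration count, and toy matrix are assumptions, not part of the talk.

```python
import numpy as np

def nmf_multiplicative(A, k, n_iter=200, eps=1e-10, seed=0):
    """Approximate a nonnegative A (m x n) as W (m x k) @ H (k x n).

    NOTE: multiplicative updates, NOT the ANLS algorithm named on the slide.
    """
    rng = np.random.default_rng(seed)
    m, n = A.shape
    # Strictly positive random start so the updates never divide by zero.
    W = rng.random((m, k)) + eps
    H = rng.random((k, n)) + eps
    errors = []
    for _ in range(n_iter):
        # Each update multiplies by a nonnegative ratio, so W, H stay >= 0.
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
        errors.append(np.linalg.norm(A - W @ H, "fro"))
    return W, H, errors

# Toy nonnegative matrix with an exact rank-3 factorization (made-up data).
rng = np.random.default_rng(1)
A = rng.random((6, 3)) @ rng.random((3, 8))
W, H, errors = nmf_multiplicative(A, k=3)
```

The `errors` list records $\|A - WH\|_F$ after every iteration, which makes it easy to plot how quickly the approximation settles.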
NMF as Topic Modeling: $A \approx WH$
• Documents 1 to 4; Topics 1 to 3
• Document: a distribution over topics
• Topic: a distribution over keywords
• Example keywords: brain, evolve, dna, genetic, gene, nerve, neuron, life, organism
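The reading of NMF as a topic model can be sketched as follows. The slide does not say which factor holds which role, so this example assumes a keyword-by-document matrix A, with columns of W as topics (weights over keywords) and columns of H as documents (weights over topics). The numbers in W and H are invented for illustration; only the keyword list comes from the slide.

```python
import numpy as np

vocab = ["brain", "nerve", "neuron", "dna", "genetic", "gene",
         "evolve", "life", "organism"]

# Hypothetical factors: W is keywords x topics, H is topics x documents.
W = np.array([
    [0.9, 0.0, 0.0],
    [0.7, 0.0, 0.1],
    [0.8, 0.1, 0.0],
    [0.0, 0.9, 0.1],
    [0.0, 0.8, 0.2],
    [0.1, 0.7, 0.0],
    [0.0, 0.2, 0.9],
    [0.1, 0.0, 0.8],
    [0.0, 0.1, 0.7],
])
H = np.array([
    [2.0, 0.0, 0.5, 0.0],
    [0.0, 3.0, 0.5, 1.0],
    [0.0, 1.0, 1.0, 3.0],
])

def top_keywords(W, vocab, n=3):
    """Return the n highest-weighted keywords of each topic (column of W)."""
    order = np.argsort(-W, axis=0)[:n]  # row indices, largest weight first
    return [[vocab[i] for i in order[:, t]] for t in range(W.shape[1])]

def doc_topic_distribution(H):
    """Normalize each document (column of H) so its topic weights sum to 1."""
    return H / H.sum(axis=0, keepdims=True)

topics = top_keywords(W, vocab)
doc_dist = doc_topic_distribution(H)
```

Normalizing the columns of H is one simple way to read each document as a distribution over topics; NMF itself does not force the weights to sum to one.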
Why NMF in Visual Analytics?
Advantages of NMF in Visual Analytics • Reliable algorithmic behaviors • Flexible support for user interactions
NMF vs. LDA: Consistency from Multiple Runs
• Measure: documents' topical membership changes among 10 runs
• Figure panels: InfoVis/VAST paper data set, 20 Newsgroups data set
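One way to quantify such membership changes between two runs is sketched below. The slide does not state how topics are matched across runs, so this example assumes each document gets a single topic label per run and aligns the labels by brute force over all permutations, which is only practical for a small number of topics. The function name and the label lists are hypothetical.

```python
from itertools import permutations

def membership_change(labels_a, labels_b, k):
    """Fraction of documents whose topic differs between two runs,
    after the best relabeling of run b's topics (brute force over k!)."""
    n = len(labels_a)
    best_agree = 0
    for perm in permutations(range(k)):
        # perm[b] is the run-a label assigned to run-b topic b.
        agree = sum(1 for a, b in zip(labels_a, labels_b) if a == perm[b])
        best_agree = max(best_agree, agree)
    return (n - best_agree) / n

run1 = [0, 0, 1, 1, 2, 2]
run2 = [2, 2, 0, 0, 1, 1]  # same grouping as run1, topic ids permuted
run3 = [2, 2, 0, 1, 1, 1]  # one document moved to another topic

same = membership_change(run1, run2, k=3)
moved = membership_change(run1, run3, k=3)
```

Here `same` is 0.0 because the two runs differ only in topic numbering, while `moved` is 1/6 because one of six documents changed topic.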
NMF vs. LDA: Empirical Convergence
• Measure: documents' topical membership changes between iterations
• InfoVis/VAST paper data set
• Running time: 48 seconds (NMF) vs. 10 minutes (LDA)
NMF vs. LDA: Topic Summary (Top Keywords)
InfoVis/VAST paper data set; table columns Topic 1 to Topic 7, keywords listed in slide order
• NMF, Run #1: visualization design information user analysis system graph layout visual analytics data sets color weaving
• NMF, Run #2: visualization design information user analysis system graph layout visual analytics data sets color weaving
• LDA, Run #1: documents similarities knowledge query collaborative social tree measures multivariate tree animation dimensions treemap
• LDA, Run #2: documents query analysts scatterplot spatial collaborative text documents multidimensional, high tree aggregation dimensions treemap
→ Topics are more consistent in NMF than in LDA.
→ Topic quality is comparable between NMF and LDA.
Advantages of NMF in Visual Analytics • Reliable algorithmic behaviors • Flexible support for user interactions
Weakly Supervised NMF [Choo et al., DMKD, accepted with rev.]
$\min_{W \ge 0,\, H \ge 0} \|A - WH\|_F^2 + \alpha \|(W - W_r) M_W\|_F^2 + \beta \|M_H (H - D_H H_r)\|_F^2$
• $W_r$, $H_r$: reference matrices for W and H
• $M_W$, $M_H$: diagonal matrices for weighting/masking columns/rows of W and H
→ Provides flexible yet intuitive means for user interaction.
→ Maintains the same computational complexity as original NMF.
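To make the three terms of the objective concrete, the sketch below only evaluates it for given matrices; it is not the solver. It follows the slide's notation literally, so $M_W$ right-multiplies $(W - W_r)$ and $M_H$, $D_H$ left-multiply; the shapes are chosen so that those products are conformable, which is an assumption, as is the role of $D_H$ (not defined on the slide). All numbers are invented.

```python
import numpy as np

def ws_nmf_objective(A, W, H, W_r, H_r, M_W, M_H, D_H, alpha, beta):
    """Evaluate ||A-WH||_F^2 + alpha*||(W-W_r)M_W||_F^2
    + beta*||M_H(H - D_H H_r)||_F^2, as written on the slide."""
    fit = np.linalg.norm(A - W @ H, "fro") ** 2
    w_pen = np.linalg.norm((W - W_r) @ M_W, "fro") ** 2
    h_pen = np.linalg.norm(M_H @ (H - D_H @ H_r), "fro") ** 2
    return fit + alpha * w_pen + beta * h_pen

# Tiny made-up example with k = 2 topics.
W = np.eye(2)
H = np.array([[1.0, 2.0], [3.0, 4.0]])
A = W @ H                                  # perfect fit, first term is 0
W_r = np.array([[0.0, 0.0], [0.0, 1.0]])   # reference for W
M_W = np.diag([1.0, 0.0])                  # supervise only column 1 of W
H_r = np.ones((2, 2))                      # reference for H
D_H = np.diag([1.0, 2.0])                  # hypothetical scaling of H_r
M_H = np.diag([0.0, 1.0])                  # supervise only row 2 of H

value = ws_nmf_objective(A, W, H, W_r, H_r, M_W, M_H, D_H,
                         alpha=2.0, beta=0.5)
```

In this example the fit term is 0, the W penalty is 1, and the H penalty is 5, so `value` is 0 + 2.0 * 1 + 0.5 * 5 = 4.5. Zero entries on the diagonals of the masking matrices switch supervision off for the corresponding columns or rows.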
UTOPIAN: User-Driven Topic Modeling Based on Interactive NMF Topic merging Keyword-induced topic creation Doc-induced topic creation Topic splitting
UTOPIAN Overview
• Supervised t-distributed stochastic neighbor embedding (t-SNE)
• User interactions supported
  • Keyword refinement
  • Topic merging/splitting
  • Keyword-/document-induced topic creation
• Real-time interaction via PIVE (Per-Iteration Visualization Environment)
Figure panels: topic merging, topic splitting, keyword-induced topic creation, doc-induced topic creation
Supervised t-SNE
Original t-SNE
• Documents are often too noisy to work with.
Supervised t-SNE
• d(xi, xj) ← α · d(xi, xj) if xi and xj belong to the same topic cluster.
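The distance adjustment can be sketched as a preprocessing step on a pairwise distance matrix. The slide gives neither the distance measure nor the value of α, so Euclidean distance and α = 0.5 are assumptions here, as are the function name and the sample points; choosing α below 1 pulls same-topic documents closer together before the embedding is computed.

```python
import numpy as np

def supervised_distances(X, topics, alpha=0.5):
    """Pairwise Euclidean distances, scaled by alpha for same-topic pairs."""
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))
    same = topics[:, None] == topics[None, :]  # True where topics match
    return np.where(same, alpha * D, D)

# Three made-up documents in 2-D; the first two share a topic cluster.
X = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
topics = np.array([0, 0, 1])
D = supervised_distances(X, topics, alpha=0.5)
```

The resulting matrix could then be handed to a t-SNE implementation that accepts precomputed distances.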
PIVE (Per-Iteration Visualization Environment) for Real-time Interaction [Choo et al., under revision]
Figure panels: standard approach, PIVE approach
Demo Video: http://tinyurl.com/UTOPIAN2013
Usage Scenario: Hyundai Genesis Review Data Initial result After interaction
Summary
• Presented UTOPIAN, a user-driven topic modeling system based on interactive NMF.
• Highlighted the advantages of NMF over LDA in visual analytics.
  • Reliable algorithmic behaviors
    • Consistency from multiple runs
    • Early empirical convergence
  • Flexible support for user interactions
    • Keyword refinement
    • Topic merging/splitting
    • Keyword-/document-induced topic creation
More in the Paper & On-going Work
More in the paper
• A general taxonomy of user interactions with computational methods
  • Keyword-based vs. document-based
  • Template-based vs. from-scratch-based
• Algorithmic details about supported user interactions
• Implementation details
• More usage scenarios
On-going work
• Scaling up the system with parallel distributed NMF
Thank you!
Jaegul Choo
jaegul.choo@cc.gatech.edu
http://www.cc.gatech.edu/~joyfull/
http://tinyurl.com/UTOPIAN2013
For more details, please find me at 'Meet the Candidate', A601+A602, 6 PM today.