Topic Models for Social Network Analysis and Bibliometrics

Andrew McCallum, Computer Science Department, University of Massachusetts Amherst
Joint work with Xuerui Wang, Natasha Mohanty, Andres Corrada, Chris Pal, Wei Li, David Mimno and Gideon Mann.

From Text to Decisions
[Figure: pipeline. Document collection -> Spider -> Filter -> IE (Segment, Classify, Associate, Cluster) -> Database -> Data Mining (Discover patterns: entity types, links / relations, events) -> Actionable knowledge (Prediction, Outlier detection, Decision support)]

Joint Inference
[Figure: the text-to-decisions pipeline (Document collection, Spider, Filter, IE, Database, Data Mining, Actionable knowledge), annotated with “Uncertainty Info” and “Emerging Patterns” exchanged between IE and Data Mining]

Unified Model
[Figure: the text-to-decisions pipeline with a Probabilistic Model at its center, covering IE (Segment, Classify, Associate, Cluster) and Data Mining (Discover patterns: entity types, links / relations, events)]
Discriminatively-trained undirected graphical models:
• Conditional Random Fields [Lafferty, McCallum, Pereira]
• Conditional PRMs [Koller…], [Jensen…], [Getoor…], [Domingos…]
Complex Inference and Learning: just what we researchers like to sink our teeth into!

Context
[Figure: the text-to-decisions pipeline (Document collection, Spider, Filter, IE, Database, Data Mining, Actionable knowledge)]
• Joint inference among detailed steps
• Leveraging Text in Social Network Analysis

Outline
Social Network Analysis with Topic Models
• Role Discovery (Author-Recipient-Topic Model, ART)
• Group Discovery (Group-Topic Model, GT)
• Enhanced Topic Models
  – Correlations among Topics (Pachinko Allocation, PAM)
  – Time Localized Topics (Topics-over-Time Model, TOT)
  – Markov Dependencies in Topics (Topical N-Grams Model, TNG)
• Bibliometric Impact Measures enabled by Topics
Multi-Conditional Mixtures

Social Network in an Email Dataset 13

Clustering words into topics with Latent Dirichlet Allocation [Blei, Ng, Jordan 2003]
Generative Process:
• For each document: sample a distribution over topics, θ (example: 70% Iraq war, 30% US election)
• For each word in the document:
  – Sample a topic, z (example: Iraq war)
  – Sample a word from the topic, w (example: “bombing”)
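The generative process above can be sketched in a few lines of NumPy. This is only an illustration: the two toy topics, the six-word vocabulary, and the Dirichlet parameter `alpha` are made-up assumptions, not values from the talk.

```python
import numpy as np

def generate_document(rng, topic_word, alpha, n_words):
    """Sample one document from the LDA generative process."""
    n_topics, vocab_size = topic_word.shape
    # Per-document distribution over topics (e.g. 70% / 30%).
    theta = rng.dirichlet(alpha)
    # One topic index z per word position.
    z = rng.choice(n_topics, size=n_words, p=theta)
    # One word index w per position, drawn from that position's topic.
    w = np.array([rng.choice(vocab_size, p=topic_word[k]) for k in z])
    return theta, z, w

# Hypothetical toy vocabulary and topics (assumptions for illustration).
vocab = ["bombing", "troops", "baghdad", "ballot", "campaign", "senator"]
topic_word = np.array([
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],  # toy "Iraq war" topic
    [0.0, 0.0, 0.0, 0.4, 0.3, 0.3],  # toy "US election" topic
])

rng = np.random.default_rng(0)
theta, z, w = generate_document(rng, topic_word, np.array([0.5, 0.5]), n_words=10)
print(theta, [vocab[i] for i in w])
```

Inference runs this process in reverse: given only the words, it recovers plausible values of θ and z.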

Example topics induced from a large collection of text [Tenenbaum et al]
(one topic per line, most probable words first)
• JOB, WORK, JOBS, CAREER, EXPERIENCE, EMPLOYMENT, OPPORTUNITIES, WORKING, TRAINING, SKILLS, CAREERS, POSITIONS, FIND, POSITION, FIELD, OCCUPATIONS, REQUIRE, OPPORTUNITY, EARN, ABLE
• SCIENCE, STUDY, SCIENTISTS, SCIENTIFIC, KNOWLEDGE, WORK, RESEARCH, CHEMISTRY, TECHNOLOGY, MANY, MATHEMATICS, BIOLOGY, FIELD, PHYSICS, LABORATORY, STUDIES, WORLD, SCIENTIST, STUDYING, SCIENCES
• BALL, GAME, TEAM, FOOTBALL, BASEBALL, PLAYERS, PLAY, FIELD, PLAYER, BASKETBALL, COACH, PLAYED, PLAYING, HIT, TENNIS, TEAMS, GAMES, SPORTS, BAT, TERRY
• FIELD, MAGNETIC, MAGNET, WIRE, NEEDLE, CURRENT, COIL, POLES, IRON, COMPASS, LINES, CORE, ELECTRIC, DIRECTION, FORCE, MAGNETS, BE, MAGNETISM, POLE, INDUCED
• STORY, STORIES, TELL, CHARACTER, CHARACTERS, AUTHOR, READ, TOLD, SETTING, TALES, PLOT, TELLING, SHORT, FICTION, ACTION, TRUE, EVENTS, TELLS, TALE, NOVEL
• MIND, WORLD, DREAM, DREAMS, THOUGHT, IMAGINATION, MOMENT, THOUGHTS, OWN, REAL, LIFE, IMAGINE, SENSE, CONSCIOUSNESS, STRANGE, FEELING, WHOLE, BEING, MIGHT, HOPE
• DISEASE, BACTERIA, DISEASES, GERMS, FEVER, CAUSE, CAUSED, SPREAD, VIRUSES, INFECTION, VIRUS, MICROORGANISMS, PERSON, INFECTIOUS, COMMON, CAUSING, SMALLPOX, BODY, INFECTIONS, CERTAIN
• WATER, FISH, SEA, SWIM, SWIMMING, POOL, LIKE, SHELL, SHARK, TANK, SHELLS, SHARKS, DIVING, DOLPHINS, SWAM, LONG, SEAL, DIVE, DOLPHIN, UNDERWATER

From LDA to Author-Recipient-Topic (ART) [McCallum et al 2005]

Inference and Estimation
Gibbs Sampling:
- Easy to implement
- Reasonably fast

Enron Email Corpus
• 250k email messages
• 23k people
Example message:
Date: Wed, 11 Apr 2001 06:56:00 -0700 (PDT)
From: debra.perlingiere@enron.com
To: steve.hooser@enron.com
Subject: Enron/TransAlta Contract dated Jan 1, 2001
Please see below. Katalin Kiss of TransAlta has requested an electronic copy of our final draft? Are you OK with this? If so, the only version I have is the original draft without revisions.
DP
Debra Perlingiere
Enron North America Corp.
Legal Department
1400 Smith Street, EB 3885
Houston, Texas 77002
dperlin@enron.com

Topics, and prominent senders / receivers, discovered by ART (topic names assigned by hand)

Topics, and prominent senders / receivers discovered by ART Beck = “Chief Operations Officer” Dasovich = “Government Relations Executive” Shapiro = “Vice President of Regulatory Affairs” Steffes = “Vice President of Government Affairs” 21

Comparing Role Discovery Traditional SNA ART Author-Topic distribution over authored topics connection strength (A, B) = distribution over recipients 22

Comparing Role Discovery: Tracy Geaconne vs. Dan McCarty
• Traditional SNA: Similar roles
• ART: Different roles
• Author-Topic: Different roles
Geaconne = “Secretary”, McCarty = “Vice President”

Comparing Role Discovery: Lynn Blair vs. Kimberly Watson
• Traditional SNA: Different roles
• ART: Very similar
• Author-Topic: Very different
Blair = “Gas pipeline logistics”, Watson = “Pipeline facilities planning”

McCallum Email Corpus 2004
• January - October 2004
• 23k email messages
• 825 people
Example message:
From: kate@cs.umass.edu
Subject: NIPS and..
Date: June 14, 2004 2:27:41 PM EDT
To: mccallum@cs.umass.edu
There is pertinent stuff on the first yellow folder that is completed either travel or other things, so please sign that first folder anyway. Then, here is the reminder of the things I'm still waiting for: NIPS registration receipt. CALO registration receipt. Thanks, Kate

Four most prominent topics in discussions with ____? 26

Two most prominent topics in discussions with ____? 28

Role-Author-Recipient-Topic Models 30

Results with RART: People in “Role #3” in Academic Email • • olc gauthier irsystem allan valerie tech steve lead Linux sysadmin for CIIR group mailing list CIIR sysadmins mailing list for dept. sysadmins Prof. , chair of “computing committee” second Linux sysadmin mailing list for dept. hardware head of dept. I. T. support 31

Roles for allan (James Allan)
• Role #3: I.T. support
• Role #2: Natural Language researcher
Roles for pereira (Fernando Pereira)
• Role #2: Natural Language researcher
• Role #4: SRI CALO project participant
• Role #6: Grant proposal writer
• Role #10: Grant proposal coordinator
• Role #8: Guests at McCallum’s house

ART: Roles but not Groups
• Traditional SNA: Block structured
• ART: Not
• Author-Topic: Not
[Figure: Enron Transwestern Division]

Groups and Topics
• Input:
  – Observed relations between people
  – Attributes on those relations (text, or categorical)
• Output:
  – Attributes clustered into “topics”
  – Groups of people, varying depending on topic

Discovering Groups from Observed Set of Relations
Student Roster: Adams, Bennett, Carter, Davis, Edwards, Frederking
Academic Admiration: Acad(A, B) Acad(C, B) Acad(A, D) Acad(C, D) Acad(B, E) Acad(D, E) Acad(B, F) Acad(D, F) Acad(E, A) Acad(F, A) Acad(E, C) Acad(F, C)
Admiration relations among six high school students.

Adjacency Matrix Representing Relations
Student Roster: Adams, Bennett, Carter, Davis, Edwards, Frederking
Academic Admiration: Acad(A, B) Acad(C, B) Acad(A, D) Acad(C, D) Acad(B, E) Acad(D, E) Acad(B, F) Acad(D, F) Acad(E, A) Acad(F, A) Acad(E, C) Acad(F, C)
[Figure: adjacency matrices over students A-F, in roster order and reordered by group. Groups: G1 = {A, C}, G2 = {B, D}, G3 = {E, F}]

Group Model: Partitioning Entities into Groups
Stochastic Blockstructures for Relations [Nowicki, Snijders 2001]
[Figure: graphical model with Dirichlet, Multinomial, Beta, and Binomial nodes. S: number of entities, G: number of groups]
Enhanced with arbitrary number of groups in [Kemp, Griffiths, Tenenbaum 2004]

Two Relations with Different Attributes
Student Roster: Adams, Bennett, Carter, Davis, Edwards, Frederking
Academic Admiration: Acad(A, B) Acad(C, B) Acad(A, D) Acad(C, D) Acad(B, E) Acad(D, E) Acad(B, F) Acad(D, F) Acad(E, A) Acad(F, A) Acad(E, C) Acad(F, C)
Social Admiration: Soci(A, B) Soci(A, D) Soci(A, F) Soci(B, A) Soci(B, C) Soci(B, E) Soci(C, B) Soci(C, D) Soci(C, F) Soci(D, A) Soci(D, C) Soci(D, E) Soci(E, B) Soci(E, D) Soci(E, F) Soci(F, A) Soci(F, C) Soci(F, E)
[Figure: adjacency matrices reordered by group. Academic groups: G1 = {A, C}, G2 = {B, D}, G3 = {E, F}. Social groups: G1 = {A, C, E}, G2 = {B, D, F}]

The Group-Topic Model: Discovering Groups and Topics Simultaneously [Wang, Mohanty, McCallum 2006]
[Figure: graphical model with Uniform, Dirichlet, Multinomial, Beta, and Binomial nodes]

Inference and Estimation
Gibbs Sampling:
- Many random variables can be integrated out
- Easy to implement
- Reasonably fast
We assume the relationship is symmetric.

Dataset #1: U.S. Senate
• 16 years of voting records in the US Senate (1989 - 2005)
• a Senator may respond Yea or Nay to a resolution
• 3423 resolutions with text attributes (index terms)
• 191 Senators in total across 16 years
Example resolution:
S. 543
Title: An Act to reform Federal deposit insurance, protect the deposit insurance funds, recapitalize the Bank Insurance Fund, improve supervision and regulation of insured depository institutions, and for other purposes.
Sponsor: Sen Riegle, Donald W., Jr. [MI] (introduced 3/5/1991) Cosponsors (2)
Latest Major Action: 12/19/1991 Became Public Law No: 102-242.
Index terms: Banks and banking, Accounting, Administrative fees, Cost control, Credit, Deposit insurance, Depressed areas, and 110 other terms
Votes: Adams (D-WA), Nay; Akaka (D-HI), Yea; Bentsen (D-TX), Yea; Biden (D-DE), Yea; Bond (R-MO), Yea; Bradley (D-NJ), Nay; Conrad (D-ND), Nay; ……

Topics Discovered (U.S. Senate)
Mixture of Unigrams:
• Education: education, school, aid, children, drug, students, elementary, prevention
• Energy: energy, power, water, nuclear, gas, petrol, research, pollution
• Military Misc.: government, military, foreign, tax, congress, aid, law, policy
• Economic: federal, labor, insurance, aid, tax, business, employee, care
Group-Topic Model:
• Education + Domestic: education, school, federal, aid, government, tax, energy, research
• Foreign: foreign, trade, chemicals, tariff, congress, drugs, communicable, diseases
• Economic: labor, insurance, tax, congress, income, minimum, wage, business
• Social Security + Medicare: social, security, insurance, medical, care, medicare, disability, assistance

Groups Discovered (US Senate) Groups from topic Education + Domestic 44

Senators Who Change Coalition the Most, Dependent on Topic
e.g. Senator Shelby (D-AL) votes
• with the Republicans on Economic
• with the Democrats on Education + Domestic
• with a small group of maverick Republicans on Social Security + Medicare

Dataset #2: The UN General Assembly
• Voting records of the UN General Assembly (1990 - 2003)
• A country may choose to vote Yes, No or Abstain
• 931 resolutions with text attributes (titles)
• 192 countries in total
• Also experiments later with resolutions from 1960-2003
Example resolution:
Vote on Permanent Sovereignty of Palestinian People, 87th plenary meeting
The draft resolution on permanent sovereignty of the Palestinian people in the occupied Palestinian territory, including Jerusalem, and of the Arab population in the occupied Syrian Golan over their natural resources (document A/54/591) was adopted by a recorded vote of 145 in favour to 3 against with 6 abstentions:
In favour: Afghanistan, Argentina, Belgium, Brazil, Canada, China, France, Germany, India, Japan, Mexico, Netherlands, New Zealand, Pakistan, Panama, Russian Federation, South Africa, Spain, Turkey, and 126 other countries.
Against: Israel, Marshall Islands, United States.
Abstain: Australia, Cameroon, Georgia, Kazakhstan, Uzbekistan, Zambia.

Topics Discovered (UN)
Mixture of Unigrams:
• Everything Nuclear: nuclear, weapons, use, implementation, countries
• Human Rights: rights, human, palestine, situation, israel
• Security in Middle East: occupied, israel, syria, security, calls
Group-Topic Model:
• Nuclear Non-proliferation: nuclear, states, united, weapons, nations
• Nuclear Arms Race: nuclear, arms, prevention, race, space
• Human Rights: rights, human, palestine, occupied, israel

Groups Discovered (UN)
The countries in each group are ordered by their 2005 GDP (PPP); only 5 countries are shown for groups that have more than 5 members.

Groups and Topics, Trends over Time (UN) 50

Latent Dirichlet Allocation [Blei, Ng, Jordan, 2003]
[Figure: plate diagram with symbols α, θ, z, w, β, n, φ, N, T]
• LDA 20, “images, motion, eyes”: visual, model, motion, field, object, images, objects, fields, receptive, eye, position, spatial, direction, target, vision, multiple, figure, orientation, location
• LDA 100, “motion, some junk”: motion, detection, field, optical, flow, sensitive, moving, functional, detect, contrast, light, dimensional, intensity, computer, mt, measures, occlusion, temporal, edge, real

Correlated Topic Model [Blei, Lafferty, 2005]
[Figure: plate diagram with a logistic normal prior and symbols z, w, β, n, φ, N, T]
Square matrix of pairwise correlations.

Pachinko Machine 54

Pachinko Allocation Model [Li, McCallum, 2005, 2006]
Thanks to Michael Jordan for suggesting the name.
[Figure: model structure, not the graphical model. A DAG with interior nodes 11; 21, 22; 31, 32, 33; 41-45, and words at the leaves]
Given: directed acyclic graph (DAG); at each interior node, a Dirichlet over its children; words at the leaves.
• For each document: sample a multinomial from each Dirichlet.
• For each word in this document: starting from the root, sample a child from successive nodes, down to a leaf. Generate the word at the leaf.
Like a Polya tree, but DAG shaped, with arbitrary number of children.
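A minimal sketch of the root-to-leaf sampling described above. The small DAG, its node names, and the all-ones Dirichlet parameters are hypothetical choices for illustration; they are not the structure shown on the slide.

```python
import numpy as np

# Hypothetical toy DAG: interior nodes map to their children; leaves are words.
children = {
    "root": ["s1", "s2"],
    "s1": ["t1", "t2"],
    "s2": ["t2", "t3"],      # t2 has two parents, so this is a DAG, not a tree
    "t1": ["word1", "word2"],
    "t2": ["word3", "word4"],
    "t3": ["word5", "word6"],
}
# One Dirichlet over the children of each interior node (assumed symmetric).
alphas = {node: np.ones(len(kids)) for node, kids in children.items()}

def sample_document(rng, n_words, root="root"):
    """Generate one document by repeated root-to-leaf walks."""
    # Per document: sample a multinomial from each interior node's Dirichlet.
    multinomials = {node: rng.dirichlet(a) for node, a in alphas.items()}
    words = []
    for _ in range(n_words):
        node = root
        # Walk down until reaching a node with no children (a word leaf).
        while node in children:
            kids = children[node]
            node = kids[rng.choice(len(kids), p=multinomials[node])]
        words.append(node)
    return words

doc = sample_document(np.random.default_rng(1), n_words=20)
print(doc)
```

With a single interior layer directly above the words, the same walk reduces to the LDA process of choosing a topic and then a word.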

Pachinko Allocation Model [Li, McCallum, 2005]
[Figure: model structure, not the graphical model]
DAG may have arbitrary structure
• arbitrary depth
• any number of children per node
• sparse connectivity
• edges may skip layers

Pachinko Allocation Model [Li, McCallum, 2005]
[Figure: model structure, not the graphical model]
Node levels, from the top of the DAG down:
• Distributions over distributions over topics...
• Distributions over topics; mixtures, representing topic correlations
• Distributions over words (like “LDA topics”)
Some interior nodes could contain one multinomial, used for all documents (i.e. a very peaked Dirichlet).

Pachinko Allocation Model [Li, McCallum, 2005]
[Figure: model structure, not the graphical model]
• Estimate all these Dirichlets from data.
• Estimate model structure from data (number of nodes, and connectivity).

Pachinko Allocation Special Cases
Latent Dirichlet Allocation
[Figure: the DAG structure for this special case]

Pachinko Allocation Special Cases
Hierarchical Latent Dirichlet Allocation (HLDA)
• Very low variance Dirichlet at root
• Each leaf of the HLDA topic hierarchy has a distribution over nodes on the path to the root.
[Figure: the HLDA hierarchy]

Pachinko Allocation on a Topic Hierarchy
Combining best of HLDA and Pachinko Allocation
[Figure: the PAM DAG, representing correlations among topic leaves; the HLDA hierarchy]

Pachinko Allocation Model... with two layers, no skipping layers, fully-connected from one layer to the next.
[Figure: root; a layer of “super-topics”; a layer of “sub-topics” (fixed multinomials); words 1-8 at the leaves]
Another special case would select only one super-topic per document.

Graphical Models: PAM, LDA (with fixed multinomials for topics)
[Figure: two plate diagrams. One with α, θ, z, w, β, φ; the other with α, θ, z1, z2, …, zm, w, β, φ; plates n, N, T]

Pachinko Allocation Model
• Likelihood
• Estimate z’s by Gibbs sampling
• Estimate α’s (the Dirichlet parameters) by moment matching.

Preliminary Experimental Results • Topic Coherence • Likelihood on held-out data • Document classification 65

NIPS Dataset
NIPS Conference Papers, Volumes 0-12, spanning 1987 - 1999. Prepared by Sam Roweis.
• 1740 papers
• 13649 words
• 2,301,375 tokens

Topic Coherence Comparison
• LDA 20, “models, estimation, stopwords”: models, model, parameters, distribution, bayesian, probability, estimation, data, gaussian, methods, likelihood, em, mixture, show, approach, paper, density, framework, approximation, markov
• LDA 100, “estimation, some junk”: estimation, likelihood, maximum, noisy, estimates, mixture, scene, surface, normalization, generated, measurements, surfaces, estimating, estimated, iterative, combined, figure, divisive, sequence, ideal
• PAM 100, “estimation”: estimation, bayesian, parameters, data, methods, estimate, maximum, probabilistic, distributions, noise, variables, noisy, inference, variance, entropy, models, framework, statistical, estimating
Example super-topic (sub-topic lists truncated on the slide):
• 33: input hidden units function number
• 27: estimation bayesian parameters data m
• 24: distribution gaussian markov likelihood
• 11: exact kalman full conditional determini
• 1: smoothing predictive regularizers inter

Topic Coherence Comparison
• LDA 20, “images, motion, eyes”: visual, model, motion, field, object, images, objects, fields, receptive, eye, position, spatial, direction, target, vision, multiple, figure, orientation, location
• LDA 100, “motion, some junk”: motion, detection, field, optical, flow, sensitive, moving, functional, detect, contrast, light, dimensional, intensity, computer, mt, measures, occlusion, temporal, edge, real
• PAM 100, “motion”: motion video surfaces figure scene camera noisy sequence activation generated analytical pixels measurements assigne advance lated shown closed perceptual
• PAM 100, “eyes”: eye head vor vestibulo oculomotor vestibular vary reflex vi pan rapid semicircular canals responds streams cholinergic rotation topographically detectors ning
• PAM 100, “images”: image digit faces pixel surface interpolation scene people viewing neighboring sensors patches manifold dataset magnitude transparency rich dynamical amounts

Blind Topic Evaluation
• Randomly select 25 similar pairs of topics generated from PAM and LDA
• 5 people
• Each asked “select the topic in each pair that you find most semantically coherent.”
Topic counts:
              LDA   PAM
5 votes         0     5
>= 4 votes      3     8
>= 3 votes      9    16
Prefer PAM

Topic Correlations in PAM
5000 research paper abstracts, from across all CS
Numbers on edges are supertopics’ Dirichlet parameters

Likelihood on Held Out Data • Likelihood comparison – NIPS abstracts – Train the model with 75% data – Calculate likelihood on 25% data • Calculate likelihood by – Sampling many, many documents from the model – Estimating a simple mixture of multinomials from these – Calculate the likelihood of data under this simple mixture. 73

Likelihood Comparison Varying number of topics 74

Document Classification
“Comp 5” from 20 Newsgroups corpus. Train on 25%, test on 75%.
Like Naive Bayes, but use LDA/PAM per-class instead of multinomial.
[Figure: Test Accuracy (%); ~2.5% increase]

Want to Model Trends over Time • Is prevalence of topic growing or waning? • Pattern appears only briefly – Capture its statistics in focused way – Don’t confuse it with patterns elsewhere in time • How do roles, groups, influence shift over time? 77

Topics over Time (TOT) [Wang, McCallum 2006]
[Figure: two plate diagrams of the model. Labeled elements: Dirichlet prior; multinomial over topics; topic index z; word w; multinomial over words; time stamp t; Beta over time; uniform prior; distribution on time stamps; plates T, Nd, D]
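A rough sketch of how a document with time stamps could be generated from the elements labeled in the diagram: a per-document multinomial over topics, a per-topic multinomial over words, and a per-topic Beta over time. It assumes time stamps are normalized to [0, 1] and that one time stamp is drawn per word from the Beta of that word's topic; the toy parameters below are made up for illustration.

```python
import numpy as np

def generate_tot_document(rng, alpha, topic_word, beta_params, n_words):
    """Sample topics, words, and time stamps for one document."""
    n_topics, vocab_size = topic_word.shape
    theta = rng.dirichlet(alpha)                      # multinomial over topics
    z = rng.choice(n_topics, size=n_words, p=theta)   # topic index per word
    w = np.array([rng.choice(vocab_size, p=topic_word[k]) for k in z])
    # Each topic k has its own Beta(a_k, b_k) over normalized time (assumption).
    t = np.array([rng.beta(beta_params[k][0], beta_params[k][1]) for k in z])
    return z, w, t

rng = np.random.default_rng(0)
topic_word = np.array([[0.7, 0.2, 0.1],    # toy topic 0
                       [0.1, 0.2, 0.7]])   # toy topic 1
beta_params = [(2.0, 8.0),   # topic 0 concentrated early in the time range
               (8.0, 2.0)]   # topic 1 concentrated late
z, w, t = generate_tot_document(rng, np.array([1.0, 1.0]), topic_word, beta_params, n_words=50)
print(t[z == 0].mean() if (z == 0).any() else None,
      t[z == 1].mean() if (z == 1).any() else None)
```

Because each topic carries its own distribution over time, a topic that appears only briefly can be captured without being mixed with patterns elsewhere in time.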

State of the Union Address
208 Addresses delivered between January 8, 1790 and January 29, 2002. To increase the number of documents, we split the addresses into paragraphs and treated them as ‘documents’. One-line paragraphs were excluded. Stopword removal was applied.
• 17156 ‘documents’
• 21534 words
• 669,425 tokens
Example ‘document’ (1910): Our scheme of taxation, by means of which this needless surplus is taken from the people and put into the public Treasury, consists of a tariff or duty levied upon importations from abroad and internal-revenue taxes levied upon the consumption of tobacco and spirituous and malt liquors. It must be conceded that none of the things subjected to internal-revenue taxation are, strictly speaking, necessaries. There appears to be no just complaint of this taxation by the consumers of these articles, and there seems to be nothing so well able to bear the burden without hardship to any portion of the people.

Comparing TOT against LDA 82

TOT on 17 years of NIPS proceedings 83

TOT on 17 years of NIPS proceedings TOT LDA 84

TOT versus LDA on my email 85

TOT improves ability to Predict Time
Predicting the year of a State-of-the-Union address. L1 = distance between predicted year and actual year.

Topics Modeling Phrases • Topics based only on unigrams often difficult to interpret • Topic discovery itself is confused because important meaning / distinctions carried by phrases. • Significant opportunity to provide improved language models to ASR, MT, IR, etc. 88

Topical N-gram Model [Wang, McCallum 2005]
[Figure: graphical model with per-position variables z1…z4, y1…y4, and w1…w4; plates D, T, W]

• LDA: algorithm, genetic, problems, efficient
• Topical N-grams: genetic algorithms, genetic algorithm, evolutionary computation, evolutionary algorithms, fitness function

Topic Comparison
• LDA: learning, optimal, reinforcement, state, problems, policy, dynamic, action, programming, actions, function, markov, methods, decision, rl, continuous, spaces, step, policies, planning
• Topical N-grams (2): reinforcement learning, optimal policy, dynamic programming, optimal control, function approximator, prioritized sweeping, finite-state controller, learning system, reinforcement learning rl, function approximators, markov decision problems, markov decision processes, local search, state-action pair, markov decision process, belief states, stochastic policy, action selection, upright position, reinforcement learning methods
• Topical N-grams (1): policy, action, states, actions, function, reward, control, agent, q-learning, optimal, goal, learning, space, step, environment, system, problem, steps, sutton, policies

Topic Comparison
• LDA: motion, visual, field, position, figure, direction, fields, eye, location, retina, receptive, velocity, vision, moving, system, flow, edge, center, light, local
• Topical N-grams (2): receptive field, spatial frequency, temporal frequency, visual motion energy, tuning curves, horizontal cells, motion detection, preferred direction, visual processing, area mt, visual cortex, light intensity, directional selectivity, high contrast, motion detectors, spatial phase, moving stimuli, decision strategy, visual stimuli
• Topical N-grams (1): motion, response, direction, cells, stimulus, figure, contrast, velocity, model, responses, stimuli, moving, cell, intensity, population, image, center, tuning, complex, directions

Topic Comparison
• LDA: word, system, recognition, hmm, speech, training, performance, phoneme, words, context, systems, frame, trained, speaker, sequence, speakers, mlp, frames, segmentation, models
• Topical N-grams (2): speech recognition, training data, neural network, error rates, neural net, hidden markov model, feature vectors, continuous speech, training procedure, continuous speech recognition, gamma filter, hidden control, speech production, neural nets, input representation, output layers, training algorithm, test set, speech frames, speaker dependent
• Topical N-grams (1): speech, word, training, system, recognition, hmm, speaker, performance, phoneme, acoustic, words, context, systems, frame, trained, sequence, phonetic, speakers, mlp, hybrid

Social Networks in Research Literature • Better understand structure of our own research area. • Structure helps us learn a new field. • Aid collaboration • Map how ideas travel through social networks of researchers. • Aids for hiring and finding reviewers! 96

Traditional Bibliometrics
• Analyses a small amount of data (e.g. 19 articles from a single issue of a journal)
• Uses “journal” as a proxy for “research topic” (but there is no journal for information extraction)
• Uses impact measures almost exclusively based on simple citation counts.
How can we use topic models to create new, interesting impact measures?

Our Data
• Over 1 million research papers, gathered as part of the Rexa.info portal.
• Cross-linked references / citations.

Finding Topics with TNG
• Traditional unigram LDA run on 1 million titles / abstracts (200 topics)
• ...select ~300k papers on ML, NLP, robotics, vision...
• Find 200 TNG topics among those papers.

Topical Bibliometric Impact Measures [Mann, Mimno, McCallum, 2006]
• Topical Citation Counts
• Topical Impact Factors
• Topical Longevity
• Topical Diversity
• Topical Precedence
• Topical Transfer

Topical Diversity Entropy of the topic distribution among papers that cite this paper (this topic). Low Diversity High Diversity 108
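As a small worked example of the definition above, the entropy can be computed from the topic labels of the citing papers. The function name, the toy topic labels, and the choice of natural logarithm are assumptions for illustration; the slides do not specify a log base.

```python
import math
from collections import Counter

def topical_diversity(citing_topics):
    """Entropy (in nats) of the topic distribution among citing papers.

    citing_topics: one topic label per citing paper.
    """
    counts = Counter(citing_topics)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

# Cited only from within one topic: low diversity (entropy 0).
narrow = topical_diversity(["IR"] * 8)
# Cited evenly from four topics: high diversity (entropy log 4).
broad = topical_diversity(["IR", "NLP", "Vision", "Robotics"] * 2)
print(narrow, broad)
```

The same function applies whether the cited unit is a whole topic or a single paper; only the set of citing papers changes.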

Topical Diversity Can also be measured on particular papers. . . 109

Topical Precedence (“Early-ness”)
Within a topic, what are the earliest papers that received more than n citations?
Information Retrieval:
• On Relevance, Probabilistic Indexing and Information Retrieval, Kuhns and Maron (1960)
• Expected Search Length: A Single Measure of Retrieval Effectiveness Based on the Weak Ordering Action of Retrieval Systems, Cooper (1968)
• Relevance feedback in information retrieval, Rocchio (1971)
• Relevance feedback and the optimization of retrieval effectiveness, Salton (1971)
• New experiments in relevance feedback, Ide (1971)
• Automatic Indexing of a Sound Database Using Self-organizing Neural Nets, Feiten and Gunzel (1982)

Topical Precedence (“Early-ness”)
Within a topic, what are the earliest papers that received more than n citations?
Speech Recognition:
• Some experiments on the recognition of speech, with one and two ears, E. Colin Cherry (1953)
• Spectrographic study of vowel reduction, B. Lindblom (1963)
• Automatic Lipreading to enhance speech recognition, Eric D. Petajan (1965)
• Effectiveness of linear prediction characteristics of the speech wave for..., B. Atal (1974)
• Automatic Recognition of Speakers from Their Voices, B. Atal (1976)

Topical Transfer from Digital Libraries to other topics
(other topic, citations: paper title)
• Web Pages, 31: Trawling the Web for Emerging Cyber-Communities, Kumar, Raghavan, ... 1999.
• Computer Vision, 14: On being ‘Undigital’ with digital cameras: extending the dynamic...
• Video, 12: Lessons learned from the creation and deployment of a terabyte digital video
• Graphs, 12: Trawling the Web for Emerging Cyber-Communities
• Web Pages, 11: WebBase: a repository of Web pages

Topical Transfer Citation counts from one topic to another. Map “producers and consumers” 113

Topic Model Musings • 3 years ago Latent Dirichlet Allocation appeared as a complex innovation. . . but now these methods & mechanics are well-understood. • Innovation now is to understand data and modeling needs, how to structure a new model to capture these. 115

Want a “topic model” with the advantages of CRFs
• Use arbitrary, overlapping features of the input.
• Undirected graphical model, so we don’t have to think about avoiding cycles.
• Integrate naturally with our other CRF components.
• Train “discriminatively” (What does this mean? Topic models are unsupervised!)
• Natural semi-supervised training

“Multi-Conditional Mixtures”
Latent Variable Models fit by Multi-way Conditional Probability [McCallum, Wang, Pal, 2005], [McCallum, Pal, Wang, 2006]
• For clustering structured data, à la Latent Dirichlet Allocation & its successors
• But an undirected model, like the Harmonium [Welling, Rosen-Zvi, Hinton, 2005]
• But trained by a “multi-conditional” objective: O = P(A|B, C) P(B|A, C) P(C|A, B), where e.g. A, B, C are different modalities

Objective Functions for Parameter Estimation
• Traditional, joint training (e.g. naive Bayes, most topic models)
• Traditional mixture model (e.g. LDA)
• Traditional, conditional training (e.g. MaxEnt classifiers, CRFs)
New, multi-conditional:
• Conditional mixtures (e.g. Jebara’s CEM, McCallum’s CRF string edit distance, ...)
• Multi-conditional (mostly conditional, generative regularization)
• Multi-conditional (for semi-sup)
• Multi-conditional (for transfer learning, 2 tasks, shared hiddens)

“Multi-Conditional Learning” (Regularization) [McCallum, Pal, Wang, 2006]

Predictive Random Fields: mixture of Gaussians on synthetic data
[Figure panels: Data, classify by color; Generatively trained; Multi-Conditionally-trained. Citations on the slide: [McCallum, Wang, Pal, 2005], [Jebara 1998]]

Multi-Conditional Mixtures vs. Harmonium on document retrieval task [McCallum, Wang, Pal, 2005]
[Figure: curves compared]
• Multi-Conditional, multi-way conditionally trained
• Conditionally-trained, to predict class labels
• Harmonium, joint, with class labels and words
• Harmonium, joint with words, no labels

Summary 124

Assigning topics to documents
1. Build a 200 topic n-gram topic model on 300k documents
2. Remove stopword or methodological topics (e.g. “efficient, fast, speed”)
3. For each document d, if more than 10% of d’s tokens are assigned to topic t, and that comprises more than two tokens, assign d to t
Each topic is now an intellectual “domain” that includes some number of documents. We can substitute topic for journal in most traditional bibliometric indicators. We can also now define several new indicators.
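Steps 2 and 3 above amount to a simple thresholding rule over per-token topic assignments. The sketch below assumes those assignments are already available from the topic model as a list of topic ids per document; the function name, parameter names, and toy data are illustrative, not from the talk.

```python
from collections import Counter

def assign_topics(token_topics, excluded=frozenset(),
                  min_fraction=0.10, min_tokens=2):
    """Return the topics a document is assigned to.

    token_topics: the topic id assigned to each token of the document.
    excluded: stopword or methodological topics removed in step 2.
    A topic qualifies if it covers MORE than min_fraction of the tokens
    and MORE than min_tokens tokens (step 3).
    """
    counts = Counter(token_topics)
    n_tokens = len(token_topics)
    return sorted(
        topic for topic, count in counts.items()
        if topic not in excluded
        and count > min_fraction * n_tokens
        and count > min_tokens
    )

# 30 tokens: topic 5 covers 20, topic 9 covers 7, topic 7 covers exactly 10%.
doc = [5] * 20 + [9] * 7 + [7] * 3
print(assign_topics(doc))                  # topics 5 and 9 qualify
print(assign_topics(doc, excluded={9}))    # topic 9 removed in step 2
```

A document can therefore belong to several topics at once, or to none if it is short or its tokens are spread thinly.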

Impact Factor
Journal Impact Factor: Citations from articles published in 2004 to articles in Cell published in 2002-3, divided by the number of articles published in Cell in 2002-3.
2004 Impact factors from JCR:
• Nature 32.182
• Cell 28.389
• JMLR 5.952
• Machine Learning 3.258
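The definition above can be written out directly. In this sketch the citation records are assumed to be available as (citing year, cited venue, cited year) triples and the publication counts as a dictionary; both layouts, and the toy numbers, are assumptions for illustration. Passing a topic label instead of a journal name gives the topical variant.

```python
def impact_factor(citations, published, venue, year):
    """Impact factor of `venue` (a journal or a topic) for `year`.

    citations: iterable of (citing_year, cited_venue, cited_year) triples.
    published: dict mapping (venue, year) -> number of articles published.
    """
    window = (year - 2, year - 1)   # e.g. 2002-3 for the 2004 impact factor
    n_citations = sum(
        1 for citing_year, cited_venue, cited_year in citations
        if citing_year == year and cited_venue == venue and cited_year in window
    )
    n_articles = sum(published.get((venue, y), 0) for y in window)
    return n_citations / n_articles if n_articles else float("nan")

# Toy data (made up): only the first two citations count toward Cell in 2004.
citations = [
    (2004, "Cell", 2002),
    (2004, "Cell", 2003),
    (2004, "Cell", 2001),   # cited article is too old
    (2003, "Cell", 2002),   # citing article is from the wrong year
    (2004, "JMLR", 2003),   # different venue
]
published = {("Cell", 2002): 1, ("Cell", 2003): 1}
print(impact_factor(citations, published, "Cell", 2004))
```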

Topic Impact Factor 127

Broad Impact: Diffusion Journal Diffusion: # of journals citing Cell divided by the total number of citations to Cell, over a given time period, times 100 Problem: relatively brittle at low citation counts. If a topic/journal is cited twice by two different topics/journals, it will have high diffusion. 128
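The brittleness noted above is easy to see numerically. This sketch assumes the input is a list naming the citing journal (or topic) of each citation; the function name and the toy data are illustrative.

```python
def diffusion(citing_sources):
    """Number of distinct citing journals/topics per 100 citations."""
    if not citing_sources:
        return float("nan")
    return 100.0 * len(set(citing_sources)) / len(citing_sources)

# Cited only twice, by two different sources: maximal diffusion.
rarely_cited = diffusion(["Topic A", "Topic B"])
# Cited 50 times by 5 different sources: much lower diffusion.
widely_cited = diffusion(["A", "B", "C", "D", "E"] * 10)
print(rarely_cited, widely_cited)
```

The rarely cited item scores higher than the widely cited one, which is the low-count behavior the slide describes.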

Broad Impact: Diversity Topic Diversity: Entropy of the distribution of citing topics Better at capturing broad end of impact spectrum: the high diffusion topics are identical to the least frequently cited topics 129

Broad Impact: Diversity Topic Diversity: Entropy of the distribution of citing topics Topic diversity can also be measured for papers: 130

Longevity: Cited Half Life Two views: • Given a paper, what is the median age of citations to that paper? • What is the median age of citations from current literature? 131
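Both views reduce to a median over citation ages. The sketch below assumes publication years are known for the cited and citing papers; function names and the toy years are illustrative.

```python
from statistics import median

def cited_half_life(paper_year, citing_years):
    """Median age of the citations TO a paper (first view)."""
    return median(y - paper_year for y in citing_years)

def citing_half_life(current_year, cited_years):
    """Median age of the citations FROM current literature (second view)."""
    return median(current_year - y for y in cited_years)

# A 1990 paper cited in 1991, 1992 and 1999: median citation age is 2 years.
to_paper = cited_half_life(1990, [1991, 1992, 1999])
# Papers from 2004 citing work from 2003, 2000, 1998 and 1990.
from_current = citing_half_life(2004, [2003, 2000, 1998, 1990])
print(to_paper, from_current)
```

Grouping papers by topic before taking the median would give a per-topic longevity measure.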

History: Topical Precedence
Within a topic, what are the earliest papers that received more than n citations?
Information Retrieval (138):
• On Relevance, Probabilistic Indexing and Information Retrieval, Kuhns and Maron (1960)
• Expected Search Length: A Single Measure of Retrieval Effectiveness Based on the Weak Ordering Action of Retrieval Systems, Cooper (1968)
• Relevance feedback in information retrieval, Rocchio (1971)
• Relevance feedback and the optimization of retrieval effectiveness, Salton (1971)
• New experiments in relevance feedback, Ide (1971)
• Automatic Indexing of a Sound Database Using Self-organizing Neural Nets, Feiten and Gunzel (1982)