Social Network Analysis with Textual Attributes
Xuerui Wang, Natasha Mohanty, Andrew McCallum
Computer Science Department, University of Massachusetts, Amherst
Social Network in an Email Dataset
From LDA to Author-Recipient-Topic (ART)
Inference and Estimation
Gibbs Sampling:
- Easy to implement
- Reasonably fast
Enron Email Corpus
• 250k email messages
• 23k people

Date: Wed, 11 Apr 2001 06:56:00 -0700 (PDT)
From: debra.perlingiere@enron.com
To: steve.hooser@enron.com
Subject: Enron/TransAlta Contract dated Jan 1, 2001

Please see below. Katalin Kiss of TransAlta has requested an electronic copy of our final draft? Are you OK with this? If so, the only version I have is the original draft without revisions.

DP

Debra Perlingiere
Enron North America Corp.
Legal Department
1400 Smith Street, EB 3885
Houston, Texas 77002
dperlin@enron.com
Topics, and prominent senders / receivers discovered by ART
Topic names assigned by hand
Topics, and prominent senders / receivers discovered by ART
Beck = “Chief Operations Officer”
Dasovich = “Government Relations Executive”
Shapiro = “Vice President of Regulatory Affairs”
Steffes = “Vice President of Government Affairs”
Comparing Role Discovery
• Traditional SNA: connection strength (A, B) = distribution over recipients
• Author-Topic: distribution over authored topics
• ART: [description not preserved in the slide text]
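Comparing two people's roles comes down to comparing two probability distributions (over recipients, or over topics). The slide does not name the comparison measure, so the sketch below assumes Jensen-Shannon divergence, a common symmetric choice. The function names are hypothetical.

```python
import math

def kl(p, q):
    """KL divergence KL(p || q); terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence (assumed measure, not named on the slide).

    0 means identical distributions; log(2) is the maximum.
    """
    mixture = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, mixture) + 0.5 * kl(q, mixture)
```

A small divergence between two people's distributions would be read as "similar roles", a large one as "different roles".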
Comparing Role Discovery: Tracy Geaconne and Dan McCarty
• Traditional SNA: similar roles
• ART: different roles
• Author-Topic: different roles
Geaconne = “Secretary”
McCarty = “Vice President”
Comparing Role Discovery: Lynn Blair and Kimberly Watson
• Traditional SNA: different roles
• ART: very similar
• Author-Topic: very different
Blair = “Gas pipeline logistics”
Watson = “Pipeline facilities planning”
ART: Roles but not Groups
• Traditional SNA: block structured
• ART: not block structured
• Author-Topic: not block structured
Enron TransWestern Division
Groups and Topics
• Input:
  – Observed relations between people
  – Attributes on those relations (text, or categorical)
• Output:
  – Attributes clustered into “topics”
  – Groups of people, varying depending on topic
Discovering Groups from Observed Set of Relations
Student Roster: Adams, Bennett, Carter, Davis, Edwards, Frederking
Academic Admiration:
Acad(A, B) Acad(C, B)
Acad(A, D) Acad(C, D)
Acad(B, E) Acad(D, E)
Acad(B, F) Acad(D, F)
Acad(E, A) Acad(F, A)
Acad(E, C) Acad(F, C)
Admiration relations among six high school students.
Adjacency Matrix Representing Relations
Student Roster: Adams, Bennett, Carter, Davis, Edwards, Frederking
Academic Admiration:
Acad(A, B) Acad(C, B)
Acad(A, D) Acad(C, D)
Acad(B, E) Acad(D, E)
Acad(B, F) Acad(D, F)
Acad(E, A) Acad(F, A)
Acad(E, C) Acad(F, C)
[Figure: adjacency matrices over students A to F, first in roster order (A B C D E F), then reordered as A C B D E F so that the groups G1 = {A, C}, G2 = {B, D}, G3 = {E, F} form blocks]
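The reordering on this slide can be reproduced directly from the listed Acad relations. The sketch below builds the adjacency matrix twice, once in roster order and once in the grouped order A C B D E F. The helper name `adjacency` is hypothetical.

```python
students = ["A", "B", "C", "D", "E", "F"]
acad = [("A", "B"), ("C", "B"), ("A", "D"), ("C", "D"),
        ("B", "E"), ("D", "E"), ("B", "F"), ("D", "F"),
        ("E", "A"), ("F", "A"), ("E", "C"), ("F", "C")]

def adjacency(order, relations):
    """Return a 0/1 matrix with rows = admirer, columns = admired, in `order`."""
    index = {name: i for i, name in enumerate(order)}
    matrix = [[0] * len(order) for _ in order]
    for src, dst in relations:
        matrix[index[src]][index[dst]] = 1
    return matrix

roster_order = adjacency(students, acad)
grouped_order = adjacency(["A", "C", "B", "D", "E", "F"], acad)

# In the grouped order the ones fall into 2x2 blocks:
# G1 -> G2, G2 -> G3, G3 -> G1.
for row in grouped_order:
    print(" ".join(str(cell) for cell in row))
```

In roster order the ones look scattered; after grouping, each pair of rows is identical, which is the block structure the group model looks for.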
Group Model: Partitioning Entities into Groups
Stochastic Blockstructures for Relations [Nowicki, Snijders 2001]
Distributions in the graphical model: Beta, Multinomial, Dirichlet, Binomial
S: number of entities
G: number of groups
Enhanced with arbitrary number of groups in [Kemp, Griffiths, Tenenbaum 2004]
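The slide names only the distribution families (Beta, Multinomial, Dirichlet, Binomial) and the sizes S and G. A minimal generative sketch under assumed wiring is given below: group proportions from a Dirichlet, one group per entity from a Multinomial, one link probability per ordered group pair from a Beta, and each relation from a Bernoulli (a single-trial Binomial). The names `theta`, `gamma`, `alpha`, `beta` and both functions are assumptions for illustration, not taken from the slide.

```python
import random

def sample_dirichlet(alpha, size, rng):
    """Symmetric Dirichlet draw via normalized Gamma variates."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]

def sample_block_model(num_entities, num_groups, alpha=1.0, beta=0.5, seed=0):
    """Sample groups and a relation matrix from an assumed block model."""
    rng = random.Random(seed)
    # theta: group proportions (Dirichlet)
    theta = sample_dirichlet(alpha, num_groups, rng)
    # one group assignment per entity (Multinomial)
    groups = rng.choices(range(num_groups), weights=theta, k=num_entities)
    # gamma[g][h]: probability that a member of g relates to a member of h (Beta)
    gamma = [[rng.betavariate(beta, beta) for _ in range(num_groups)]
             for _ in range(num_groups)]
    # each off-diagonal relation is a single Bernoulli trial
    relations = [[0] * num_entities for _ in range(num_entities)]
    for i in range(num_entities):
        for j in range(num_entities):
            if i != j:
                p = gamma[groups[i]][groups[j]]
                relations[i][j] = int(rng.random() < p)
    return groups, gamma, relations
```

Calling `sample_block_model(6, 3)` yields a six-entity relation matrix of the same shape as the student example; inference runs this process in reverse, recovering the groups from the observed matrix.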
Two Relations with Different Attributes
Student Roster: Adams, Bennett, Carter, Davis, Edwards, Frederking
Academic Admiration:
Acad(A, B) Acad(C, B)
Acad(A, D) Acad(C, D)
Acad(B, E) Acad(D, E)
Acad(B, F) Acad(D, F)
Acad(E, A) Acad(F, A)
Acad(E, C) Acad(F, C)
Social Admiration:
Soci(A, B) Soci(A, D) Soci(A, F)
Soci(B, A) Soci(B, C) Soci(B, E)
Soci(C, B) Soci(C, D) Soci(C, F)
Soci(D, A) Soci(D, C) Soci(D, E)
Soci(E, B) Soci(E, D) Soci(E, F)
Soci(F, A) Soci(F, C) Soci(F, E)
Relation attributes influence grouping!
[Figure: adjacency matrices for the two relations. Academic admiration gives G1 = {A, C}, G2 = {B, D}, G3 = {E, F}; social admiration gives G1 = {A, C, E}, G2 = {B, D, F}]
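The two groupings on this slide can be checked with a simple shortcut: put students in the same group when they admire exactly the same set of people. This exact-match rule is an illustrative assumption that works on this clean toy data; it is not the probabilistic group model. The function name is hypothetical.

```python
from collections import defaultdict

students = ["A", "B", "C", "D", "E", "F"]
acad = [("A", "B"), ("C", "B"), ("A", "D"), ("C", "D"),
        ("B", "E"), ("D", "E"), ("B", "F"), ("D", "F"),
        ("E", "A"), ("F", "A"), ("E", "C"), ("F", "C")]
soci = [("A", "B"), ("A", "D"), ("A", "F"), ("B", "A"), ("B", "C"), ("B", "E"),
        ("C", "B"), ("C", "D"), ("C", "F"), ("D", "A"), ("D", "C"), ("D", "E"),
        ("E", "B"), ("E", "D"), ("E", "F"), ("F", "A"), ("F", "C"), ("F", "E")]

def groups_by_identical_targets(names, relations):
    """Group people whose sets of admired people are identical."""
    targets = defaultdict(set)
    for src, dst in relations:
        targets[src].add(dst)
    buckets = defaultdict(list)
    for name in names:
        buckets[frozenset(targets[name])].append(name)
    return sorted(buckets.values())

acad_groups = groups_by_identical_targets(students, acad)
soci_groups = groups_by_identical_targets(students, soci)
print(acad_groups)
print(soci_groups)
```

The same six students split three ways under academic admiration and two ways under social admiration, which is the point of the slide: the attribute of the relation changes the grouping.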
Simple Topic Model: Good for Single Topic Documents
Mixture of Unigrams
Distributions in the graphical model: Uniform, Dirichlet, Multinomial
D: number of documents
T: number of topics
N_d: number of tokens in document d
Goal: Model relations and their (textual) attributes simultaneously to obtain better groups and more meaningful topics.
Example topic words: budget, funding, annual, cash
Example topic words: document, corrections, review, annual
The Group-Topic Model: Discovering Groups and Topics Simultaneously
Distributions in the graphical model: Uniform, Dirichlet, Multinomial, Beta, Multinomial, Dirichlet, Binomial
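The slide lists only the distribution families, in the order Uniform, Dirichlet, Multinomial, Beta, Multinomial, Dirichlet, Binomial. One plausible reading, combining the mixture of unigrams with the group model, is sketched below. Every symbol here is an assumption introduced for illustration: b indexes an event such as a bill, t_b is its topic, w its words, g_{st} the group of entity s under topic t, and v the pairwise relation.

```latex
\begin{align*}
t_b &\sim \mathrm{Uniform}(1/T) \\
\phi_t \mid \eta &\sim \mathrm{Dirichlet}(\eta) \\
w_{bk} \mid t_b, \phi &\sim \mathrm{Multinomial}(\phi_{t_b}) \\
\gamma^{(b)}_{gh} \mid \beta &\sim \mathrm{Beta}(\beta) \\
g_{st} \mid \theta_t &\sim \mathrm{Multinomial}(\theta_t) \\
\theta_t \mid \alpha &\sim \mathrm{Dirichlet}(\alpha) \\
v^{(b)}_{ij} \mid g, t_b, \gamma &\sim \mathrm{Binomial}\!\left(\gamma^{(b)}_{g_{i t_b}\, g_{j t_b}}\right)
\end{align*}
```

Under this reading the topic of an event selects which grouping of the entities applies, so the text and the relations constrain each other.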
Inference and Estimation
Gibbs Sampling:
- Many random variables can be integrated out
- Easy to implement
- Reasonably fast
We assume the relationship is symmetric.
Dataset #1: U.S. Senate
• 16 years of voting records in the US Senate (1989 – 2005)
• a Senator may respond Yea or Nay to a resolution
• 3423 resolutions with text attributes (index terms)
• 191 Senators in total across 16 years

S. 543
Title: An Act to reform Federal deposit insurance, protect the deposit insurance funds, recapitalize the Bank Insurance Fund, improve supervision and regulation of insured depository institutions, and for other purposes.
Sponsor: Sen Riegle, Donald W., Jr. [MI] (introduced 3/5/1991)
Cosponsors (2)
Latest Major Action: 12/19/1991 Became Public Law No: 102-242.
Index terms: Banks and banking; Accounting; Administrative fees; Cost control; Credit; Deposit insurance; Depressed areas; and 110 other terms

Adams (D-WA), Nay
Akaka (D-HI), Yea
Bentsen (D-TX), Yea
Biden (D-DE), Yea
Bond (R-MO), Yea
Bradley (D-NJ), Nay
Conrad (D-ND), Nay
……
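Roll-call votes like the ones listed for S. 543 can be turned into a symmetric pairwise relation between Senators. The sketch below assumes the relation is "voted the same way on this resolution"; the slides state only that the relationship is symmetric, so this definition and the function name are assumptions.

```python
from itertools import combinations

# Votes on S. 543 as listed on the slide (truncated there with "……").
votes = {
    "Adams": "Nay", "Akaka": "Yea", "Bentsen": "Yea", "Biden": "Yea",
    "Bond": "Yea", "Bradley": "Nay", "Conrad": "Nay",
}

def agreement_relation(votes_by_senator):
    """Map each ordered pair to 1 if both voted alike, else 0 (assumed definition)."""
    relation = {}
    for a, b in combinations(sorted(votes_by_senator), 2):
        agree = int(votes_by_senator[a] == votes_by_senator[b])
        relation[(a, b)] = agree
        relation[(b, a)] = agree  # symmetric by construction
    return relation

relation = agreement_relation(votes)
print(relation[("Adams", "Bradley")], relation[("Adams", "Akaka")])
```

One such relation per resolution, paired with that resolution's index terms, gives the "relations plus textual attributes" input the Group-Topic model expects.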
Topics Discovered (U.S. Senate)
Mixture of Unigrams
• Education: education, school, aid, children, drug, students, elementary, prevention
• Energy: energy, power, water, nuclear, gas, petrol, research, pollution
• Military Misc.: government, military, foreign, tax, congress, aid, law, policy
• Economic: federal, labor, insurance, aid, tax, business, employee, care
Group-Topic Model
• Education + Domestic: education, school, federal, aid, government, tax, energy, research
• Foreign: foreign, trade, chemicals, tariff, congress, drugs, communicable, diseases
• Economic: labor, insurance, tax, congress, income, minimum, wage, business
• Social Security + Medicare: social, security, insurance, medical, care, medicare, disability, assistance
Groups Discovered (US Senate)
Groups from topic Education + Domestic
Senators Who Change Coalition the Most, Dependent on Topic
e.g. Senator Shelby (D-AL) votes
• with the Republicans on Economic
• with the Democrats on Education + Domestic
• with a small group of maverick Republicans on Social Security + Medicare
Dataset #2: The UN General Assembly
• Voting records of the UN General Assembly (1990 - 2003)
• A country may choose to vote Yes, No or Abstain
• 931 resolutions with text attributes (titles)
• 192 countries in total
• Also experiments later with resolutions from 1960-2003

Vote on Permanent Sovereignty of Palestinian People, 87th plenary meeting
The draft resolution on permanent sovereignty of the Palestinian people in the occupied Palestinian territory, including Jerusalem, and of the Arab population in the occupied Syrian Golan over their natural resources (document A/54/591) was adopted by a recorded vote of 145 in favour to 3 against with 6 abstentions:
In favour: Afghanistan, Argentina, Belgium, Brazil, Canada, China, France, Germany, India, Japan, Mexico, Netherlands, New Zealand, Pakistan, Panama, Russian Federation, South Africa, Spain, Turkey, and 126 other countries.
Against: Israel, Marshall Islands, United States.
Abstain: Australia, Cameroon, Georgia, Kazakhstan, Uzbekistan, Zambia.
Topics Discovered (UN)
Mixture of Unigrams
• Everything Nuclear: nuclear, weapons, use, implementation, countries
• Human Rights: rights, human, palestine, situation, israel
• Security in Middle East: occupied, israel, syria, security, calls
Group-Topic Model
• Nuclear Non-proliferation: nuclear, states, united, weapons, nations
• Nuclear Arms Race: nuclear, arms, prevention, race, space
• Human Rights: rights, human, palestine, occupied, israel
Groups Discovered (UN)
The countries listed for each group are ordered by their 2005 GDP (PPP); only 5 countries are shown for groups that have more than 5 members.
Do We Get Better Groups with the GT Model?
Baseline Model
1. Cluster bills into topics using mixture of unigrams;
2. Apply group model on topic-specific subsets of bills.
GT Model
1. Jointly cluster topics and groups at the same time using the GT model.

Datasets   Avg. AI for Baseline   Avg. AI for GT   p-value
Senate     0.8198                 0.8294           < .01
UN         0.8548                 0.8664           < .01

Agreement Index (AI) measures group cohesion. Higher is better.
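The slide does not define the Agreement Index. A minimal sketch follows, assuming the three-option definition commonly used in roll-call voting analysis: the size of the largest voting bloc in a group, minus half of everyone else, divided by the group total. The exact variant behind the numbers in the table may differ, and the function name is hypothetical.

```python
def agreement_index(yes, no, abstain=0):
    """Cohesion of one group on one vote under the assumed definition.

    Returns 1.0 when the group is unanimous and 0.0 when it splits
    evenly across the three options.
    """
    total = yes + no + abstain
    if total == 0:
        raise ValueError("group cast no votes")
    largest = max(yes, no, abstain)
    return (largest - 0.5 * (total - largest)) / total

print(agreement_index(10, 0, 0))  # unanimous group
print(agreement_index(3, 1))      # Yea/Nay only, as in the Senate data
```

Averaging this value over groups and resolutions gives a single cohesion score per model, which is the kind of quantity the table compares.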
Groups and Topics, Trends over Time (UN)
An Alternative Group-Topic Model: “mixture of groups”
[Figure: the original GT model alongside the GT model with mixture of groups]
See also Latent mixed membership model [Airoldi, Blei, Xing, Fienberg 2005]
We thank Chris Pal for helpful discussion regarding the models.
End of talk