A Labeled LDA Approach to the Dynamics of Collaboration
Nikhil Johri
CS 224N

Motivating Questions
• What is the value added from academic collaboration?
  ◦ Division of labor?
  ◦ Mixture of individual contributions?
  ◦ New, synergistic ideas?
• Can we identify different collaboration styles?
  ◦ Synergy between established authors
  ◦ Ideas from newer vs. older authors
  ◦ Advisor + apprentices
• What are the characteristics of influential collaborations and collaborators?

Dataset
• ACL (Association for Computational Linguistics) corpus
  ◦ 16,000+ papers
  ◦ Ranges from 1965 to 2009
• Collaborations
  ◦ 7,500+ papers with 2 or more authors

Methodology
• Labeled LDA
• Cosine Similarities
• Look for Significant Patterns

Labeled LDA (Ramage et al.)
• Variation of Latent Dirichlet Allocation (LDA)
• Topics are constrained to be about specific tags associated with the documents
• In this case, tags = authors
• Result: a probabilistic term ‘signature’ for each author per year

Example term signatures:
  Daniel Jurafsky: semantic, chinese, different, sense, corpora, roles, corpus
  Chris Manning: models, entailment, inference, local, semantic, named
  Gerald Penn, Bill MacCartney (two signatures, boundary unclear): types, features, parametric, hpsg, logic, structure, grammar, semantic, alignment, entailment, natural, logic, nli, inference
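
The constraint "topics = author tags" can be illustrated with a small collapsed Gibbs sampler. This is a minimal sketch under stated assumptions, not the implementation used for these slides: the function name `labeled_lda_signatures`, the hyperparameters `alpha` and `beta`, and the toy documents with authors "A" and "B" are all hypothetical. A per-year signature could be obtained by tagging with (author, year) pairs instead of author names, which is also an assumption.

```python
import random
from collections import defaultdict


def labeled_lda_signatures(docs, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for a minimal Labeled LDA.

    docs: list of (labels, tokens); labels is the list of author tags.
    Returns {label: {term: probability}}, one term signature per label.
    """
    rng = random.Random(seed)
    vocab = sorted({w for _, tokens in docs for w in tokens})
    n_lw = defaultdict(lambda: defaultdict(int))  # label -> term -> count
    n_l = defaultdict(int)                        # label -> total tokens
    n_dl = [defaultdict(int) for _ in docs]       # per document: label -> count
    z = []                                        # current label of every token

    def update(d, label, term, delta):
        n_lw[label][term] += delta
        n_l[label] += delta
        n_dl[d][label] += delta

    # Initialise each token with a random label from its own document's tags.
    for d, (labels, tokens) in enumerate(docs):
        z.append([rng.choice(labels) for _ in tokens])
        for term, label in zip(tokens, z[d]):
            update(d, label, term, +1)

    for _ in range(n_iter):
        for d, (labels, tokens) in enumerate(docs):
            for i, term in enumerate(tokens):
                update(d, z[d][i], term, -1)
                # The Labeled LDA constraint: only this document's tags compete.
                weights = [
                    (n_dl[d][k] + alpha) * (n_lw[k][term] + beta)
                    / (n_l[k] + len(vocab) * beta)
                    for k in labels
                ]
                z[d][i] = rng.choices(labels, weights=weights)[0]
                update(d, z[d][i], term, +1)

    return {
        k: {w: (n_lw[k][w] + beta) / (n_l[k] + len(vocab) * beta) for w in vocab}
        for k in list(n_l)
    }


# Hypothetical toy corpus: two single-author papers and one collaboration.
docs = [
    (["A"], ["prosodic", "intensity", "prosodic"]),
    (["B"], ["entailment", "inference", "entailment"]),
    (["A", "B"], ["prosodic", "entailment", "intensity", "inference"]),
]
signatures = labeled_lda_signatures(docs)
```

Each signature is a probability distribution over the vocabulary, so it can be compared directly against a paper's term vector.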

Cosine Similarity

Document term vector
  Term          Frequency
  disfluencies  3
  factors       3
  prosodic      2
  intensity     1
  duration      2
  …

Author 1 term signature
  Term          Weight
  prosodic      12.8142
  intensity     15.1259
  disfluencies  1.9976
  …

Author 2 term signature
  Term          Weight
  semantic      19.2921
  intensity     1.8909
  entailment    13.1920
  …

cosine(document term vector, Author 1 term signature) = similarity between author 1 and document
cosine(document term vector, Author 2 term signature) = similarity between author 2 and document
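
The comparison on this slide can be sketched as a cosine similarity between sparse term vectors. The code below is an illustrative sketch, not the author's implementation. It uses only the handful of terms visible on the slide (the real vectors are truncated with "…"), and the grouping of the weights into the two author signatures is an assumption. The function name `cosine_similarity` is also hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term vectors stored as dicts."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Truncated vectors from the slide; the assignment to authors is assumed.
document = {"disfluencies": 3, "factors": 3, "prosodic": 2,
            "intensity": 1, "duration": 2}
author_1 = {"prosodic": 12.8142, "intensity": 15.1259, "disfluencies": 1.9976}
author_2 = {"semantic": 19.2921, "intensity": 1.8909, "entailment": 13.1920}

sim_1 = cosine_similarity(document, author_1)
sim_2 = cosine_similarity(document, author_2)
```

With these truncated vectors, author 1 scores higher than author 2 because author 1's signature shares several heavily weighted terms with the document, while author 2's overlaps only on "intensity".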

Sample Results
• Average established author similarity score to papers
• Break down by subfield
  ◦ High similarity = more rigid, formal, requires training
  ◦ Low similarity = more flexible, less defined, open to novelty

High Similarity Scores
  Topic                            Score
  Prosody                          0.2341
  Unification Based Grammars       0.2236
  Bilingual Word Alignment         0.2197
  Categorial Grammar + Logic       0.2195
  Statistical Machine Translation  0.2190

Low Similarity Scores
  Topic                            Score
  Question Answering / Dialog      0.1283
  Sentiment Analysis               0.1465
  Summarization                    0.1496
  Planning/BDI                     0.1505
  Anaphora Resolution              0.1558
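
The per-subfield breakdown amounts to grouping similarity scores by topic and averaging them. The sketch below assumes each record is a (subfield, similarity) pair for one established author on one paper. The function name `average_by_subfield` and the numbers in `records` are hypothetical and are not results from the study.

```python
from collections import defaultdict
from statistics import mean


def average_by_subfield(records):
    """Average similarity per subfield, sorted from highest to lowest."""
    by_field = defaultdict(list)
    for subfield, similarity in records:
        by_field[subfield].append(similarity)
    averages = [(field, mean(scores)) for field, scores in by_field.items()]
    return sorted(averages, key=lambda pair: pair[1], reverse=True)


# Hypothetical (subfield, similarity) records, for illustration only.
records = [
    ("Prosody", 0.30), ("Prosody", 0.20),
    ("Summarization", 0.10), ("Summarization", 0.20),
]
ranking = average_by_subfield(records)
```

The head of `ranking` corresponds to the high-similarity table and the tail to the low-similarity table.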

Sample Results
• Identification of ‘hedgehogs’ and ‘foxes’
  ◦ Hedgehogs specialize in a single area
  ◦ Foxes dabble in several areas

Top ‘Fox’ Authors
  Author                   Score
  Marcus, Mitchell P.      0.0999
  Pustejovsky, James D.    0.1047
  Pereira, Fernando C. N.  0.1434
  Allen, James F.          0.1446
  Hahn, Udo                0.1501

Top ‘Hedgehog’ Authors
  Author                   Score
  Koehn, Philipp           0.4346
  Pedersen, Ted            0.4115
  Och, Franz Josef         0.3967
  Ney, Hermann             0.3730
  Sumita, Eiichiro         0.3671

Conclusion
• Suggested a system to determine author deviation from previous work on later papers
• Tested the system on ACL collaborations
• Presented preliminary results showing:
  ◦ Hedgehog / fox style collaborators
  ◦ Subfields that offer more flexibility for unestablished authors vs. those that require more training
• Stated a theory of collaboration styles and described how to use the system to identify these