Detecting Research Topics via the Correlation between Graphs
Detecting Research Topics via the Correlation between Graphs and Texts
Yookyung Jo, Dept. of Computer Science, Cornell University
Carl Lagoze†, and C. Lee Giles‡
† Dept. of Computer Science, Cornell University
‡ Information Sciences and Technology, The Pennsylvania State University
Acknowledgment
• John E. Hopcroft, Thorsten Joachims, Simeon Warner, Isaac G. Councill
• NSF IIS-0430906, 0227648, 0227888, and 0424671
Topic detection
• Problem statement: how to detect topics in a linked corpus (e.g. Citeseer, arXiv, the Web, …)
• Our strategy: the correlation between
  – the distribution of terms representing a topic
  – the distribution of citation links
Correlation between Terms and Links
[Figure: two panels, "Term citation graph for α" and "Term citation graph for η", with nodes labeled α or η]
• Term α: representing a topic (e.g. “sensor network” or “association rule”)
• Term η: not representing a topic (e.g. “six months” or “practical examples”)
Term citation graph for a term α
[Figure: citation graph whose nodes are labeled α]
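A minimal sketch of how a term citation graph could be built, assuming (the slide shows this only as a figure) that the graph for a term consists of the documents containing that term plus the citation links among those documents. The function name, the input layout, and the toy corpus are all hypothetical.

```python
def term_citation_graph(term, doc_terms, citations):
    """Return (nodes, edges) of the citation subgraph induced by `term`.

    doc_terms: dict mapping a document id to the set of terms it contains
    citations: iterable of (citing_id, cited_id) pairs
    Assumption: nodes are the documents containing `term`; edges are the
    citation links whose two endpoints both contain `term`.
    """
    nodes = {doc for doc, terms in doc_terms.items() if term in terms}
    edges = {(u, v) for u, v in citations
             if u in nodes and v in nodes and u != v}
    return nodes, edges


# Toy corpus, invented for illustration only.
doc_terms = {
    "d1": {"sensor network", "six months"},
    "d2": {"sensor network"},
    "d3": {"sensor network"},
    "d4": {"six months"},
    "d5": {"six months"},
}
citations = [("d2", "d1"), ("d3", "d1"), ("d4", "d2"), ("d5", "d3")]

topic_nodes, topic_edges = term_citation_graph("sensor network", doc_terms, citations)
other_nodes, other_edges = term_citation_graph("six months", doc_terms, citations)
```

In this toy corpus the documents containing "sensor network" cite one another, while the documents containing "six months" share no links among themselves, which mirrors the contrast between the α and η panels.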
Detecting a topic via a single term
• Given a term A
• Binary decision of whether A represents a topic or not
  – H1: A represents a topic
  – H0: A does not represent a topic
• GA: the term citation graph for A
• O(GA): link connectivity observation on GA
• Finally, a ranked list of terms
Loglikelihood of H1
• Observation O(GA):
  – For each node i in GA, is it connected to other nodes in GA by at least one link? This probability = pc
• Under H1
  – pc1: estimate of pc
  – pc1 set to a value close to 1 (e.g. pc1 = 0.9)
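One way to write this loglikelihood down, assuming (the slide does not state it) that the per-node connectivity indicators are treated as independent Bernoulli trials. Here n is the number of nodes in GA, nc is the number of nodes with at least one connection within GA, and p_{c1} stands for pc1:

```latex
\log P\bigl(O(G_A) \mid H_1\bigr)
  = n_c \log p_{c1} + (n - n_c)\,\log\bigl(1 - p_{c1}\bigr)
```

With pc1 close to 1, a graph in which most nodes are connected scores high under H1, while every unconnected node contributes the large negative term log(1 - pc1).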
Loglikelihood of H0
• pc0: estimate of pc
[Figure: GA, with “?” marks]
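A hedged sketch of how the two hypotheses could be compared and turned into a ranking. It assumes independent Bernoulli connectivity per node and scores a term by the loglikelihood ratio of H1 to H0. The slide does not say how pc0 is estimated, so pc0 is a caller-supplied input here; the function names and the example numbers are hypothetical.

```python
import math


def loglik(n, nc, pc):
    """Bernoulli loglikelihood of seeing nc connected nodes out of n.

    pc must lie strictly between 0 and 1.
    """
    return nc * math.log(pc) + (n - nc) * math.log(1.0 - pc)


def topic_score(n, nc, pc0, pc1=0.9):
    """Loglikelihood ratio log P(O|H1) - log P(O|H0).

    pc1 = 0.9 follows the example value on the slide; pc0 is assumed
    to come from some separate estimate that is not specified here.
    """
    return loglik(n, nc, pc1) - loglik(n, nc, pc0)


# Hypothetical observations: (n, nc, pc0) per term.
observations = {
    "term_a": (1000, 900, 0.2),  # far more connected than pc0 predicts
    "term_b": (1000, 200, 0.2),  # about as connected as pc0 predicts
}

# Higher score means stronger evidence that the term represents a topic.
ranked = sorted(observations,
                key=lambda t: topic_score(*observations[t]),
                reverse=True)
```

Sorting every term by this score gives a ranked list in which terms with densely connected citation graphs rise to the top and terms whose graphs look like H0 sink to the bottom.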
Evaluation
• arXiv
  – A physics literature collection
  – Years 1991-2006, 7 major arXiv areas
  – 214,546 papers, 2,165,170 citation links
  – Abstract as document
  – 137,098 bi-gram terms after low-frequency pruning
• Citeseer
  – A computer science related collection
  – Years 1994-2004
  – 716,771 papers, 1,740,326 citation links
  – Abstract + title as document
  – 631,839 bi-gram terms after low-frequency pruning
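A rough sketch of bi-gram term extraction with a low-frequency prune. The slides do not give the tokenizer or the pruning rule, so the regular expression, the document-frequency criterion, and the `min_docs` threshold below are assumptions made for illustration.

```python
import re
from collections import Counter


def bigram_terms(text):
    """Return the set of adjacent-word pairs in `text`, lowercased.

    Hyphenated words such as "real-time" are kept as one word. Pairs may
    span sentence boundaries in this simplified tokenizer.
    """
    words = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())
    return {f"{a} {b}" for a, b in zip(words, words[1:])}


def prune_low_frequency(docs, min_docs=2):
    """Keep only bi-grams that occur in at least `min_docs` documents.

    docs: dict mapping a document id to its text (e.g. an abstract).
    min_docs is a hypothetical threshold, not a value from the slides.
    """
    per_doc = {doc_id: bigram_terms(text) for doc_id, text in docs.items()}
    doc_freq = Counter()
    for terms in per_doc.values():
        doc_freq.update(terms)
    kept = {term for term, count in doc_freq.items() if count >= min_docs}
    return {doc_id: terms & kept for doc_id, terms in per_doc.items()}


# Toy documents, invented for illustration only.
docs = {
    "d1": "Sensor network routing",
    "d2": "A sensor network survey",
    "d3": "Routing in practice",
}
doc_terms = prune_low_frequency(docs)
```

Only "sensor network" appears in two of the toy documents, so it is the only bi-gram that survives the prune.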
arXiv (physics): topic terms at top ranks
Rank  Topic (term)            <n, nc, |E|>
1     Black hole              <4978, 4701, 38952>
2     Quantum hall            <1863, 1493, 4862>
3     Black holes             <3131, 2896, 22824>
4     Higgs boson             <2079, 1896, 12607>
5     Renormalization group   <3738, 2920, 8490>
6     Quantum gravity         <2014, 1724, 9693>
7     Standard model          <7848, 7145, 53829>
8     Heavy quark             <1671, 1473, 6570>
9     Cosmological constant   <2141, 1815, 7134>
10    Quantum dot             <1366, 1031, 2926>
11    Chiral perturbation     <1132, 1050, 5578>
12    Form factors            <1578, 1354, 5616>
13    Lattice qcd             <1425, 1265, 5240>
14    String theory           <3818, 3539, 26250>
15    Hubbard model           <1702, 1167, 2678>
n: number of nodes in GA
nc: number of nodes with at least one connection within GA
|E|: number of edges in GA
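The triple <n, nc, |E|> can be computed directly from a node set and an edge list. The sketch below is one straightforward reading of the three definitions above; the function name is hypothetical, and dropping duplicate links and self-citations is an assumption.

```python
def connectivity_stats(nodes, edges):
    """Return (n, nc, num_edges) for a graph given as nodes and edge pairs.

    n: number of nodes
    nc: number of nodes that are an endpoint of at least one internal edge
    num_edges: number of distinct edges with both endpoints in the graph
    Assumption: duplicate links and self-citations are not counted.
    """
    nodes = set(nodes)
    internal = {(u, v) for u, v in edges
                if u in nodes and v in nodes and u != v}
    connected = {node for edge in internal for node in edge}
    return len(nodes), len(connected), len(internal)


# Four documents; "d" has no link to the others, and one link is repeated.
stats = connectivity_stats({"a", "b", "c", "d"},
                           [("a", "b"), ("c", "a"), ("a", "b")])
print(stats)  # (4, 3, 2)
```

Read this way, the top-ranked entry "Black hole" has nc/n = 4701/4978, so about 94% of its documents are linked to at least one other document containing the same term.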
arXiv (physics): term citation graphs for intermediate-rank topic terms
[Figure: term citation graphs; labels: “time”, “Research communities”]
arXiv (physics): terms at bottommost ranks
Rank    Term
137098  we show
137097  has been
137096  we find
137095  we present
137094  we study
137093  we have
137092  we also
137091  have been
137090  we discuss
137089  we consider
137088  does not
137087  our results
137086  we investigate
137085  into account
137084  we propose
Bottom entries are stop-phrases
Citeseer (CS): top-rank terms
Top-rank terms from two different time periods
• Time up to 1999
• Time since 2000
Rank  Topic (term) up to 1999     Topic (term) since 2000
1     logic programs              sensor networks
2     model checking              hoc networks
3     semidefinite programming    logic programs
4     inductive logic             image retrieval
5     petri nets                  support vector
6     genetic programming         congestion control
7     interior point              model checking
8     kolmogorov complexity       decision diagrams
9     automatic differentiation   wireless sensor
10    complementarity problems    ad hoc
11    congestion control          intrusion detection
12    complementarity problem     vector machines
13    conservation laws           mobile ad
14    linear logic                binary decision
15    timed automata              sensor network
16    situation calculus          energy consumption
17    real-time database          content-based image
18    motion planning             semantic web
19    duration calculus           fading channels
20    volume rendering            xml data
21    chain monte                 source separation
22    association rules           timed automata
Citeseer: topic time evolution
[Figure panels: “logic programs”, “support vector”, “sensor networks”, “congestion control”]
Citeseer: topic time evolution
[Figure panels: “petri nets”, “genetic programming”, “association rules”, “semantic web”]
Algorithm Extension
• To detect topics represented by a single term
  – Algorithm
  – Evaluation on arXiv, Citeseer
• To detect topics defined by a set of terms
  – Algorithm
  – Evaluation on arXiv
Conclusion (poster session: #7)
• Topic detection via the correlation between terms and links
• Our algorithm (in its evaluation on arXiv, Citeseer)
  – Effectively discovers topics represented by a single term or by a set of terms
  – Identifies stop-phrases as a by-product
  – Discovers topics in their natural scale
  – Demonstrates its utility in trend analysis
  – Shows the association between topic scale and specificity