Advisor-advisee Relationship Mining from Research Publication Network
Chi Wang¹, Jiawei Han¹, Yuntao Jia¹, Jie Tang², Duo Zhang¹, Yintao Yu¹, Jingyi Guo²
¹ University of Illinois at Urbana-Champaign, {chiwang1, hanj, yjia3, dzhang22, yintao}@illinois.edu
² Tsinghua University, {jietang, guojy07@mails}.tsinghua.edu.cn

Motivation
• Latent knowledge in information networks:
– Relationships: friends/relatives/colleagues/enemies?
• If these relationships can be mined from links, it will benefit the study of:
– Community structure: clustering & classification
– Searching: search & ranking
– Evolution patterns: prediction & recommendation

Overall Framework

Overall Framework
• a_i: author i
• p_j: paper j
• py: paper year
• pn: paper #
• st_{i,y_i}: starting time
• ed_{i,y_i}: ending time
• r_{i,y_i}: ranking score

Heuristics
• ASSUMPTION 1: at each time t during the publication history of a node x, x is either being advised or not being advised. Once x starts to advise another node, it will never be advised again.
• ASSUMPTION 2: for a given pair of advisor and advisee, the advisor always has a longer publication history than the advisee.

Stage 1: Preprocessing
• Convert the author-paper bipartite network into a homogeneous author collaboration network.
• Then apply a filtering process to remove unlikely advisor-advisee relations.
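The projection and a first filter can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the record layout `(paper_id, year, authors)` and the function names are hypothetical, and the only filter applied is Assumption 2, read here as "the advisor's first publication year is strictly earlier than the advisee's".

```python
from collections import defaultdict
from itertools import combinations


def build_coauthor_network(papers):
    """Project author-paper records onto a coauthor network.

    papers: iterable of (paper_id, year, authors) tuples (hypothetical layout).
    Returns (pub_years, joint_years):
      pub_years[a]        -> years of all papers by author a
      joint_years[(a, b)] -> years of papers coauthored by a and b (a < b)
    """
    pub_years = defaultdict(list)
    joint_years = defaultdict(list)
    for _paper_id, year, authors in papers:
        unique = sorted(set(authors))
        for a in unique:
            pub_years[a].append(year)
        for a, b in combinations(unique, 2):
            joint_years[(a, b)].append(year)
    return pub_years, joint_years


def candidate_advisors(pub_years, joint_years):
    """Keep only coauthor pairs compatible with Assumption 2.

    Assumption of this sketch: "longer publication history" is approximated
    by a strictly earlier first publication year.
    """
    candidates = defaultdict(list)
    for a, b in joint_years:
        first_a, first_b = min(pub_years[a]), min(pub_years[b])
        if first_a < first_b:
            candidates[b].append(a)   # a may have advised b
        elif first_b < first_a:
            candidates[a].append(b)   # b may have advised a
        # equal first years: neither direction is kept
    return {advisee: sorted(advs) for advisee, advs in candidates.items()}


papers = [
    ("p1", 1995, ["A"]),
    ("p2", 2000, ["A", "B"]),
    ("p3", 2001, ["A", "B", "C"]),
    ("p4", 2001, ["B", "C"]),
]
pub_years, joint_years = build_coauthor_network(papers)
print(candidate_advisors(pub_years, joint_years))
```

Each surviving pair is a directed candidate edge (advisee to possible advisor). Further filtering rules would prune this candidate set more aggressively.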

Stage 1: Preprocessing
• Author a_j is not considered to be a_i’s advisor if one of the following conditions holds:

Stage 1: Preprocessing
• In addition, estimate:
– the starting time st, taken as the time the two authors started to collaborate;
– the ending time ed, which can be estimated, for example, as the time point when the Kulczynski measure starts to decrease;
– the local likelihood l_ij of a_j being a_i’s advisor.
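The time estimates can be illustrated with a small Python sketch. The Kulczynski measure of two authors is the average of the two conditional co-publication ratios, 0.5 * (n_ab / n_a + n_ab / n_b). Everything else here is an assumption made for illustration: counts are taken cumulatively up to each year, the ending time is the last year before the first drop, and the function names are hypothetical.

```python
def kulczynski(n_joint, n_a, n_b):
    """Kulczynski measure: mean of n_joint/n_a and n_joint/n_b."""
    if n_a == 0 or n_b == 0:
        return 0.0
    return 0.5 * (n_joint / n_a + n_joint / n_b)


def kulc_series(years_a, years_b, years_joint):
    """Kulczynski measure per year, using cumulative paper counts.

    Assumption of this sketch: counts include all papers up to year t,
    starting from the first joint paper.
    """
    first = min(years_joint)
    last = max(max(years_a), max(years_b))
    series = {}
    for t in range(first, last + 1):
        n_a = sum(y <= t for y in years_a)
        n_b = sum(y <= t for y in years_b)
        n_ab = sum(y <= t for y in years_joint)
        series[t] = kulczynski(n_ab, n_a, n_b)
    return series


def estimate_interval(years_a, years_b, years_joint):
    """Return (st, ed): first joint year, and last year before the first drop."""
    series = kulc_series(years_a, years_b, years_joint)
    years = sorted(series)
    st, ed = years[0], years[-1]
    for prev, cur in zip(years, years[1:]):
        if series[cur] < series[prev]:
            ed = prev  # the measure starts to decrease after this year
            break
    return st, ed


# Hypothetical advisor (a) and advisee (b) publication years.
advisor_years = [1995, 1996, 2000, 2001, 2002, 2003, 2004]
advisee_years = [2000, 2001, 2002, 2003, 2004]
joint = [2000, 2001, 2002]
print(estimate_interval(advisor_years, advisee_years, joint))
```

In this toy history the measure rises while every advisee paper is joint and falls once the advisee publishes without the advisor, so the estimated interval ends in the last fully joint year.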

Stage 2: Graph Factor Model
• For each node a_i, there are three variables to decide: y_i, st_i, and ed_i.
• Suppose we already have a local feature function g(y_i, st_i, ed_i) defined on the three variables of any given node.
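To see why the decision has to be made jointly across nodes, here is a small brute-force sketch in Python. It is not the inference procedure of the model. It simply enumerates every combination of per-node options, discards combinations that violate Assumption 1 (read here as: an advisor's own advised period must end before it starts advising), and keeps the combination with the largest product of local g values. The option layout, the node names, and the g values are all hypothetical.

```python
from itertools import product
from math import prod


def consistent(choice):
    """Check Assumption 1 for a full assignment.

    choice: node -> (advisor_or_None, st, ed, g_value)
    If x is advised by y starting at st_x, y must no longer be advised
    at that time, i.e. ed_y < st_x.
    """
    for _node, (advisor, st, _ed, _g) in choice.items():
        if advisor is None or advisor not in choice:
            continue
        adv_advisor, _adv_st, adv_ed, _adv_g = choice[advisor]
        if adv_advisor is not None and adv_ed >= st:
            return False
    return True


def best_assignment(options):
    """Exhaustively pick the consistent assignment with the largest score.

    options: node -> list of (advisor_or_None, st, ed, g_value)
    Only feasible for toy inputs: the search space grows exponentially.
    """
    nodes = sorted(options)
    best, best_score = None, -1.0
    for combo in product(*(options[n] for n in nodes)):
        choice = dict(zip(nodes, combo))
        if not consistent(choice):
            continue
        score = prod(opt[3] for opt in combo)
        if score > best_score:
            best, best_score = choice, score
    return best, best_score


# Hypothetical options: (advisor, st, ed, local g value).
options = {
    "a": [(None, None, None, 1.0)],
    "b": [("a", 2000, 2004, 0.9), (None, None, None, 0.1)],
    "c": [("b", 2003, 2006, 0.8), ("a", 2003, 2006, 0.3), (None, None, None, 0.1)],
}
best, score = best_assignment(options)
print({node: opt[0] for node, opt in best.items()}, score)
```

Taken alone, node c would pick b as its advisor (g = 0.8), but b is itself still being advised in 2003, so the joint optimum assigns a to both b and c. This kind of interaction between nodes is what a per-node maximum cannot capture.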

Experiment Results
• DBLP data: 654,628 authors, 1,076,946 publications, years provided.
Datasets   RULE  SVM  Ind. MAX  TPFG
TEST 1     69.9%  73.4%  75.2%  78.9%  80.2%  84.4%
TEST 2     69.8%  74.6%  79.0%  81.5%  84.3%
TEST 3     80.6%  86.7%  83.1%  90.9%  88.8%  91.3%
Annotations: heuristics; supervised learning; empirical / optimized parameter

Case Study
Advisee        | Top Ranked Advisor   | Time  | Note
David M. Blei  | 1. Michael I. Jordan | 01-03 | Ph.D advisor, 2004 grad
               | 2. John D. Lafferty  | 05-06 | Postdoc, 2006
Hong Cheng     | 1. Qiang Yang        | 02-03 | MS advisor, 2003
               | 2. Jiawei Han        | 04-08 | Ph.D advisor, 2008
Sergey Brin    | 1. Rajeev Motwani    | 97-98 | “Unofficial advisor”

Effect of rules - ROC curve
• Filtering rules in TPFG

THANK YOU