Link Prediction in Social Networks Laks V S

The Problem �Examples: Co authorship graphs, scientists serving on common PC/editorial board, business leaders

Motivation & Significance �Model for Network evolution. �Predict likely interactions, not explicitly observed, based

Main Contributions � Features intrinsic to n/w can offer a good measure of how

Link Prediction Problem Formalized �G[t 0, t 0’] = the multi graph of social

Example John Mary Hui Asif Ram John Nita Mary Asif Ram Hui • Say

Some simple notions �Triadic closure [Easley & Kleinberg 2010]: if x and y have

Various predictors Γ(u) = neighbors of u in G[t 0, t 0’]. �score(u, v)

More predictors ◦ Adamic/Adar: ∑zЄΓ(u)пΓ(v) 1/log|Γ(z)|. ◦ Preferential attachment: |Γ(u)|x|Γ(v)|. �Based on ensemble of

The Katz Score �Katz: ∑i βi|pathsi(u, v)|, where �Dealing with multi-graphs: 0<β<1. ◦ Unweighted

Hitting & Commute Time �Hitting & Commute Time: Perform a random walk, starting at

Rooted Page. Rank �Rooted PR: πv in the following RW: At any step, ◦

Sim. Rank � Sim. Rank: SR(u, u) = 1. When u != v, SR(u,

Predictors Summary �Neighborhood based (easy to compute) and Path based (more accurate). �Output sorted

Meta measures �score*(u, v) = |Γ(v) п Top-K(u)|, for some K. //how many of

Conclusions �While many predictors significantly outperform random, their precision can use considerable improvement (e.

Recommended Papers ◦ Link Prediction: & SN evolution ◦ Al Hasan et al. Link

Recommended Papers � � � � � Computational Aspects: Kurt C. Foster, Stephen Q.

Recommended Papers � Surveys & Some Theory: � Lise Getoor and Christopher P. Diehl.

Slides: 19

Download presentation

Link Prediction in Social Networks Laks V. S. Lakshmanan (based on Kleinberg and Liben Nowell. The Link Prediction Problem for Social Networks. CIKM 2003. )

The Problem �Examples: Co authorship graphs, scientists serving on common PC/editorial board, business leaders working on common board, etc. �Can current state of network be used to predict future links? �Note: factors extraneous to n/w (need to be) ignored (e. g. , they met on a beach).

Motivation & Significance �Model for Network evolution. �Predict likely interactions, not explicitly observed, based on observed links – useful for terrorist network monitoring. �Link prediction is one instance of Social Network Analysis. �Application for “Friend” suggestion in online SN. Notice, this takes on a flavor of link recommendation.

Main Contributions � Features intrinsic to n/w can offer a good measure of how likely a future link is. � Can significantly outperform chance. � Need more sophisticated measures than direct ones (e. g. , geodesic distances). � Note: Extra network features can significantly improve prediction accuracy: e. g. , keywords describing interests of each scientist, or keywords extracted from their papers/abstracts/titles. Not considered here. ◦ See Al Hasan et al. SDM workshop 2006. http: //www. cs. rpi. edu/~zaki/Paper. Dir/LINK 06. pdf ◦ Ben Taskar et al. Link prediction in relational data. In Proceedings of Neural Information Processing Systems, 2004. ◦ A. Popescul and L. Ungar. Statistical relational learning for link prediction. In Workshop on Learning Statistical Models from Relational Data at IJCAI, 2003.

Link Prediction Problem Formalized �G[t 0, t 0’] = the multi graph of social interactions between actors (users) during the training interval [t 0, t 0’]. ◦ Separate edge for each interaction (paper, phone call, email, . . . ) with timestamp. �Predict edges in test interval [t 1, t 1’], where t 1>t 0’. �E_old = edges present in G[t 0, t 0’]. �E_new = edges in G[t 1, t 1’] but not in G[t 0, t 0’]. �Refinement: Restrict to those nodes adjacent to >=κ edges in G[t 0, t 0’] and to >=κ’ edges in G[t 1, t 1’]. Call them the core nodes.

Example John Mary Hui Asif Ram John Nita Mary Asif Ram Hui • Say κ = κ’ = 2. • Only evaluate how well we do on {John, Ram, Asif, Mary}. • E*_new = {(J, M), (R, M)}, when restricted to the “core” nodes above. • Evaluation: among the topmost likely edges predicted, how well we do on precision and recall. • Precision = analog of soundness. • Recall = analog of completeness.

Some simple notions �Triadic closure [Easley & Kleinberg 2010]: if x and y have a common friend, then the likelihood of x, y being friends increases. �Clustering coefficient of node x = fraction of its neighbors that are connected by a link. �Bearman and Moody’s study: teenage girls who have a low clustering coefficient in their network of friends are significantly more likely to contemplate suicide than those whose clustering

Various predictors Γ(u) = neighbors of u in G[t 0, t 0’]. �score(u, v) = score that indicates likelihood of (u, v) being in E*_new. �Neighborhood-based: �Notation: ◦ Shortest distance: (negated) length of shortest path (geodesic). ◦ Common neighbors: Γ(u) п Γ(v). ◦ Jaccard coefficient: |Γ(u)пΓ(v)|/|Γ(u)υΓ(v)|.

More predictors ◦ Adamic/Adar: ∑zЄΓ(u)пΓ(v) 1/log|Γ(z)|. ◦ Preferential attachment: |Γ(u)|x|Γ(v)|. �Based on ensemble of paths: ◦ Katz, Hitting/Commute Time, Rooted Page. Rank, Sim. Rank.

The Katz Score �Katz: ∑i βi|pathsi(u, v)|, where �Dealing with multi-graphs: 0<β<1. ◦ Unweighted boolean. ◦ Weighted count #interactions b/w neighbors. �Shorter paths more affinity. �More paths more affinity. �K = βA + β 2 A 2 +. . . = I – (I – βA)-1, where A = adjacency matrix of graph. �Notice computational challenges. �Fact: For convergence, need β < [1/(largest eigenvalue of A)].

Hitting & Commute Time �Hitting & Commute Time: Perform a random walk, starting at u, follow neighbor of current node w. p. 1/deg(curr). ◦ Adapted in the obvious way if edges are weighted. �Hu, v=expected time for walk to reach v. ◦ Not symmetric. �Cu, v=Hu, v+Hv, u. �πx=stationary distribution weight �Stationary normed versions: ◦ Hu, vπv+Hv, uπu. �Score, of x. in all cases, negative of measure. (Why? )

Rooted Page. Rank �Rooted PR: πv in the following RW: At any step, ◦ Jump to u with probability p. ◦ Move to one of the neighbors of current node at random with probability 1 p. �Avoids some pitfalls of HT/CT by minimizing dependence on nodes far away from both u and v. �Computational challenge (of RPR, HT, CT) – similar to Page. Rank.

Sim. Rank � Sim. Rank: SR(u, u) = 1. When u != v, SR(u, v) = γ. ∑wЄΓ(u)∑xЄΓ(v)SR(w, x)/|Γ(u)||Γ(v)|, where 0<γ<1. SR(u, v) = 0 when Γ(u) u Γ(v) = . RW interpretation: E[γi], where i=earliest time at which RWs started at u and v meet for the first time. Qn: Can the RW interpretation be leveraged to exploit Page. Rank-like techniques to compute Sim. Rank?

Predictors Summary �Neighborhood based (easy to compute) and Path based (more accurate). �Output sorted in descending score order of pages. �Consider top n pages among core nodes for evaluation. �A performance trick, which pays off with quality as well: use low rank approximation Ak of adjacency matrix A. �Works very well since usually A is extremely sparse.

Meta measures �score*(u, v) = |Γ(v) п Top-K(u)|, for some K. //how many of your neighbors are my nearest score-neighbors? // �score*(u, v) = ∑zЄΓ(v)п. Top-K(u)score(u, z). //what is the total strength of my relationship with my nearest scoreneighbors who are your neighbors? // unseen bigrams.

Conclusions �While many predictors significantly outperform random, their precision can use considerable improvement (e. g. , < 20%). �Need for better prediction algorithms: ◦ Perhaps more recent training data is more important? ◦ Combining measures (e. g. , using Bayesian classifier). ◦ Leveraging node properties. �Need fast algorithms for approximating measures.

Recommended Papers ◦ Link Prediction: & SN evolution ◦ Al Hasan et al. Link prediction using supervised learning. SDM 2006 Workshop. ◦ Ben Taskar et al. Link prediction in relational data. In Proceedings of Neural Information Processing Systems, 2004. ◦ A. Popescul and L. Ungar. Statistical relational learning for link prediction. In Workshop on Learning Statistical Models from Relational Data at IJCAI, 2003. ◦ Jure Leskovec, Lars Backstrom, Ravi Kumar, and Andrew Tomkins. Microscopic evolution of social networks. In Proceedings of the 14 th ACM International Conference on Knowledge Discovery and Data Mining (KDD'08). ◦ Jure Leskovec, Jon M. Kleinberg, and Christos Faloutsos. Graphs over time: densification laws, shrinking diameters and possible explanations. In Proceedings of the 11 th ACM International Conference on Knowledge Discovery and Data Mining (KDD'05).

Recommended Papers � � � � � Computational Aspects: Kurt C. Foster, Stephen Q. Muth, John J. Potterat, and Richard B. Rothenberg. A faster katz status score algorithm. Computational & Mathematical Organization Theory, 7(4): 275 285, 2001. Purnamrita Sarkar and Andrew W. Moore. A tractable approach to finding closest truncated commute time neighbors in large graphs. In Proceedings of the 23 rd Conference on Uncertainty in Artificial Intelligence (UAI'07). Purnamrita Sarkar, Andrew W. Moore, and Amit Prakash. Fast incremental proximity search in large graphs. In Proceedings of the 25 th International Conference on Machine Learning (ICML'08). Glen Jeh and Jennifer Widom. Simrank: a measure of structural context similarity. In Proceedings of the Eighth ACM International Conference on Knowledge Discovery and Data Mining (KDD'02). Dmitry Lizorkin et al. Accuracy estimate and optimization techniques for Sim. Rank computation. VLDB Journal, 2008. Daniel Fogaras, et al. Towards scaling fully personalized pagerank: Algorithms, lower bounds, and experiments. Internet Mathematics, 2(3), 2005. Peixiang Zhao, Jiawei Han, and Yizhou Sun. P Rank: a comprehensive structural similarity measure over information networks. CIKM 2009. P. Zhao, J. Han, and Y. Sun. P rank: a comprehensive structural similarity measure over information networks. In. CIKM, pages 553{562, 2009. Pei Lee, Laks V. S. Lakshmanan and Jeffrey Xu Yu. On Top k Structural Similarity Search. ICDE 2012.

Recommended Papers � Surveys & Some Theory: � Lise Getoor and Christopher P. Diehl. Link Mining: A Survey. SIGKDD Explorations 2005. � Linyuan Lü and Tao Zhou. Link prediction in complex networks: A survey. Physica A: Statistical Mechanics and its Applications. Volume 390, Issue 6, 15 March 2011, Pages 1150– 1170. � Mohammad Al Hasan and Mohammed J. Zaki. A Survey of Link Prediction in Social Networks. Social Network Data Analytics 2011, pp. 243 275. � Purnamrita Sarkar, Deepayan Chakrabarti, and Andrew W. Moore. 2011. Theoretical justification of popular link prediction heuristics. In Proceedings of the Twenty. Second international joint conference on Artificial Intelligence - Volume Three (IJCAI'11), Toby Walsh (Ed. ), Volume Three. AAAI Press 2722 2727.