SIGIR 2012
SimFusion+: Extending SimFusion Towards Efficient Estimation on Large and Dynamic Networks
Weiren Yu (1), Xuemin Lin (1), Wenjie Zhang (1), Ying Zhang (1), Jiajin Le (2)
(1) University of New South Wales & NICTA, Australia
(2) Donghua University, China

Contents
1. Introduction
2. Problem Definition
3. Optimization Techniques
4. Experimental Results

1. Introduction
- Many applications require a measure of "similarity" between objects (similarity search):
  - Web search engines (google.com)
  - Recommender systems (amazon.com)
  - Citation of scientific papers (citeseer.com)
  - Graph clustering

SimFusion: A New Link-based Similarity Measure
- Structural similarity measures
  - PageRank [Page et al., 99]
  - SimRank [Jeh and Widom, KDD 02]
- SimFusion similarity
  - A new promising structural measure [Xi et al., SIGIR 05]
  - Extension of the Co-Citation and Coupling metrics
- Basic philosophy
  - Following the Reinforcement Assumption: the similarity between objects is reinforced by the similarity of their related objects.

SimFusion Overview
- Features
  - Uses a Unified Relationship Matrix (URM) to represent relationships among heterogeneous data
  - Defined recursively and computed iteratively
  - Applicable to any domain with object-to-object relationships
- Challenges
  - The URM may cause a trivial solution or a divergence issue in SimFusion.
  - SimFusion is rather costly to compute on large graphs.
    - Naive iteration: matrix-matrix multiplication
    - Requires O(Kn^3) time and O(n^2) space [Xi et al., SIGIR 05]
  - No incremental algorithms exist for edge updates.

Existing SimFusion: URM and USM
- Data space: a finite set of data objects (vertices) [formula not preserved]
- Data relation (edges), given an entire space [formula not preserved]
  - Intra-type relation: carrying information within one space
  - Inter-type relation: carrying information between spaces
- Unified Relationship Matrix (URM): [formula not preserved]
  - λ_{i,j} is the weighting factor between D_i and D_j
- Unified Similarity Matrix (USM): [formula not preserved]

Example: SimFusion Similarity on a Heterogeneous Domain
- Trivial solution: S = [1]_{n×n}
- High complexity: O(Kn^3) time, O(n^2) space
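
The trivial solution can be checked numerically. The sketch below assumes the USM update of the original model has the form S <- L S L^T with a row-normalized URM L (the slide's formula is not preserved here, so this form is an assumption), and uses a hypothetical 3-object weight matrix W. Because every row of L sums to 1, the all-ones matrix is reproduced exactly by the update.

```python
def matmul(X, Y):
    """Multiply two dense matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def transpose(X):
    """Return the transpose of a dense matrix."""
    return [list(col) for col in zip(*X)]

def row_normalize(W):
    """Scale every row to sum to 1 (rows must be non-zero)."""
    return [[w / sum(row) for w in row] for row in W]

def simfusion_step(L, S):
    """One assumed SimFusion update: S <- L S L^T."""
    return matmul(matmul(L, S), transpose(L))

# Hypothetical relationship weights among 3 objects (not taken from the slides).
W = [[0.0, 2.0, 1.0],
     [1.0, 0.0, 3.0],
     [4.0, 1.0, 0.0]]
L = row_normalize(W)

# Feed the all-ones matrix through one update and measure how far it moves.
ones = [[1.0] * 3 for _ in range(3)]
S_next = simfusion_step(L, ones)
max_dev = max(abs(S_next[i][j] - 1.0) for i in range(3) for j in range(3))
print(max_dev)
```

Any row-stochastic L behaves the same way, which is why the fixed point S = [1]_{n×n} carries no information about the graph.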

Contributions
- Revising the existing SimFusion model, avoiding
  - non-semantic convergence
  - the divergence issue
- Optimizing the computation of SimFusion+
  - O(Km) pre-computation time, plus O(1) time and O(n) space
  - Better accuracy guarantee
- Incremental computation on edge updates
  - O(δn) time and O(n) space for handling δ edge updates

Revised SimFusion
- Motivation: two issues of the existing SimFusion model
  - Trivial solution on a heterogeneous domain
  - Divergent solution on a homogeneous domain
- Root cause: row normalization of the URM

From URM to UAM
- Unified Adjacency Matrix (UAM)
- Example

Revised SimFusion+
- Basic intuition
  - Replace the URM with the UAM so that "row normalization" is postponed, while preserving the reinforcement assumption of the original SimFusion.
- Revised SimFusion+ model, shown against the original SimFusion model [formulas not preserved]
  - Annotation: squeezes the similarity scores in S into [0, 1].

Optimizing SimFusion+ Computation
- Conventional iterative paradigm
  - Matrix-matrix multiplication, requiring O(kn^3) time and O(n^2) space
- Our approach: convert SimFusion+ computation into finding the dominant eigenvector of the UAM A
  - Pre-compute σmax(A) only once, and cache it for later reuse
  - Matrix-vector multiplication, requiring O(km) time and O(n) space
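
The matrix-vector approach can be sketched with plain power iteration. Everything concrete here is an assumption for illustration: the 4-vertex graph is hypothetical, the helper names are made up, and reading a score as the product ξ_i ξ_j of two entries of the unit dominant eigenvector ξ is one way to satisfy a fixed point of the form S = A S A^T / λ^2, not a formula recovered from the slides.

```python
import math

def matvec(adj, x):
    """Sparse product A x, where adj[i] lists (j, weight) for non-zero A[i][j]."""
    return [sum(w * x[j] for j, w in row) for row in adj]

def power_iteration(adj, iters=200):
    """Approximate the dominant eigenpair of A using matrix-vector products only."""
    n = len(adj)
    x = [1.0 / math.sqrt(n)] * n
    lam = 0.0
    for _ in range(iters):
        y = matvec(adj, x)                       # O(m) work per iteration
        lam = math.sqrt(sum(v * v for v in y))   # ||A x|| estimates the eigenvalue
        x = [v / lam for v in y]                 # keep x at unit length
    return lam, x

def similarity(xi, i, j):
    """Assumed score for the pair (i, j): one multiplication on the cached vector."""
    return xi[i] * xi[j]

# Hypothetical undirected graph: a triangle 0-1-2 plus the edge 2-3.
adj = [
    [(1, 1.0), (2, 1.0)],
    [(0, 1.0), (2, 1.0)],
    [(0, 1.0), (1, 1.0), (3, 1.0)],
    [(2, 1.0)],
]
lam, xi = power_iteration(adj)
residual = max(abs(a - lam * b) for a, b in zip(matvec(adj, xi), xi))
print(lam, residual, similarity(xi, 0, 1))
```

Only the length-n vector is stored, so memory stays at O(n), and each iteration touches every edge once.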

Example [worked matrices not preserved]
- Conventional iteration
- Our approach

Key Observation
- Kronecker product "⊗" [example not preserved]
- Vec operator [example not preserved]

Key Observation
- Two important properties: P1 and P2 [formulas not preserved]
- Our main idea: (1) power iteration, (2) [not preserved]
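
Since the slide's P1 and P2 are not preserved, the sketch below checks two standard Kronecker-product facts that fit this argument; whether they match P1 and P2 exactly is an assumption. The first is vec(A S A^T) = (A ⊗ A) vec(S) with column-stacking vec. The second is that if A ξ = λ ξ, then (A ⊗ A)(ξ ⊗ ξ) = λ^2 (ξ ⊗ ξ). All matrices are small hypothetical integer examples so the comparisons are exact.

```python
def matmul(X, Y):
    """Multiply two dense matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def transpose(X):
    """Return the transpose of a dense matrix."""
    return [list(col) for col in zip(*X)]

def kron(A, B):
    """Kronecker product: block (i, j) of the result is A[i][j] * B."""
    return [[a * b for a in row_a for b in row_b] for row_a in A for row_b in B]

def vec(X):
    """Stack the columns of X into one flat list (column-major order)."""
    return [X[i][j] for j in range(len(X[0])) for i in range(len(X))]

def matvec(M, x):
    """Dense matrix-vector product."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

# Fact 1: vec(A S A^T) == (A kron A) vec(S)
A = [[1, 2], [3, 4]]
S = [[5, 6], [7, 8]]
lhs = vec(matmul(matmul(A, S), transpose(A)))
rhs = matvec(kron(A, A), vec(S))

# Fact 2: an eigenpair (lam, xi) of B gives the eigenpair (lam^2, xi kron xi) of B kron B
B = [[2, 1], [1, 2]]
xi, lam = [1, 1], 3                      # B xi = 3 xi
xi_kron = [a * b for a in xi for b in xi]
image = matvec(kron(B, B), xi_kron)

print(lhs == rhs, image == [lam * lam * v for v in xi_kron])
```

Together these facts turn the n×n matrix iteration into an eigenvector problem on A alone, which is what makes a power-iteration approach possible.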

Accuracy Guarantee
- Conventional iterations: no accuracy guarantee
  - Question: ||S(k+1) - S|| ≤ ?
- Our method: utilize Arnoldi decomposition to build an order-k orthogonal subspace for the UAM A.
  - Because T_k is small and almost upper-triangular, computing σmax(T_k) is less costly than computing σmax(A).

Accuracy Guarantee
- Arnoldi decomposition: [formula not preserved]
- k-th iterative similarity: [formula not preserved]
- Estimate error: [formula not preserved]
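
For orientation, a textbook Arnoldi iteration can be sketched as follows. This is the generic procedure with modified Gram-Schmidt, not the authors' exact variant, and it does not reproduce the slide's error estimate. The names arnoldi and arnoldi_residual and the 4-vertex graph are hypothetical. The leading k×k block of H plays the role of the small, almost upper-triangular (upper-Hessenberg) matrix T_k.

```python
import math

def matvec(A, x):
    """Dense matrix-vector product."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def arnoldi(A, v0, k):
    """Run up to k Arnoldi steps; return basis vectors V and Hessenberg matrix H.

    For every completed step j: A V[j] = sum over i <= j+1 of H[i][j] * V[i].
    """
    norm0 = math.sqrt(dot(v0, v0))
    V = [[x / norm0 for x in v0]]
    H = [[0.0] * k for _ in range(k + 1)]
    for j in range(k):
        w = matvec(A, V[j])
        for i in range(j + 1):                 # orthogonalize against earlier vectors
            H[i][j] = dot(V[i], w)
            w = [wi - H[i][j] * vi for wi, vi in zip(w, V[i])]
        H[j + 1][j] = math.sqrt(dot(w, w))
        if H[j + 1][j] < 1e-12:                # invariant subspace reached; stop early
            break
        V.append([wi / H[j + 1][j] for wi in w])
    return V, H

def arnoldi_residual(A, V, H, k):
    """Largest entry-wise violation of the Arnoldi relation over k completed steps."""
    worst = 0.0
    for j in range(k):
        Av = matvec(A, V[j])
        for r in range(len(Av)):
            approx = sum(H[i][j] * V[i][r] for i in range(j + 2))
            worst = max(worst, abs(Av[r] - approx))
    return worst

# Hypothetical adjacency matrix: a triangle 0-1-2 plus the edge 2-3.
A = [[0.0, 1.0, 1.0, 0.0],
     [1.0, 0.0, 1.0, 0.0],
     [1.0, 1.0, 0.0, 1.0],
     [0.0, 0.0, 1.0, 0.0]]
k = 2
V, H = arnoldi(A, [1.0, 0.0, 0.0, 0.0], k)
print(len(V), arnoldi_residual(A, V, H, k))
```

An eigen-solver applied to the k×k block works on a matrix far smaller than A, which is the cost saving the slide refers to.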

Example [worked matrices not preserved]
- Arnoldi decomposition: steps (1), (2), (3)

Edge Update on Dynamic Graphs
- Incremental UAM
  - Given an old G = (D, R) and a new G' = (D, R'), the incremental UAM is a list of edge updates [formula not preserved].
- Main idea
  - Reuse the incremental UAM and the eigen-pair (α_p, ξ_p) of the old A to compute the updated SimFusion+ similarities.
  - The incremental UAM is a sparse matrix when the number δ of edge updates is small.
- Incrementally computing SimFusion+: O(δn) time, O(n) space
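
The sparsity argument can be illustrated with a small sketch. It is not the authors' update formula (which is not preserved here); it only shows, under the assumption of unweighted 0/1 edges and hypothetical helper names, that the difference between the new and old UAM has one entry per edge update, so a cached product A x can be corrected in O(δ) operations instead of being recomputed.

```python
def incremental_uam(old_edges, new_edges):
    """Return the sparse difference A' - A as {(i, j): change} for 0/1 edges."""
    delta = {}
    for e in new_edges - old_edges:
        delta[e] = 1.0     # inserted edge
    for e in old_edges - new_edges:
        delta[e] = -1.0    # removed edge
    return delta

def matvec_edges(edges, x):
    """Compute A x from scratch, where edge (i, j) means A[i][j] = 1."""
    y = [0.0] * len(x)
    for i, j in edges:
        y[i] += x[j]
    return y

def apply_delta(y_old, delta, x):
    """Compute A' x from a cached y_old = A x, touching only the updated entries."""
    y = list(y_old)
    for (i, j), change in delta.items():
        y[i] += change * x[j]
    return y

# Hypothetical graph; the update removes the edge pair (0, 1) and (1, 0).
old_edges = {(0, 1), (1, 0), (1, 2), (2, 1), (2, 0), (0, 2)}
new_edges = old_edges - {(0, 1), (1, 0)}
x = [1.0, 2.0, 3.0]

delta = incremental_uam(old_edges, new_edges)
y_incremental = apply_delta(matvec_edges(old_edges, x), delta, x)
y_direct = matvec_edges(new_edges, x)
print(len(delta), y_incremental, y_direct)
```

When δ grows toward the number of edges, the correction costs as much as a fresh product, which matches the intuition that incremental updates pay off only for small δ.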

Example
- Suppose edges (P1, P2) and (P2, P1) are removed.

Experimental Setting
- Datasets
  - Synthetic data (RAND 0.5M-3.5M)
  - Real data (DBLP, WEBKB)
- Compared algorithms
  - SimFusion+ and IncSimFusion+;
  - SF, a SimFusion algorithm via matrix iteration [Xi et al., SIGIR 05];
  - CSF, a variant of SF using the PageRank distribution [Cai et al., SIGIR 10];
  - SR, a SimRank algorithm via partial sums [Lizorkin et al., VLDBJ 10].

Experiment (1): Accuracy
- On DBLP and WEBKB
- SF+ accuracy is consistently stable across the datasets.
- SF hardly yields sensible similarities, as all its similarities asymptotically approach the same value as K grows.

Experiment (2): CPU Time and Space
- Panels: on DBLP, on WEBKB
- SF+ outperforms the other approaches, due to the use of σmax(T_k).

Experiment (3): Edge Updates
- Varying δ
- IncSF+ outperformed SF+ when δ is small, because a small δ preserves the sparseness of the incremental UAM.
- For larger δ, IncSF+ is less effective.

Experiment (4): Effects of [parameter symbol not preserved]
- A small choice of this parameter imposes more iterations on computing T_k and v_k, and hence increases the estimation costs.

Conclusions
- A revision of SimFusion, called SimFusion+, that prevents the trivial solution and the divergence issue of the original model.
- Efficient techniques to improve the time and space of SimFusion+ with accuracy guarantees.
- An incremental algorithm to compute SimFusion+ on dynamic graphs when edges are updated.

Future Work
- Devise vertex-updating methods for incrementally computing SimFusion+.
- Parallelize SimFusion+ computation on GPUs.