Discovering Frequent Subgraphs over Uncertain Graph Databases under Probabilistic Semantics
Zhaonian Zou, Hong Gao, Jianzhong Li
Data and Knowledge Engineering Research Center (DKERC), Harbin Institute of Technology (HIT), China
July 26, 2010

Outline
• Overview
• Model of uncertain graph data
• Problem statement
• Algorithm
• Experiments
• Summary and future work

Graph mining is very important
• Chemical compounds
• Biological networks
• Social networks
• Traffic networks
• Internet topology
• VLSI

Uncertainties are inherent in graph data
• Uncertainties are caused by data errors, incompleteness, imprecision, noise, etc.
• There are a large number of uncertain graphs in practice.
  – Protein-Protein Interaction (PPI) networks: each edge carries the probability of the PPI existing in practice.
  – Topologies of wireless sensor networks (WSNs): each edge carries the probability of the wireless link working normally.
[Figure: an example PPI network (proteins FET3, RAD59, TIF34, NTG1, SMT3, RPC40) and an example WSN topology, both annotated with edge probabilities]

Challenges in mining uncertain graph data
• Different data models
  – Graphs + uncertainties
  – Semantics
• Existing graph mining problems were defined on certain graph data and do not make sense on uncertain graph data.
• The computational complexity of uncertain graph mining problems is generally even higher than that of their counterparts on certain graph data.

Recent work on mining uncertain graph data
• Mining frequent subgraph patterns under expected semantics [CIKM’09, TKDE’10]
• Mining top-k maximal cliques [ICDE’10]

Uncertain graphs
[Figure: an uncertain graph on vertices v1, v2, v3, with an existence probability attached to each vertex and each edge]
• The probability of v1 existing in practice is 0.9.
• The conditional probability of edge (v1, v2) existing in practice, given that v1 and v2 exist, is 0.9.
• By tossing a biased coin for each vertex, we obtain a subset V’ of vertices. Then, by tossing a biased coin for each edge between the vertices just selected, we get a subset E’ of edges. Thus, a certain graph (V’, E’) is obtained.
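The coin-tossing procedure can be sketched in a few lines of Python. This is only an illustration: the function name `sample_implicated_graph` and the dictionary representation are assumptions, and of the probabilities below only the 0.9 for v1 and the 0.9 for edge (v1, v2) come from the slide; the others are hypothetical.

```python
import random

def sample_implicated_graph(vertex_prob, edge_prob, rng):
    """Toss one biased coin per vertex, then one per edge whose endpoints both survived."""
    vertices = {v for v, p in vertex_prob.items() if rng.random() < p}
    edges = {
        (u, v)
        for (u, v), p in edge_prob.items()
        # An edge coin is only tossed when both endpoints were selected.
        if u in vertices and v in vertices and rng.random() < p
    }
    return vertices, edges

# Hypothetical example; only v1 = 0.9 and (v1, v2) = 0.9 are taken from the slide.
vertex_prob = {"v1": 0.9, "v2": 0.7, "v3": 0.8}
edge_prob = {("v1", "v2"): 0.9, ("v1", "v3"): 0.4, ("v2", "v3"): 0.7}

rng = random.Random(0)  # fixed seed so repeated runs give the same sample
vertices, edges = sample_implicated_graph(vertex_prob, edge_prob, rng)
print(sorted(vertices), sorted(edges))
```

Each call yields one certain graph (V’, E’); repeating the call draws independent samples from the distribution the uncertain graph represents.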

Uncertain graphs
[Figure: the uncertain graph on v1, v2, v3 next to one of its implicated graphs, which contains the vertices v2 and v3]

Uncertain graphs
[Figure: the uncertain graph on v1, v2, v3 surrounded by its implicated graphs]
An uncertain graph represents a probability distribution over all its implicated graphs.
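To make the "probability distribution over implicated graphs" concrete, the following sketch enumerates every implicated graph of a small uncertain graph and checks that the probabilities sum to 1. The helper name `implicated_graphs` and the example probabilities are assumptions for illustration (only 0.9 for v1 and for edge (v1, v2) appear on the slides), and independence of the coin tosses is assumed as in the coin-tossing description.

```python
from itertools import product

def implicated_graphs(vertex_prob, edge_prob):
    """Yield (vertices, edges, probability) for every implicated graph."""
    names = list(vertex_prob)
    for keep in product([False, True], repeat=len(names)):
        vertices = frozenset(v for v, k in zip(names, keep) if k)
        # Probability of selecting exactly this vertex subset.
        p_vertices = 1.0
        for v, k in zip(names, keep):
            p_vertices *= vertex_prob[v] if k else 1.0 - vertex_prob[v]
        # Only edges between selected vertices get a coin toss.
        candidates = [e for e in edge_prob if e[0] in vertices and e[1] in vertices]
        for present in product([False, True], repeat=len(candidates)):
            p = p_vertices
            for e, k in zip(candidates, present):
                p *= edge_prob[e] if k else 1.0 - edge_prob[e]
            edges = frozenset(e for e, k in zip(candidates, present) if k)
            yield vertices, edges, p

# Hypothetical probabilities for a three-vertex uncertain graph.
vertex_prob = {"v1": 0.9, "v2": 0.7, "v3": 0.8}
edge_prob = {("v1", "v2"): 0.9, ("v1", "v3"): 0.4, ("v2", "v3"): 0.7}

graphs = list(implicated_graphs(vertex_prob, edge_prob))
total = sum(p for _, _, p in graphs)
print(len(graphs), round(total, 12))
```

The number of implicated graphs grows exponentially with the size of the uncertain graph, so this enumeration is only feasible for toy examples.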

Uncertain graph databases
• Uncertain graph database: D = {G1, G2, …, Gn}
• Implicated graph database: D’ = {G’1, G’2, …, G’m}
• There is an injection from D’ to D such that each G’i is an implicated graph of the uncertain graph it is mapped to.
• An uncertain graph database represents a probability distribution over all its implicated graph databases.

Frequent subgraph pattern (FSP) mining problem
• The support of a subgraph pattern S in a certain graph database D is the proportion of certain graphs in D that contain S, denoted by sup_D(S).
• A subgraph pattern S is frequent if the support of S is no less than a user-specified threshold 0 < minsup < 1.
• Input: a certain graph database D and a threshold 0 < minsup < 1
• Output: all subgraph patterns in D with support no less than minsup
• The concept of support does not make sense on uncertain graph data, since an uncertain graph contains a subgraph pattern only with some probability, not with certainty.

FSP mining problem on uncertain graph databases under probabilistic semantics
• Let Imp(D) denote the set of all implicated graph databases of an uncertain graph database D.
• The φ-frequent probability of a subgraph pattern S in an uncertain graph database D is the probability of S having support no less than φ across all implicated graph databases of D, denoted Pr_{D,φ}(S).
• Input: an uncertain graph database D, a support threshold 0 < φ < 1 and a confidence threshold 0 < τ < 1
• Output: all subgraph patterns in D with φ-frequent probability no less than τ
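Written as a formula, the definition sums the probabilities of those implicated graph databases in which S reaches the support threshold. The notation Pr(D') for the probability of an implicated graph database D' under D is an assumption made here for illustration; the block uses `\substack` from amsmath.

```latex
\Pr\nolimits_{D,\varphi}(S)
  \;=\;
  \sum_{\substack{D' \in \mathrm{Imp}(D) \\ \mathrm{sup}_{D'}(S) \,\ge\, \varphi}}
  \Pr(D')
```

S is then reported when this quantity is no less than τ.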

How hard is the FSP mining problem?
• #P is a complexity class for counting problems such as DNF Counting, Hamiltonian Circuit Counting, Perfect Matching Counting, etc.
• It is #P-hard to count the number of frequent subgraph patterns in an uncertain graph database.
  – Polynomial-time reducible from the problem of counting the number of frequent subgraphs in a certain graph database [Yang 04]
• It is #P-hard to compute the φ-frequent probability of a subgraph pattern in an uncertain graph database.
  – Polynomial-time reducible from the Monotone k-DNF Counting problem [Valiant 79]
• None of the existing algorithms for mining FSPs on certain graph databases can solve this problem.
• Approximate mining is an important approach when small errors are tolerable.

Goal of approximate mining
[Figure: subgraph patterns plotted against their φ-frequent probability on an axis from 0 to 1.0]
• It is intractable to exactly compute all frequent subgraph patterns.
• The φ-frequent probabilities of the output patterns must be no less than τ – ε.

Overview of mining algorithm
[Figure: two example uncertain graphs G1 and G2 with vertex and edge labels and edge probabilities]
• Organize all patterns into a search tree according to their DFS codes [Yan & Han ICDM’01].
• If S is subgraph isomorphic to S’, then Pr_{D,φ}(S) ≥ Pr_{D,φ}(S’).
• The key to the algorithm is quickly determining whether the φ-frequent probability of a subgraph pattern must be no less than τ – ε and is probably no less than τ.

Method for verifying subgraph patterns
• Step 1: Approximate the φ-frequent probability of S by an interval [l, u] having width at most ε.
• Step 2: Test the following conditions to determine whether to output S or discard it.
[Figure: the Output and Discard conditions drawn on the φ-frequent probability axis from 0 to 1]
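The exact Output and Discard conditions are not legible in the text, so the following is only a sketch under an assumed rule: output S when u ≥ τ and discard it otherwise. Under that assumption, whenever the interval [l, u] has width at most ε and contains the true φ-frequent probability, every output pattern has probability at least τ – ε and no pattern with probability at least τ is discarded. The function name `verify_pattern` is hypothetical.

```python
def verify_pattern(lower, upper, tau):
    """Decide 'output' or 'discard' from an approximate interval [lower, upper].

    Assumed rule (not taken from the slides): output when upper >= tau.
    """
    if lower > upper:
        raise ValueError("interval bounds are out of order")
    return "output" if upper >= tau else "discard"

# Hypothetical intervals of width 0.05 checked against tau = 0.9.
print(verify_pattern(0.86, 0.91, tau=0.9))
print(verify_pattern(0.80, 0.85, tau=0.9))
```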

Dynamic programming for exactly computing φ-frequent probabilities
• Let T[0..n, 0..n, 0..n] be a three-dimensional table.
• T[i, j, k] stores the probability that an implicated graph database of {G1, G2, …, Gk} contains i + j certain graphs and that S is subgraph isomorphic to i certain graphs in it.
• Recursive equation (general case)
  – The equation is expressed in terms of the probability of S occurring in an uncertain graph G.
• We obtain Pr_{D,φ}(S) by summing up all T[i, j, n] such that i/(i + j) ≥ φ.
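The recursive equation itself is not legible in the text, so the sketch below uses an assumed recursion that is consistent with the definition of T[i, j, k]: each uncertain graph Gk independently contributes a certain graph containing S (probability `p_occ[k]`), a certain graph not containing S (probability `p_miss[k]`), or no graph at all (the remaining probability). The names `p_occ`, `p_miss` and `phi_frequent_probability` are hypothetical, and skipping the empty database (i + j = 0), where the support is undefined, is also an assumption.

```python
def phi_frequent_probability(p_occ, p_miss, phi):
    """Sum T[i, j, n] over all i, j with i / (i + j) >= phi (assumed recursion)."""
    n = len(p_occ)
    # table[i][j] holds T[i, j, k] for the current k; the k axis is rolled.
    table = [[0.0] * (n + 1) for _ in range(n + 1)]
    table[0][0] = 1.0  # T[0, 0, 0] = 1: the empty prefix yields the empty database
    for k in range(n):
        p_none = 1.0 - p_occ[k] - p_miss[k]
        nxt = [[0.0] * (n + 1) for _ in range(n + 1)]
        for i in range(k + 1):
            for j in range(k + 1 - i):
                t = table[i][j]
                if t == 0.0:
                    continue
                nxt[i + 1][j] += t * p_occ[k]   # graph k+1 contains S
                nxt[i][j + 1] += t * p_miss[k]  # graph k+1 exists without S
                nxt[i][j] += t * p_none         # graph k+1 contributes nothing
        table = nxt
    return sum(
        table[i][j]
        for i in range(n + 1)
        for j in range(n + 1 - i)
        if i + j > 0 and i >= phi * (i + j)
    )

# Hypothetical input: two uncertain graphs, each containing S with probability 0.5.
print(phi_frequent_probability([0.5, 0.5], [0.5, 0.5], phi=0.5))
```

The table has O(n^2) entries per value of k and each entry is updated in constant time, so the sketch runs in O(n^3) time.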

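Dynamic programming for exactly computing φ-frequent probabilities
• Let n = 3 and φ = 0.5.
• Give n, φ and the per-graph probabilities used by the recursion as input.
• One of these inputs, the probability of S occurring in an uncertain graph, is #P-hard to compute [Zou et al. TKDE’10]. Substitute it with an estimated value.
[Figure: the table T for k = 0, 1, 2, 3, with rows i = 0..3 and columns j = 0..3]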

Making dynamic programming practical
• A randomized algorithm has been proposed in [Zou et al. TKDE’10] to compute, in polynomial time and for any 0 < ε, δ < 1, an estimated value of the probability of S occurring in an uncertain graph that is within error ε with probability at least 1 – δ.
• To guarantee that the output of the dynamic programming is within error ε with probability at least 1 – δ, how accurate should each estimate be?
  – Within error ε/(2n).
  – Succeed with probability at least (1 – δ)^{1/n}.
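A short calculation shows how the per-graph budget adds up. It assumes that the n estimates succeed independently and that their errors accumulate at most additively in the dynamic programming output; both are assumptions made for this illustration, and the function name `per_graph_budget` is hypothetical.

```python
def per_graph_budget(eps, delta, n):
    """Return the per-graph error bound and success probability for n estimates."""
    per_error = eps / (2 * n)
    per_success = (1 - delta) ** (1 / n)
    return per_error, per_success

# Hypothetical setting: six uncertain graphs with eps = delta = 0.05.
eps, delta, n = 0.05, 0.05, 6
per_error, per_success = per_graph_budget(eps, delta, n)

# n errors of eps/(2n) sum to eps/2, the half-width of the interval [X - eps/2, X + eps/2].
total_error = n * per_error
# n independent successes, each with probability (1 - delta)^(1/n), all hold with probability 1 - delta.
joint_success = per_success ** n
print(total_error, joint_success)
```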

Algorithm for computing approximate intervals of φ-frequent probabilities
• Preprocessing: For i = 1 to n, compute at the beginning of the algorithm.
• Step 1: For i = 1 to n, compute an estimated value of the probability of S occurring in Gi that is within error ε/(2n) with probability at least (1 – δ)^{1/n}.
• Step 2: Compute an estimated value X of Pr_{D,φ}(S) using the dynamic programming method with n, φ and the per-graph values above as input.
• Step 3: Return [l, u] = [X – ε/2, X + ε/2].
• Time complexity: O(n^3 m^2 s ln(2n/δ)/ε^2)
• Guarantee: |u – l| ≤ ε and Pr(l ≤ Pr_{D,φ}(S) ≤ u) ≥ 1 – δ

Theoretical guarantees of the mining algorithm
• Any frequent subgraph pattern S is output as a result with probability at least ((1 – δ)/2)^s, where s is the number of edges of S.
• Any infrequent subgraph pattern S with φ-frequent probability less than τ – ε is output as a result with probability at most .

How to set parameter δ?
• To guarantee that any frequent subgraph pattern is output as a result with probability at least 1 – Δ, parameter δ should be at most 1 – 2·(1 – Δ)^{1/ℓ}, where ℓ is the maximum number of edges of frequent subgraph patterns.

Experiments
• Test execution time and approximation quality.
• Dataset
  – Source: the BioGRID database and the STRING database
  – PPI networks of six organisms
Organism        # of vertices   # of edges   Average prob.
Fission yeast   162             300          0.148
Fruit fly       3751            7384         0.456
House mouse     199             286          0.413
Rat             130             178          0.374
Thale cress     513             1168         0.444
Worm            514             960          0.190

Execution time vs. φ and τ (ε = δ = 0.05)
The execution time of the algorithm decreases rapidly as φ and τ get larger, because the number of subgraph patterns that need to be examined by the algorithm decreases significantly as φ and τ increase.

Execution time vs. ε and δ (φ = 0.2, τ = 0.9)
The execution time of the algorithm decreases rapidly as ε and δ get larger. The number of subgraph patterns that need to be examined by the algorithm does not vary significantly, but the running time of the procedure for computing the approximate interval of Pr_{D,φ}(S) is O(n^3 m^2 s ln(2n/δ)/ε^2), which shrinks as ε and δ grow.

Approximation quality vs. ε (δ = 0.05)
• Precision: the proportion of frequent ones in output subgraph patterns
• Recall: the proportion of output ones in frequent subgraph patterns
• #OFS: the number of output frequent subgraph patterns
• #OIS: the number of output infrequent subgraph patterns
• #FS: the number of frequent subgraph patterns (ε = 0.02, δ = 0.001)
• Since precision = #OFS/(#OFS + #OIS), it decreases as ε gets larger.
• Since recall = #OFS/#FS, it is stable and almost independent of ε.
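The two quality measures can be computed directly from the three counts. The function name `precision_recall` and the counts in the example are hypothetical and are not results from the experiments.

```python
def precision_recall(n_ofs, n_ois, n_fs):
    """Compute precision = #OFS / (#OFS + #OIS) and recall = #OFS / #FS."""
    output_total = n_ofs + n_ois
    # Guard against empty denominators; returning 0.0 there is a convention chosen here.
    precision = n_ofs / output_total if output_total else 0.0
    recall = n_ofs / n_fs if n_fs else 0.0
    return precision, recall

# Hypothetical counts: 90 frequent and 10 infrequent patterns output, 100 frequent overall.
precision, recall = precision_recall(n_ofs=90, n_ois=10, n_fs=100)
print(precision, recall)
```

Only #OIS appears in precision and not in recall, which is why a change that admits more infrequent patterns lowers precision while leaving recall unchanged.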

Approximation quality vs. δ (ε = 0.05)
• The precision of the algorithm is almost independent of δ.
• In theory, the recall of the algorithm should decrease as δ increases. However, the experimental results are counterintuitive. This is because the practical failure probability of the algorithm for computing the approximate interval of Pr_{D,φ}(S) is much lower than its theoretical bound.

Summary
• Model of uncertain graph data
  – An uncertain graph represents a probability distribution over all its implicated graphs.
  – An uncertain graph database represents a probability distribution over all its implicated graph databases.
• FSP mining problem on uncertain graph databases under probabilistic semantics
• Hardness results
  – This FSP mining problem is NP-hard.
  – It is #P-hard to compute the φ-frequent probability of a subgraph pattern.
• Algorithm
  – A dynamic programming-based randomized algorithm for computing approximate intervals of φ-frequent probabilities
  – Thorough analysis on global and/or local theoretical guarantees

Future work
• De-randomize the proposed algorithm
• Qualitative evaluation of mining results
• Succinct models of uncertain graph data
• …

References
1. Zhaonian Zou, Jianzhong Li, Hong Gao, Shuo Zhang. Mining Frequent Subgraph Patterns from Uncertain Graph Data. TKDE, 2010.
2. Zhaonian Zou, Jianzhong Li, Hong Gao, Shuo Zhang. Finding Top-k Maximal Cliques in an Uncertain Graph. ICDE, 2010.
3. Zhaonian Zou, Jianzhong Li, Hong Gao, Shuo Zhang. Frequent Subgraph Pattern Mining on Uncertain Graph Data. CIKM, 2009.

Thank you! See you in today’s poster session. For more information, please visit our group at http://db.cs.hit.edu.cn.