gApprox: Mining Frequent Approximate Patterns from a Massive Network
Chen Chen, Xifeng Yan, Feida Zhu, Jiawei Han [ICDM 2007]
Reporter: Che-Wei Liang, 10/16

Outline
• Introduction
• Problem Formulation
• Algorithm
  – Pattern Space Exploration
  – Support Counting
• Experiment
• Conclusions

Introduction
• A set of graphs vs. a single network
• Graphs with massive sizes and complex structures have recently appeared in many applications.
  – Biological networks, social networks, the Web.
  – These demand powerful data mining methods.
• The focus here is on patterns that appear frequently at many different places within a single network.

Introduction
• Protein-Protein Interaction (PPI) network
• Δ = degree of approximation = 5

Two major complications
1. Mining frequent patterns in a single network
  – Partition the network into regions
  – Each region contains one occurrence of the pattern
2. Because of inherent noise and data diversity, it is crucial to account for approximations so that all potentially interesting patterns can be captured.

Problem Formulation

Approximate Pattern Occurrences
• An injective function m: V_P → V_G maps each vertex v ∈ V_P to m(v) ∈ V_G.
• Quantify the degree of approximation that m incurs, i.e., approximations can only happen within the matchable list.
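The two conditions stated on this slide (m is injective, and approximations stay within the matchable list) can be sketched as a small validity check. This is only an illustration under assumptions: the dictionary `matchable`, which lists for every pattern vertex the network vertices it may be matched to, and the function name `is_valid_occurrence` are hypothetical and do not come from the paper. The approximation measure itself is not reproduced here.

```python
def is_valid_occurrence(mapping, matchable):
    """Check a candidate mapping m from pattern vertices to network vertices.

    mapping:   dict, pattern vertex -> network vertex (the function m)
    matchable: dict, pattern vertex -> set of network vertices it may match
               (hypothetical encoding of the "matchable list")
    """
    images = list(mapping.values())
    # Injective: no two pattern vertices share the same network vertex.
    injective = len(images) == len(set(images))
    # Every image must come from the matchable list of its pattern vertex.
    within_lists = all(mapping[v] in matchable[v] for v in mapping)
    return injective and within_lists


# Hypothetical pattern with vertices "a" and "b".
matchable = {"a": {1, 2}, "b": {2, 3}}
print(is_valid_occurrence({"a": 1, "b": 2}, matchable))  # valid mapping
print(is_valid_occurrence({"a": 2, "b": 2}, matchable))  # not injective
print(is_valid_occurrence({"a": 3, "b": 2}, matchable))  # 3 not matchable for "a"
```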

Pattern Support with Approximation

Algorithm
• Two major issues:
1. Pattern Space Exploration
2. Support Counting
  – Enumerate the approximate occurrences of each pattern in the network.
  – Decide the maximal number of disjoint occurrences.

Pattern Space Exploration
• Decompose the pattern space
  – Find all connected vertex sets in G that contain vertex 1.
  – Remove vertex 1 from G, and find all connected vertex sets in the new graph G' that contain vertex 2.
  – And so on.
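The decomposition can be illustrated with a brute-force sketch: group every connected vertex set by its smallest vertex, removing each vertex from the graph once its group is complete. The function names and the adjacency-dictionary encoding are assumptions made for this illustration. The exhaustive subset test is only practical on tiny graphs and is not the exploration procedure of the paper.

```python
from itertools import combinations


def is_connected(vertices, adj):
    """Return True if `vertices` induce a connected subgraph of `adj`."""
    vertices = set(vertices)
    if not vertices:
        return False
    start = next(iter(vertices))
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for w in adj[u]:
            if w in vertices and w not in seen:
                seen.add(w)
                stack.append(w)
    return seen == vertices


def connected_sets_containing(root, allowed, adj):
    """Brute force: connected vertex sets inside `allowed` that contain `root`."""
    others = sorted(allowed - {root})
    result = []
    for k in range(len(others) + 1):
        for extra in combinations(others, k):
            candidate = frozenset((root,) + extra)
            if is_connected(candidate, adj):
                result.append(candidate)
    return result


def decompose(adj):
    """Group all connected vertex sets of the graph by their smallest vertex."""
    allowed = set(adj)
    groups = {}
    for v in sorted(adj):
        groups[v] = connected_sets_containing(v, allowed, adj)
        allowed.remove(v)  # v is removed, so later groups never contain it
    return groups


# Hypothetical path graph 1 - 2 - 3.
path = {1: [2], 2: [1, 3], 3: [2]}
for v, sets in decompose(path).items():
    print(v, [sorted(s) for s in sets])
```

Because each group only uses vertices that have not yet been removed, no connected vertex set can appear in two groups.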

Pattern Space Exploration
• Example: generating all connected vertex sets starting from vertex 1.
  Stage 1. Start from 1 and mark 1.
  Stage 2. Expand from 1 to reach 2, 5, 6. Mark 2, 5, 6. This stage yields seven connected vertex sets in total: {1, 2}, {1, 5}, {1, 6}, {1, 2, 5}, {1, 2, 6}, {1, 5, 6}, {1, 2, 5, 6}.
  Stage 3. Taking each of the seven connected vertex sets from stage 2 as a starting point, continue the expansion.
  Stage 4. Repeat until there are no more unmarked vertices.
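The staged expansion can be sketched as a recursive enumeration. At each stage, every vertex newly reached from the latest additions is marked, and every non-empty subset of those vertices yields one larger set. This is a sketch in the spirit of the stages above and should not be read as the paper's Algorithm 1. The graph `EXAMPLE_ADJ` is hypothetical: the slide's figure is not preserved, so only the fact that vertex 1 is adjacent to 2, 5, and 6 is taken from the text, and the remaining edges are invented for illustration.

```python
from itertools import combinations

# Hypothetical graph: cycle 1-2-3-4-5-1 plus a pendant vertex 6 attached to 1.
EXAMPLE_ADJ = {
    1: [2, 5, 6],
    2: [1, 3],
    3: [2, 4],
    4: [3, 5],
    5: [1, 4],
    6: [1],
}


def connected_sets_from(root, adj, removed=frozenset()):
    """Enumerate connected vertex sets containing `root`, stage by stage.

    Returns a list of (stage, vertex_set) pairs. `removed` holds vertices
    that were deleted from the graph before the search starts.
    """
    results = []

    def expand(current, frontier, marked, stage):
        results.append((stage, current))
        # Unmarked vertices reachable in one step from the newest vertices.
        reachable = sorted({w for u in frontier for w in adj[u]} - marked)
        if not reachable:
            return
        # Mark everything reached, whether or not it ends up being chosen.
        marked = marked | set(reachable)
        for k in range(1, len(reachable) + 1):
            for chosen in combinations(reachable, k):
                expand(current | set(chosen), chosen, marked, stage + 1)

    expand(frozenset([root]), (root,), frozenset(removed) | {root}, 1)
    return results


for stage, vertex_set in connected_sets_from(1, EXAMPLE_ADJ):
    if stage == 2:
        print(sorted(vertex_set))  # the seven stage-2 sets from the example
```

Marking every reached vertex, including those not chosen, is what prevents the same set from being produced along two different expansion paths: a vertex skipped at one stage can never be added later on the same path.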

Theorem 1
Explore() in Algorithm 1 is both complete and redundancy-free, i.e., given a network G:
(1) it generates only connected vertex sets in G;
(2) it generates all connected vertex sets in G;
(3) it never generates the same connected vertex set more than once.

Support Counting
• A pattern P’s support is defined as the maximum number of pairwise “disjoint” occurrences that can be chosen from P’s approximate occurrences in the network.
  – This corresponds to the maximum independent set problem, which is NP-complete.
• Algorithm 2 provides an upper bound.
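To make the support definition concrete, the sketch below computes the exact support by exhaustive search and brackets it between two cheap estimates. None of this is the paper's Algorithm 2: the greedy scan (a lower bound) and the vertex-count bound (an upper bound) are simple stand-ins chosen for illustration, and all function names are hypothetical. Each occurrence is assumed to be given as the set of network vertices it uses, with exactly `pattern_size` vertices per occurrence.

```python
from itertools import combinations


def exact_support(occurrences):
    """Largest number of pairwise vertex-disjoint occurrences (exponential time)."""
    occs = [frozenset(o) for o in occurrences]
    for k in range(len(occs), 0, -1):
        for subset in combinations(occs, k):
            if all(a.isdisjoint(b) for a, b in combinations(subset, 2)):
                return k
    return 0


def greedy_support(occurrences):
    """Lower bound: scan in the given order and keep each occurrence that fits."""
    used, count = set(), 0
    for occ in occurrences:
        if used.isdisjoint(occ):
            used |= set(occ)
            count += 1
    return count


def vertex_count_bound(occurrences, pattern_size):
    """Upper bound: k disjoint occurrences need k * pattern_size distinct vertices."""
    covered = set().union(*occurrences)
    return len(covered) // pattern_size


# Hypothetical occurrences of a 3-vertex pattern.
occurrences = [{3, 4, 5}, {1, 2, 3}, {5, 6, 7}]
print(greedy_support(occurrences))         # lower bound
print(exact_support(occurrences))          # exact value
print(vertex_count_bound(occurrences, 3))  # upper bound
```

On this input the greedy scan picks {3, 4, 5} first and then cannot add anything, while the exact search finds the disjoint pair {1, 2, 3} and {5, 6, 7}. The gap shows why the order of occurrences matters for greedy selection.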

gApprox
• gApprox
  – Combines pattern space exploration with support counting.
  – Conditional branch on the 3rd line of Algorithm 1’s DFS_horizontal() function.

Experiment

Conclusions
• Gives an approximation measure and shows its impact on mining.
  – Counts a pattern’s support based on its approximate occurrences in the network.
• The techniques are general
  – and can be applied to networks from other domains.
• The method can be modified
  – to reach bigger, more interesting patterns even faster
  – with some sacrifice in the completeness of the mining results.