
Association Analysis (7) (Mining Graphs)

Frequent Subgraph Mining
• Extend association rule mining to finding frequent subgraphs
• Useful for Web mining, computational chemistry, spatial data sets, etc.
[Figure: example graph with nodes Homepage, Teaching, Databases, Data Mining]

Bio/Chem-Informatics
• Each year, new chemical compounds are designed.
• The structure of a compound plays a big role in its chemical properties.
• However, it is difficult to establish the exact relationship between structure and properties.
• Frequent subgraph mining can help by identifying the substructures commonly associated with certain properties of known compounds.

Web mining
• E.g., mining the DBLP Web graph
[Figure: a mined subgraph and two examples of matches; node labels: Widom, Jeff Ullman, Calvanese, Vardi, Garcia-Molina, Alfred Aho, Lenzerini, Kuperferman]

Graph Definitions

Mining Subgraphs

The Exhaustive Way… Listing all…

Apriori-Like Approach
• Support: the number of graphs that contain a particular subgraph
• The Apriori principle still holds: a subgraph can be frequent only if all of its subgraphs are frequent
• Level-wise (Apriori-like) approach:
  – Vertex growing: k is the number of vertices
  – Edge growing: k is the number of edges
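
A minimal sketch of support counting, assuming small labelled undirected graphs and a brute-force containment test. The representation and the names `make_graph`, `contains` and `support` are illustrative choices, not something the slides prescribe. It also shows the Apriori principle on a toy database: the larger pattern cannot have higher support than its sub-pattern.

```python
from itertools import permutations

def make_graph(labels, edges):
    """labels: {vertex: label}; edges: iterable of (u, v, edge_label)."""
    return labels, {frozenset((u, v)): lab for u, v, lab in edges}

def contains(graph, pattern):
    """True if `pattern` occurs in `graph` as a (not necessarily induced) subgraph."""
    g_labels, g_edges = graph
    p_labels, p_edges = pattern
    p_nodes = list(p_labels)
    # Try every injective assignment of pattern vertices to graph vertices.
    for image in permutations(g_labels, len(p_nodes)):
        m = dict(zip(p_nodes, image))
        if any(p_labels[v] != g_labels[m[v]] for v in p_nodes):
            continue
        if all(g_edges.get(frozenset(m[v] for v in e)) == lab
               for e, lab in p_edges.items()):
            return True
    return False

def support(db, pattern):
    """Number of graphs in `db` that contain `pattern`."""
    return sum(contains(g, pattern) for g in db)

# Toy database (made-up data, for illustration only).
db = [
    make_graph({1: "a", 2: "b", 3: "c"}, [(1, 2, "p"), (2, 3, "q")]),
    make_graph({1: "a", 2: "b"}, [(1, 2, "p")]),
    make_graph({1: "a", 2: "b", 3: "c"}, [(1, 2, "q"), (2, 3, "q")]),
]
edge_ab = make_graph({"x": "a", "y": "b"}, [("x", "y", "p")])
path_abc = make_graph({"x": "a", "y": "b", "z": "c"},
                      [("x", "y", "p"), ("y", "z", "q")])

print(support(db, edge_ab))   # 2
print(support(db, path_abc))  # 1
```

The containment test tries all vertex assignments, so it is only practical for very small graphs.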

Apriori-Like Algorithm
• Generate candidates
  – Merge pairs of frequent (k-1)-subgraphs to obtain candidate k-subgraphs.
• Prune candidates
  – Discard all candidate k-subgraphs that contain infrequent (k-1)-subgraphs.
• Count support
  – Count the number of graphs in the DB that contain each candidate.
  – Discard all candidate subgraphs whose support counts are less than minsup.
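
The generate / prune / count loop can be sketched as follows, under a strong simplifying assumption that is not in the slides: every vertex label is unique within a graph (as with page or author names), so a graph is just a set of label pairs and containment is a subset test. No isomorphism test is needed in that special case. The helper names `graph`, `is_connected` and `frequent_subgraphs` are hypothetical.

```python
from itertools import combinations

def graph(*edges):
    """A graph as a frozenset of undirected edges between uniquely labelled vertices."""
    return frozenset(tuple(sorted(e)) for e in edges)

def is_connected(edges):
    """True if the (non-empty) edge set forms a single connected piece."""
    edges = list(edges)
    seen = set(edges[0])
    changed = True
    while changed:
        changed = False
        for u, v in edges:
            if (u in seen) != (v in seen):
                seen.update((u, v))
                changed = True
    return all(u in seen for u, _ in edges)

def frequent_subgraphs(db, minsup):
    """Level-wise mining with edge growing: k is the number of edges."""
    def support(cand):
        return sum(cand <= g for g in db)

    # k = 1: frequent single edges.
    level = {frozenset([e]) for g in db for e in g}
    level = {c for c in level if support(c) >= minsup}
    result = {}
    k = 1
    while level:
        for c in level:
            result[c] = support(c)
        k += 1
        # Generate: merge pairs of frequent (k-1)-subgraphs that share k-2 edges.
        candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == k}
        # Prune: keep connected candidates whose connected (k-1)-subgraphs are all frequent.
        candidates = {
            c for c in candidates
            if is_connected(c)
            and all(c - {e} in level for e in c if is_connected(c - {e}))
        }
        # Count support and discard candidates below minsup.
        level = {c for c in candidates if support(c) >= minsup}
    return result

# Toy database (made-up data, for illustration only).
db = [
    graph(("A", "B"), ("B", "C"), ("C", "D")),
    graph(("A", "B"), ("B", "C"), ("A", "C")),
    graph(("A", "B"), ("B", "C")),
]
for sub, sup in frequent_subgraphs(db, minsup=2).items():
    print(sorted(sub), sup)
```

With repeated vertex labels, the subset test would have to be replaced by a subgraph isomorphism test and duplicate candidates would have to be detected, which is where the later slides on isomorphism come in.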

Vertex Growing
The resulting matrix is the first matrix, appended with the last row and last column of the second matrix. The remaining entries of the new matrix are either zero or replaced by all valid edge labels connecting the pair of vertices.
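
A sketch of this matrix merge, assuming both (k-1)-vertex matrices list the shared (k-2)-vertex core first and their one extra vertex last. The function name `vertex_grow`, the use of 0 for "no edge", and the example matrices are assumptions made for illustration.

```python
def vertex_grow(m1, m2, edge_labels=()):
    """Merge two (k-1) x (k-1) adjacency matrices that share a (k-2)-vertex core.

    Returns one k x k candidate for each possible value of the entry between
    the two non-core vertices: 0 (no edge) or one of `edge_labels`.
    """
    n = len(m1)
    core = n - 1
    # The top-left (k-2) x (k-2) blocks must agree.
    if any(m1[i][j] != m2[i][j] for i in range(core) for j in range(core)):
        raise ValueError("matrices do not share a common core")
    candidates = []
    for fill in (0, *edge_labels):
        # Start from the first matrix, padded with one extra row and column.
        m = [row[:] + [0] for row in m1] + [[0] * (n + 1)]
        # Append the last row and last column of the second matrix.
        for i in range(core):
            m[i][n] = m2[i][core]
            m[n][i] = m2[core][i]
        m[n][n] = m2[core][core]
        # Remaining entries: the pair formed by the two non-core vertices.
        m[core][n] = m[n][core] = fill
        candidates.append(m)
    return candidates

# Core: vertices 0 and 1 joined by an edge labelled "p".
m1 = [[0, "p", "q"],
      ["p", 0, 0],
      ["q", 0, 0]]    # extra vertex attached to vertex 0 by "q"
m2 = [[0, "p", 0],
      ["p", 0, "r"],
      [0, "r", 0]]    # extra vertex attached to vertex 1 by "r"

for cand in vertex_grow(m1, m2, edge_labels=("p",)):
    for row in cand:
        print(row)
    print()
```

Each returned matrix is only a candidate; it still has to pass pruning and support counting.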

Edge Growing
Edge growing inserts a new edge into an existing frequent subgraph during candidate generation. It does not necessarily increase the number of vertices in the original graph.
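
A small illustration of the second point, assuming a graph is stored as a set of (vertex pair, edge label) entries. The helper names `add_edge` and `vertices` are hypothetical. Adding an edge between two existing vertices keeps the vertex count, while adding an edge to a new vertex increases it.

```python
def add_edge(edges, u, v, label):
    """Return a new edge set with the labelled edge {u, v} inserted."""
    return edges | {(frozenset((u, v)), label)}

def vertices(edges):
    """All vertices touched by at least one edge."""
    return set().union(*(pair for pair, _ in edges))

path = {(frozenset((1, 2)), "p"), (frozenset((2, 3)), "q")}

closed = add_edge(path, 1, 3, "r")      # joins two existing vertices
extended = add_edge(path, 3, 4, "r")    # introduces vertex 4

print(len(vertices(path)), len(vertices(closed)), len(vertices(extended)))  # 3 3 4
```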

Topological equivalence
Two vertices are topologically equivalent if they have:
1. The same label, and
2. The same number and labels of edges incident to them.
[Figure: example graphs, captioned in order: "v1, v2, v3, v4 are topologically equivalent"; "v1, v4 are topologically equivalent"; "v2, v3 are topologically equivalent"; "No topologically equivalent vertices"]
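
The two conditions can be checked directly by grouping vertices on a signature made of the vertex label and the sorted labels of its incident edges. This is a sketch of the definition exactly as stated on the slide; the function name `equivalent_vertices` and the two example graphs are assumptions.

```python
from collections import defaultdict

def equivalent_vertices(labels, edges):
    """Group vertices that share a label and the same multiset of incident edge labels.

    labels: {vertex: label}; edges: iterable of (u, v, edge_label).
    Returns only the groups that contain more than one vertex.
    """
    incident = defaultdict(list)
    for u, v, lab in edges:
        incident[u].append(lab)
        incident[v].append(lab)
    groups = defaultdict(list)
    for v, vlabel in labels.items():
        signature = (vlabel, tuple(sorted(incident[v])))
        groups[signature].append(v)
    return [sorted(g) for g in groups.values() if len(g) > 1]

labels = {1: "a", 2: "a", 3: "a", 4: "a"}
cycle = [(1, 2, "p"), (2, 3, "p"), (3, 4, "p"), (4, 1, "p")]
path = [(1, 2, "p"), (2, 3, "p"), (3, 4, "p")]

print(equivalent_vertices(labels, cycle))  # [[1, 2, 3, 4]]
print(equivalent_vertices(labels, path))   # [[1, 4], [2, 3]]
```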

Multiplicity of Candidates
Core: the (k-2)-edge subgraph that is common between the joined graphs. We try to map the cores.

Case 1a: v […] v′, v1 […] v2 (topologically, in the (k-2)-graphs)
[Figure: the two (k-1)-subgraphs being joined and the resulting candidate(s)]

Case 1b: v […] v′, v1 = v2 (topologically, in the (k-2)-graphs)
[Figure: the two (k-1)-subgraphs being joined and the resulting candidate(s)]

Case 2a: v […] v′, v1 […] v2 (topologically, in the (k-2)-graphs)
[Figure: the two (k-1)-subgraphs being joined and the resulting candidate(s)]

Case 2b: v […] v′, v1 = v2 (topologically, in the (k-2)-graphs)
[Figure: the two (k-1)-subgraphs being joined and the resulting candidate(s)]

Case 2c: v […] v′ (topologically, in the (k-2)-graphs)
We try to map the cores, and there are two ways to do this.
[Figure: the two (k-1)-subgraphs being joined and the resulting candidate(s)]

Case 2d: v […] v′ (topologically, in the (k-2)-graphs)
We try to map the cores, and there are two ways to do this.
[Figure: the two (k-1)-subgraphs being joined and the resulting candidate(s)]

More than two topologically equivalent vertices
Core: the (k-2)-subgraph that is common between the joined graphs.
[Figure: the two subgraphs being joined and the resulting candidate(s)]

Adjacency Matrix Representation
[Figure: two different adjacency matrices for the same graph, each with rows and columns A(1), A(2), A(3), A(4), B(5), B(6), B(7), B(8)]
• The same graph can be represented in many ways
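
To make the point concrete, here is a short sketch that relists the vertices of a 3-vertex path in every possible order and collects the resulting adjacency matrices. The helper name `permute` and the example graph are assumptions, not taken from the slide's figure.

```python
from itertools import permutations

def permute(matrix, order):
    """Adjacency matrix of the same graph with its vertices listed in `order`."""
    return [[matrix[i][j] for j in order] for i in order]

path = [[0, 1, 0],
        [1, 0, 1],
        [0, 1, 0]]

reordered = permute(path, [1, 0, 2])
distinct = {tuple(map(tuple, permute(path, o))) for o in permutations(range(3))}

print(reordered)      # [[0, 1, 1], [1, 0, 0], [1, 0, 0]]
print(len(distinct))  # 3
```

Even this tiny graph has three different adjacency matrices, one for each position the middle vertex can take.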

Graph Isomorphism
• A graph G1 is isomorphic to another graph G2 if G1 is topologically equivalent to G2
• A test for graph isomorphism is needed:
  – During candidate generation, to determine whether a candidate can be generated
  – During candidate pruning, to check whether its (k-1)-subgraphs are frequent
  – During candidate counting, to check whether a candidate is contained within another graph; here we should use more specialized algorithms (possibly using indexes with each frequent (k-1)-subgraph)

Codes
[Figure: two adjacency matrices with rows and columns A(1), A(2), A(3), A(4), B(5), B(6), B(7), B(8), each shown with its code]
• First matrix: Code = 1 10 011 1000 01001 001010 0001011
• Second matrix: Code = 10110100000100110001110
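
A sketch of one way to turn a matrix into a code string, assuming the code concatenates the upper-triangle entries column by column. That reading is an assumption, chosen because it produces groups of 1, 2, ..., n-1 digits, like the grouping of the first code on the slide. The function name `matrix_code` is hypothetical.

```python
def matrix_code(matrix):
    """Concatenate the entries above the diagonal, column by column."""
    n = len(matrix)
    return "".join(str(matrix[i][j]) for j in range(1, n) for i in range(j))

# Two adjacency matrices of the same 3-vertex path, with different vertex orders.
path_a = [[0, 1, 0],
          [1, 0, 1],
          [0, 1, 0]]
path_b = [[0, 1, 1],
          [1, 0, 0],
          [1, 0, 0]]

print(matrix_code(path_a))  # 101
print(matrix_code(path_b))  # 110
```

The two codes differ even though the graphs are the same, which is why a canonical choice among the codes is needed.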

Graph Isomorphism
• Use canonical labeling to handle isomorphism
  – Map each graph into an ordered string representation (known as its code) such that two isomorphic graphs will be mapped to the same canonical encoding
• Example:
  – Choose the string representation with the lowest lexicographic value
• Then, the graph isomorphism problem can be solved by string matching.
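
A brute-force sketch of canonical labeling for small unlabelled graphs: compute the code for every vertex ordering, keep the lexicographically smallest, and compare strings. The column-by-column upper-triangle code and the names `code`, `canonical_code` and `isomorphic` are assumptions. Trying all orderings takes factorial time, so practical miners need something smarter than this.

```python
from itertools import permutations

def code(matrix, order):
    """Upper-triangle code (column by column) with the vertices listed in `order`."""
    n = len(order)
    return "".join(str(matrix[order[i]][order[j]])
                   for j in range(1, n) for i in range(j))

def canonical_code(matrix):
    """Lexicographically smallest code over all vertex orderings."""
    return min(code(matrix, order) for order in permutations(range(len(matrix))))

def isomorphic(m1, m2):
    """Isomorphism test reduced to comparing canonical strings."""
    return len(m1) == len(m2) and canonical_code(m1) == canonical_code(m2)

path_a = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
path_b = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]
triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]

print(canonical_code(path_a), canonical_code(path_b))  # 011 011
print(isomorphic(path_a, path_b))    # True
print(isomorphic(path_a, triangle))  # False
```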