Mining Distributed Databases
Raj Bhatnagar
University of Cincinnati

Distributed Databases
D = D1 X D2 X ... X Dn
[Figure: geographically distributed nodes D1 (attributes A, B, C), D2 (C, D, E), ..., Dn (A, E, G)]
- D is implicitly specified
Goal: Discover patterns in the implicit D, using the explicit Di’s
Limitations:
- Can’t move Di’s to a common site
- Size / communication cost / privacy
- Can’t update local databases
- Can’t send actual data tuples

Explicit and Implicit Databases
[Figure: explicit component databases at Node 1 (attributes A, B, C), Node 2 (D, C, F), and Node 3 (A, E, C); the Shared Set over attributes A and C; and the implicit database over attributes A, B, C, D, E, F]

Decomposition of Computations
[Figure: D1 (attributes A, B, C), D2 (C, D, E), ..., Dn (A, E, G)]
- Since D is implicit,
- For a computation:
- Decompose F into G and g’s
- Decomposition depends on
  - F
  - Di’s
  - Set of shared attributes

Decomposition of Computations
Computational primitives
– Arithmetic primitives
  • Count of tuples in implicit D
  • Mean value of an attribute in D
  • Informational entropy for a subset of D
  • Covariance matrix for D
– Non-numeric primitives
  • Median value of an attribute in D
  • Sorting subsets of tuples in D

Decomposition of Computations
• Computational cost of decomposition
  – Communication cost
    • Number of messages exchanged
  – Number of database queries
• Who does the decomposition?
  – Algorithm itself, at run time
  – Depending on the nature of overlap in Di’s

Count All Tuples in Implicit D
Can be decomposed as:
  $N(D) = \sum_{J} \prod_{t=1}^{n} N(D_t)_{cond_J}$
– $cond_J$: Jth tuple in Shareds
– $n$: number of participating databases (Di’s)
– $N(D_t)_{cond_J}$: count of tuples in $D_t$ satisfying $cond_J$
– Local computation: $g_t(D_t, cond_J) = N(D_t)_{cond_J}$
– G is a sum-of-products
[Figure: Shareds table over attributes A and C; L attributes, k values each, $k^L$ tuples]
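A minimal Python sketch of the sum-of-products count, assuming the implicit D is the natural join of the Di's and that Shareds enumerates every value combination of the attributes held by more than one node. The names `shared_attributes`, `local_count`, `implicit_count`, and the sample `NODES` data are hypothetical, introduced only for illustration.

```python
from itertools import product
from math import prod

def shared_attributes(nodes):
    """Attributes that occur in more than one node's schema."""
    seen, shared = set(), set()
    for rows in nodes:
        attrs = set(rows[0]) if rows else set()
        shared |= seen & attrs
        seen |= attrs
    return sorted(shared)

def local_count(rows, cond):
    """Local computation g_t: count local tuples that agree with cond
    on the shared attributes this node actually holds."""
    return sum(
        all(row[a] == v for a, v in cond.items() if a in row)
        for row in rows
    )

def implicit_count(nodes):
    """G: sum over Shareds tuples of the product of the local counts."""
    shared = shared_attributes(nodes)
    # Assumed: each shared attribute's domain is the set of values seen locally.
    domains = [
        sorted({row[a] for rows in nodes for row in rows if a in row})
        for a in shared
    ]
    return sum(
        prod(local_count(rows, dict(zip(shared, values))) for rows in nodes)
        for values in product(*domains)
    )

# Hypothetical sample data: three nodes with overlapping schemas.
NODES = [
    [{"A": 1, "B": 2, "C": 1}, {"A": 1, "B": 6, "C": 2}, {"A": 2, "B": 1, "C": 1}],
    [{"C": 1, "D": 5}, {"C": 1, "D": 7}, {"C": 2, "D": 9}],
    [{"A": 1, "E": 3}, {"A": 2, "E": 4}, {"A": 2, "E": 5}],
]

print(implicit_count(NODES))  # 7, the size of the natural join
```

Only integer counts cross node boundaries in this sketch; no data tuple leaves its node, which matches the stated limitations.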

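Implementing Decomposed Computations
[Figure: two implementations over D1 (attributes A, B, C), D2 (C, D, E), ..., Dn (A, E, G), and Dx. Stationary Agents: an agent (A) at each node, exchanging messages. Mobile Agents: an aglet that hops from node to node.]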

Implementation of Count(D)
Stationary Agents (Request / Send Summaries)
- Simple SQL interface
  - 1 count / message
  - l attributes having k values each
  - Messages exchanged:
- Query-code interface
  - counts/message
  - l attributes having k values each
  - Messages exchanged:
Mobile Agents
- Number of hops:
[Figure: D1 (attributes A, B, C), D2 (C, D, E), ..., Dn (A, E, G); Shareds table over attributes A and C; L attributes, k values each]

Implementation of Count(D-test)
Stationary Agents
- Simple SQL interface
  - Messages exchanged:
- Query-code interface
  - Messages exchanged:
Mobile Agents
- Number of hops:
[Figure: Shareds table over attributes A and C; L attributes, k values each]

Average Value of an attribute in D
Compute counts for each value of an attribute:
Stationary Agents
- Simple SQL interface
  - Messages exchanged: (1 integer/message)
- Query-code interface
  - Messages exchanged: ( integers/message)
Mobile Agents
- Number of hops:
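A hedged sketch of how the average could be assembled from per-value counts, assuming again that the implicit D is the natural join of the Di's. The helpers `count_where` and `implicit_mean` and the sample `NODES` data are hypothetical names used only for this illustration.

```python
from itertools import product
from math import prod

def local_count(rows, cond):
    """Count local tuples that agree with cond on the attributes this node holds."""
    return sum(all(r[a] == v for a, v in cond.items() if a in r) for r in rows)

def count_where(nodes, shared, extra):
    """Count tuples of the implicit D that satisfy `extra`, from local counts only."""
    domains = [sorted({r[a] for rows in nodes for r in rows if a in r}) for a in shared]
    total = 0
    for values in product(*domains):
        cond = dict(zip(shared, values))
        if any(cond.get(a, v) != v for a, v in extra.items()):
            continue  # `extra` contradicts this Shareds tuple
        cond.update(extra)
        total += prod(local_count(rows, cond) for rows in nodes)
    return total

def implicit_mean(nodes, shared, attr):
    """Mean of `attr` over the implicit D, as a count-weighted average of its values."""
    values = sorted({r[attr] for rows in nodes for r in rows if attr in r})
    counts = {v: count_where(nodes, shared, {attr: v}) for v in values}
    n = sum(counts.values())
    if n == 0:
        raise ValueError("implicit database is empty")
    return sum(v * c for v, c in counts.items()) / n

# Hypothetical sample data; the shared attributes are A and C.
NODES = [
    [{"A": 1, "B": 2, "C": 1}, {"A": 1, "B": 6, "C": 2}, {"A": 2, "B": 1, "C": 1}],
    [{"C": 1, "D": 5}, {"C": 1, "D": 7}, {"C": 2, "D": 9}],
    [{"A": 1, "E": 3}, {"A": 2, "E": 4}, {"A": 2, "E": 5}],
]

print(implicit_mean(NODES, ["A", "C"], "B"))  # 2.0
```

The learner only ever receives integers, one count per (Shareds tuple, attribute value) pair per node.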

Exception Tuples
• Database of interest may exclude some tuples of D
• Learning site keeps a relation E of exception tuples
  – E may have explicit tuples
  – E may have rules to generate exception tuples
[Figure: explicit databases at Node 1 (attributes A, B, C), Node 2 (D, C, F), and Node 3 (A, E, C); the Shared Set over attributes A and C; and an Exceptions relation over attributes B and E]

Computing Informational Entropy
Consists of various counts only:
Stationary agent / Simple SQL interface: Messages exchanged: (1 integer/message)
Stationary agent / Query-code interface: Messages exchanged: ( integers/message)
Mobile agent: Number of hops:
[Number of messages/hops is independent of the size of D]
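Because entropy consists of counts only, the learner can compute it once the per-value counts have arrived. A minimal sketch, assuming those counts were already obtained through the decomposed count primitive; `entropy_from_counts` is a hypothetical name.

```python
from math import log2

def entropy_from_counts(counts):
    """Shannon entropy (in bits) of a distribution given as value -> count."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Zero counts contribute nothing and are skipped to avoid log2(0).
    return -sum((c / total) * log2(c / total) for c in counts.values() if c > 0)

# Counts for a two-valued class attribute, e.g. gathered from the nodes.
print(entropy_from_counts({"yes": 2, "no": 2}))  # 1.0
```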

Decomposition of Algorithms
• Arithmetic primitives are 1-step decompositions
  – Counts, averages, entropy
• Algorithms involve
  – Arithmetic primitives
  – Non-numeric primitives
  – Control structure
• Decomposition studied for
  – Decision tree induction algorithm
  – Mining of association rules
• Control structure is unaltered
• Primitive computations are decomposed
[Figure: D1 (attributes A, B, C), D2 (C, D, E), ..., Dn (A, E, G), and the Learner Node Dx, which holds the control structure, decomposition, and composition]

Building a Decision Tree
To induce a decision tree having:
- d levels; m attributes in n databases;
- k values/attribute; l shared attributes
Stationary agent / Simple SQL interface: Messages exchanged: (1 integer/message)
Stationary agent / Query-code interface: Messages exchanged: ( integers/message)
Mobile agent: hops
[Number of messages/hops is independent of the size of D]

Mining Association Rules
Main operations:
- Enumerate item-sets
- Compute support/confidence
- Basic computation: Count-of-tuples
Communication Complexity:
- m (avg.) item sets at each level of enumeration tree
- j levels of enumeration tree
Number of Counts Needed:
- Query-code can count for all item sets at a level simultaneously
- Therefore, we need: messages, or hops
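Since count-of-tuples is the basic computation, support and confidence can be written against an abstract counting primitive, leaving the control structure untouched. A minimal sketch: `rule_stats` is a hypothetical name, and `explicit_count` is a stand-in counter over a small explicit table, used here only so the example runs; in the distributed setting it would be replaced by the decomposed count.

```python
from typing import Callable, Dict, List, Tuple

Condition = Dict[str, int]

def rule_stats(
    count: Callable[[Condition], int],
    antecedent: Condition,
    consequent: Condition,
) -> Tuple[float, float]:
    """Return (support, confidence) of antecedent => consequent from counts alone."""
    total = count({})
    both = count({**antecedent, **consequent})
    ante = count(antecedent)
    support = both / total if total else 0.0
    confidence = both / ante if ante else 0.0
    return support, confidence

# Stand-in for the decomposed count: an explicit table, for demonstration only.
ROWS: List[Condition] = [
    {"A": 1, "D": 5},
    {"A": 1, "D": 5},
    {"A": 1, "D": 7},
    {"A": 2, "D": 5},
]

def explicit_count(cond: Condition) -> int:
    return sum(all(r[a] == v for a, v in cond.items()) for r in ROWS)

support, confidence = rule_stats(explicit_count, {"A": 1}, {"D": 5})
print(support, confidence)  # 0.5 and 2/3
```

Each rule needs three counts here; a query-code interface that batches all item sets at one level would return them together.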

More Complex Computations
• Covariance matrix for D
  – Useful for eigenvectors/principal components
  – Needs second-order moments
• Graph/Network algorithms
  – Each node has part of a graph
  – Some nodes are shared
    • Determine MST
    • Paths of min/max flow
    • Flow patterns

Sum of Products
• Sum of products for two attributes:
• There are six different ways in which x and y may be distributed
• Each requires a different decomposition
  – Case 1: x same as y; and x belongs to the Shared Set.
  – Case 2: x same as y; and x does not belong to the Shared Set.
  – Case 3: x and y both belong to the Shared Set.

Sum of Products
  – Case 4: x belongs to Shared Set and y does not.
  – Case 5: x, y don’t belong to the Shared Set and reside on different nodes.
    • For each tuple t in Shared Set, obtain
    • and then
  – Case 6: x, y don’t belong to the Shared Set and reside on the same node.
    where Prod(t) is the average of the product of x and y for cond-t of Shared Set
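One way Case 5 could be decomposed, sketched under the assumption that the implicit D is the natural join of the Di's: for each Shared Set tuple t, the node holding x returns its local sum of x, the node holding y returns its local sum of y, every other node returns a count, and the learner multiplies these and adds over t. This is an illustrative reconstruction using local sums, not necessarily the exact formulation on the slide; `local_summary`, `sum_of_products`, and `NODES` are hypothetical names.

```python
from itertools import product
from math import prod

def matches(row, cond):
    """True if the row agrees with cond on the attributes it holds."""
    return all(row[a] == v for a, v in cond.items() if a in row)

def local_summary(rows, cond, x, y):
    """One number per node: sum of x, sum of y, or a plain count."""
    local = [r for r in rows if matches(r, cond)]
    if not local:
        return 0
    if x in local[0]:
        return sum(r[x] for r in local)
    if y in local[0]:
        return sum(r[y] for r in local)
    return len(local)

def sum_of_products(nodes, shared, x, y):
    """Sum of x*y over the implicit D; x and y are non-shared, on different nodes."""
    domains = [sorted({r[a] for rows in nodes for r in rows if a in r}) for a in shared]
    return sum(
        prod(local_summary(rows, dict(zip(shared, values)), x, y) for rows in nodes)
        for values in product(*domains)
    )

# Hypothetical sample data; the shared attributes are A and C.
NODES = [
    [{"A": 1, "B": 2, "C": 1}, {"A": 1, "B": 6, "C": 2}, {"A": 2, "B": 1, "C": 1}],
    [{"C": 1, "D": 5}, {"C": 1, "D": 7}, {"C": 2, "D": 9}],
    [{"A": 1, "E": 3}, {"A": 2, "E": 4}, {"A": 2, "E": 5}],
]

print(sum_of_products(NODES, ["A", "C"], "B", "D"))  # 102
```

Dividing such sums by the tuple count gives the second-order moments that a covariance matrix needs.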

Self-decomposing Algorithms
• Easy decomposability of arithmetic primitives
  – Average/Covariance matrix/Entropy
• Control structure of algorithms is not altered
  – More gains possible, by altering control structure
• Decomposition is driven by the set of shared attributes
• Algorithm can determine shared attributes in n messages/hops
• Algorithms decompose in accordance with attribute sharing
  – No human intervention needed
• Message complexity is independent of sizes of databases
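A small sketch of how an algorithm might determine the shared attributes on its own, assuming each of the n nodes reports its attribute list in one message (or one hop) and that an attribute counts as shared when it appears at more than one node. The function name and the `schemas` mapping are hypothetical.

```python
from collections import Counter

def shared_attributes(schemas):
    """Attributes reported by more than one node.

    `schemas` maps a node name to the attribute list that node reported.
    """
    seen = Counter(a for attrs in schemas.values() for a in set(attrs))
    return sorted(a for a, n in seen.items() if n > 1)

# One schema message per node, using the node layout from the figures.
schemas = {
    "D1": ["A", "B", "C"],
    "D2": ["C", "D", "E"],
    "Dn": ["A", "E", "G"],
}
print(shared_attributes(schemas))  # ['A', 'C', 'E']
```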

Continuing Work
Determine patterns of flow in a network
– Communication network traffic
– Geographic/economic flows
[Figure: local flow data]