XML indexing – A(k) indices
- Ragini Rahalkar
- Roshith Rajagopal

12/21/2021 CSE 636

Outline
- Introduction
- Motivation
- Labeled graph and index graph
- Bisimilarity and A(k) index
- Construction of A(k) index
- Query Evaluation
- Approximate index handling
- Implementation and testing
- Summary

Introduction
- Structural summaries
- Evaluating Path Expressions
- A(k) index
  - Indexing scheme for large graph data like XML
  - Not all structure is interesting
  - Paths longer than k
  - Smaller and faster
  - Schemaless data
  - Competitive for arbitrary path expressions

Prior Schemes
- 1-index [Milo, Suciu 1999]
  - NFA rather than DFA (smaller)
  - Split graph nodes into equivalence classes based on incoming paths from the root
  - Go for refinements (approximations)
    - similarity
    - bisimilarity

Limitations of Prior Work
- Size
  - Each and every path is indexed, which is not necessary (does not exploit local similarity)
  - The 1-index can be big too!
- Designed to answer queries involving arbitrarily complex paths, but...
  - such paths may never show up in queries

Labeled Graph
- G = (VG, EG, root, ΣG, label, oid, value)
- Node path and label path
- Path expression
- Regular language
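The difference between a node path and a label path can be sketched as follows. This is a minimal illustration, not the presenters' implementation: the class name `LabeledGraph`, its methods, and the seven-node example graph are all assumptions, and the `value` component of G is left out for brevity.

```python
from dataclasses import dataclass


@dataclass
class LabeledGraph:
    """Hypothetical container for a labeled graph; node ids play the role of oids."""
    label: dict  # node id -> label drawn from the alphabet
    edges: set   # (parent, child) pairs
    root: int

    def is_node_path(self, nodes):
        # A node path is a sequence of nodes joined by consecutive edges.
        return all((a, b) in self.edges for a, b in zip(nodes, nodes[1:]))

    def label_path(self, nodes):
        # The label path of a node path is the sequence of its nodes' labels.
        if not self.is_node_path(nodes):
            raise ValueError("not a node path in this graph")
        return [self.label[n] for n in nodes]


# Assumed example graph: 1(A) -> 2(B), 3(C); 2 -> 4(D); 3 -> 5(D); 4 -> 6(E); 5 -> 7(E)
g = LabeledGraph(
    label={1: "A", 2: "B", 3: "C", 4: "D", 5: "D", 6: "E", 7: "E"},
    edges={(1, 2), (1, 3), (2, 4), (3, 5), (4, 6), (5, 7)},
    root=1,
)
print(g.label_path([1, 2, 4]))
```

Two different node paths, such as 1, 2, 4, 6 and 1, 3, 5, 7, can end in nodes with the same label while having different label paths, which is what the index has to distinguish.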

Index Graph I(G)
- Extent of a node
- Regular expression execution with I(G)
- Safe extent mapping
  - Containment of results of path expressions
- Precise index graph
- 1-index graph
  - Never bigger than the data graph
  - Can be computed in O(m log n)

Notion of Bisimilarity
- Symmetric, binary relation
- For two nodes u and v, u ≈b v if
  - u and v have the same label
  - If u' is a parent of u, then there is a parent v' of v such that u' ≈b v', and vice versa
- Objects 8 and 9 are bisimilar
- Objects 21 and 23 are not bisimilar

The A(k) index
- Local similarity
- Using equivalence class partition
  - Grouping according to labels
- Notion of false paths
  - Classification by label: business and cultural
- Trades absolute precision for grouping of similar data, so the index size changes with the value of k

K-bisimilarity
- Defined inductively, for any two nodes u and v:
  - u ≈0 v iff u and v have the same label
  - u ≈k v iff
    - u ≈k-1 v, and
    - for every parent u' of u there is a parent v' of v such that u' ≈k-1 v', and vice versa

[Figure: a data graph G with nodes 1 (A), 2 (B), 3 (C), 4 (D), 5 (D), 6 (E), 7 (E), shown next to its A(0), A(1) and A(2) index graphs. A(0) groups nodes 4, 5 and nodes 6, 7; A(1) still groups nodes 6, 7; A(2) = 1-index.]
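The inductive definition of k-bisimilarity can be checked directly on a small graph. The sketch below is illustrative only: the function name `k_bisimilar` is made up, and the edge list is an assumption reconstructed from the seven-node A(0)/A(1)/A(2) example (1 is the parent of 2 and 3, 2 of 4, 3 of 5, 4 of 6, 5 of 7).

```python
from functools import lru_cache

# Assumed example graph: node id -> label, edges point from parent to child.
LABEL = {1: "A", 2: "B", 3: "C", 4: "D", 5: "D", 6: "E", 7: "E"}
EDGES = [(1, 2), (1, 3), (2, 4), (3, 5), (4, 6), (5, 7)]

PARENTS = {n: [] for n in LABEL}
for parent, child in EDGES:
    PARENTS[child].append(parent)


@lru_cache(maxsize=None)
def k_bisimilar(u, v, k):
    """Return True if u and v are k-bisimilar under the inductive definition."""
    if LABEL[u] != LABEL[v]:
        return False
    if k == 0:
        return True  # 0-bisimilarity only compares labels
    if not k_bisimilar(u, v, k - 1):
        return False
    # Every parent of u needs a (k-1)-bisimilar parent of v, and vice versa.
    forth = all(any(k_bisimilar(pu, pv, k - 1) for pv in PARENTS[v])
                for pu in PARENTS[u])
    back = all(any(k_bisimilar(pu, pv, k - 1) for pu in PARENTS[u])
               for pv in PARENTS[v])
    return forth and back


for k in range(3):
    print(k, k_bisimilar(4, 5, k), k_bisimilar(6, 7, k))
```

On this graph, nodes 4 and 5 separate at k = 1 because their parents carry different labels, while nodes 6 and 7 only separate at k = 2, which agrees with the groupings in the three index graphs.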

A(k) index properties
- If nodes u and v are k-bisimilar, then the set of label-paths of length k into them is the same.
- The set of label-paths of length k into an A(k)-index node is the set of label-paths of length k into any node in its extent.
- The A(k)-index is precise for any simple path expression of length less than or equal to k.
- The A(k)-index is safe, i.e., its result on a path expression always contains the graph result for that query.
- The (k + 1)-bisimulation is either equal to or is a refinement of the k-bisimulation.
- Let v, x, y be three nodes such that the shortest path to x from v or to y from v contains more than k edges. If an edge is added or deleted going from a node u to v, this update does not affect the k-bisimilarity relationship between x and y.

A(k) index construction
- Partitioning: compute_k_bisim
  - Notion of successor of a node
  - Notion of stability
  - Two sets of nodes A and B: partition A as
    - A ∩ Succ(B)
    - A - Succ(B)
  - Computation of the (k+1)-bisimulation from the k-bisimulation
    - A copy of the k-bisimulation is split into equivalence classes until they are stable with respect to the equivalence classes of the k-bisimulation
  - Time O(km), space O(m), where m is the number of edges

Compute_k_bisim(G, k)

Begin
    Q and X are each a list of node-sets
    Q = partition VG by label
    X = (a copy of) Q
    for i = 1 to k do
        foreach X1 in X do
            compute Succ(X1)
            foreach Q1 in Q do
                replace Q1 by Q1 ∩ Succ(X1) and Q1 - Succ(X1)
        if there was no split then
            break
        X = (a copy of) Q
End
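The pseudocode above could be turned into runnable Python roughly as follows. This is a sketch under assumptions, not the presenters' code: the graph is passed as a label dictionary plus a parent-to-child edge list, the names `succ` and `compute_k_bisim` simply mirror the pseudocode, and the example graph is the assumed seven-node graph (1 A, 2 B, 3 C, 4 D, 5 D, 6 E, 7 E).

```python
def succ(nodes, children):
    """Succ(X): every node that has at least one parent in X."""
    return {c for n in nodes for c in children.get(n, ())}


def compute_k_bisim(labels, edges, k):
    """Return the k-bisimulation partition as a sorted list of sorted node lists."""
    children = {}
    for parent, child in edges:
        children.setdefault(parent, set()).add(child)

    # Q = partition of the nodes by label (the 0-bisimulation).
    by_label = {}
    for node, label in labels.items():
        by_label.setdefault(label, set()).add(node)
    q = [frozenset(block) for block in by_label.values()]

    for _ in range(k):
        x = list(q)  # X = copy of Q, i.e. the partition from the previous round
        split_happened = False
        for x1 in x:
            s = succ(x1, children)
            refined = []
            for q1 in q:
                inside, outside = q1 & s, q1 - s
                if inside and outside:  # Q1 is not stable with respect to X1
                    split_happened = True
                    refined += [inside, outside]
                else:
                    refined.append(q1)
            q = refined
        if not split_happened:
            break  # the partition is already stable; larger k changes nothing
    return sorted(sorted(block) for block in q)


LABELS = {1: "A", 2: "B", 3: "C", 4: "D", 5: "D", 6: "E", 7: "E"}
EDGES = [(1, 2), (1, 3), (2, 4), (3, 5), (4, 6), (5, 7)]
for k in range(3):
    print(k, compute_k_bisim(LABELS, EDGES, k))
```

Each round splits the current blocks against the successor sets of the previous round's blocks, so two nodes stay together only if they have parents in exactly the same previous-round classes.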

Compute_A(k)_index(G, k)

Begin
    Compute_k_bisim(G, k)
    foreach equiv. class in k-bisimulation do
        create an index node I
        ext[I] = data nodes in the equivalence class
    foreach edge from u to v in G do
        I[u] = index node containing u
        I[v] = index node containing v
        if there is no edge from I[u] to I[v] then
            add an edge from I[u] to I[v]
End
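Once a k-bisimulation partition is available, building the index graph is a single pass over the data edges. The sketch below assumes the partition is handed in as a list of node lists; the function name `build_index_graph` and the integer index-node ids are illustrative choices, and the example partitions are the assumed A(0) and A(1) partitions of the seven-node example graph.

```python
def build_index_graph(partition, edges):
    """Return (ext, index_edges) for a partition of the data nodes.

    ext maps an index-node id to its extent; index_edges is a set of
    (index parent, index child) pairs, so duplicate edges collapse.
    """
    ext = {i: set(block) for i, block in enumerate(partition)}
    index_of = {node: i for i, block in ext.items() for node in block}
    index_edges = set()
    for u, v in edges:
        # One index edge per pair of classes linked by at least one data edge.
        index_edges.add((index_of[u], index_of[v]))
    return ext, index_edges


EDGES = [(1, 2), (1, 3), (2, 4), (3, 5), (4, 6), (5, 7)]
A0 = [[1], [2], [3], [4, 5], [6, 7]]       # assumed 0-bisimulation
A1 = [[1], [2], [3], [4], [5], [6, 7]]     # assumed 1-bisimulation

ext0, edges0 = build_index_graph(A0, EDGES)
ext1, edges1 = build_index_graph(A1, EDGES)
print(len(ext0), len(edges0))
print(len(ext1), len(edges1))
```

Using a set for the index edges replaces the explicit "if there is no edge" test in the pseudocode.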

Query Evaluation Schemes
- Index is queried using regular path expressions.
- Path expressions are of the form P = Root.R
- Query evaluation techniques:
  - Forward evaluation
  - Backward evaluation

Query Evaluation Techniques
- Forward evaluation strategy
  - Simulation of an NFA on the graph
  - Index graph traversed breadth-first, making the corresponding transitions
- Backward evaluation strategy
  - Find nodes bearing the final labels in R
  - R is evaluated in reverse from these nodes
  - Intuition: the end of the expression is more selective than the earlier parts, so processing is cheaper
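Forward evaluation can be sketched for the simplest case, a path expression that is a plain sequence of labels starting at the root, where the NFA is a single chain of states. The function name `forward_eval` and the hard-coded index graph are assumptions for illustration; the index graph is the assumed A(0) index of a seven-node graph in which data node 4 (D) is a child of 2 (B) and data node 5 (D) is a child of 3 (C).

```python
def forward_eval(label_path, index_label, index_edges, ext, root):
    """Evaluate a simple label path breadth-first on an index graph.

    Returns the union of the extents of the index nodes reached.
    """
    children = {}
    for a, b in index_edges:
        children.setdefault(a, set()).add(b)
    if not label_path or index_label[root] != label_path[0]:
        return set()
    frontier = {root}
    for wanted in label_path[1:]:
        # One NFA transition: follow edges into nodes with the wanted label.
        frontier = {c for n in frontier for c in children.get(n, ())
                    if index_label[c] == wanted}
    result = set()
    for n in frontier:
        result |= ext[n]  # add the extent, not the index node itself
    return result


# Assumed A(0) index graph: one index node per label.
INDEX_LABEL = {0: "A", 1: "B", 2: "C", 3: "D", 4: "E"}
EXT = {0: {1}, 1: {2}, 2: {3}, 3: {4, 5}, 4: {6, 7}}
INDEX_EDGES = {(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)}

print(forward_eval(["A", "B", "D"], INDEX_LABEL, INDEX_EDGES, EXT, 0))
```

For the path A.B.D the A(0) index returns both data nodes 4 and 5, although only node 4 has a B parent in the data graph: the answer is a superset of the true result because the path is longer than k = 0.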

Approximate Index Graphs
- While evaluating R on the index graph, we add the nodes in Ext[B] rather than B to the result set.
- The A(k) index is safe.
- The result set for R is a superset of the target set in the data graph.

Approximate Index Graphs
- When node B is accepted along a path of length <= k in the A(k) index graph, every node in Ext[B] must be in the target set of R.
- When an index node is accepted along a longer path, its data nodes are initially added to a maybe set M instead of the result set.
- Nodes in M are validated by reverse execution of the automaton on the data graph, beginning with each node in M.
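The validation step can be sketched for a simple label path: each candidate in the maybe set M is checked by walking parent edges of the data graph backwards through the labels. The names `matches_backward` and `validate` are hypothetical, the label path is assumed to be non-empty, and the data graph is the assumed seven-node example (1 A, 2 B, 3 C, 4 D, 5 D, 6 E, 7 E).

```python
def matches_backward(node, label_path, label, parents, root):
    """Check that the (non-empty) label path ends at `node` and starts at the root."""
    current = {node} if label[node] == label_path[-1] else set()
    for wanted in reversed(label_path[:-1]):
        # Step backwards along parent edges, keeping only matching labels.
        current = {p for n in current for p in parents.get(n, ())
                   if label[p] == wanted}
    return root in current


def validate(maybe, label_path, label, edges, root):
    """Return the nodes of the maybe set that really match on the data graph."""
    parents = {}
    for p, c in edges:
        parents.setdefault(c, set()).add(p)
    return {n for n in maybe
            if matches_backward(n, label_path, label, parents, root)}


LABEL = {1: "A", 2: "B", 3: "C", 4: "D", 5: "D", 6: "E", 7: "E"}
EDGES = [(1, 2), (1, 3), (2, 4), (3, 5), (4, 6), (5, 7)]

# An A(0) index would report {4, 5} for A.B.D; validation keeps the true matches.
print(validate({4, 5}, ["A", "B", "D"], LABEL, EDGES, root=1))
```

Only candidates reached through paths longer than k need this check; nodes accepted along shorter paths can go straight into the result set.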

Implementation – Data Structures
- Data graph representation
- Element_HT
  - Hashtable of (Node.ID, Element) pairs
- Attribute_HT
  - Hashtable of (Node.ID -1, IDREF Attribute) pairs
  - The key of this hashtable is the Node.ID of the element of this attribute.

Implementation
- Index tree
  - Eq.Class_HT: hashtable of (Eq.Class.ID, vector of Node.IDs in that Eq.Class) pairs
  - Generated from Compute_K_Bisim
- Link table
  - Linktable_HT: hashtable of (Eq.Class.ID, vector of child Eq.Class.IDs) pairs
  - Generated from Compute_A(K)_index

Sample Results
[Figure: size of index graph vs. k]

Summary
- Generalization of the 1-index
- Value of k and tradeoff between the size of the index graph and accuracy
- Small values of k perform better than the 1-index
- Future scope
  - Use in schema extraction and query optimization

References
- Exploiting Local Similarity for Indexing Paths in Graph-Structured Data [Raghav Kaushik, Ehud Gudes et al.]
- Index Structures for Path Expressions [Milo, Suciu 1999]

THANK YOU