
Web Data Management Indexes

In this lecture
• Indexes
 – XSet
 – Region algebras
 – Indexes for Arbitrary Semistructured Data
 – Dataguides
 – T-indexes
 – Index Fabric
Resources
• Index Structures for Path Expressions by Milo and Suciu, in ICDT'99
• XSet description: http://www.openhealth.org/XSet/
• Data on the Web, Abiteboul, Buneman, Suciu: section 8.2

The problem
• Input: large, irregular data graph
• Output: index structure for evaluating regular path expressions

The data
Semistructured data instance = a large graph

The queries
Regular expressions (using Lorel-like syntax):
SELECT X FROM (Bib.*.author).(lastname|firstname).Abiteboul X
Select x from part._*.supplier.name x
• Requires traversing the data from the root and returning all nodes x reachable by a path matching the given path expression.
Select X From part._*.supplier: {name: X, address: “Philadelphia”}
• Needs an index on values to narrow the search to the parts of the database that contain the string “Philadelphia”.

Analyzing the problem
• what kind of data
 – tree data (XML): easier to index
 – graph data: used in more complex applications
• what kind of queries
 – restricted regular expressions (e.g. XPath): may be more efficient
 – arbitrary regular expressions: rarely encountered in practice

XSet: a simple index for XML
• Part of the Ninja project at Berkeley
• Example XML data:

XSet: a simple index for XML
• Each node = a hashtable
• Each entry = list of pointers to data nodes (not shown)

XSet: Efficient query evaluation
(R1) SELECT X FROM part.name X
(R2) SELECT X FROM part.supplier.name X
(R3) SELECT X FROM *.supplier.name X
(R4) SELECT X FROM part.*.subpart.name X
-yes -maybe
• To evaluate R1, look for part in the root hash table h1, follow the link to table h2, then look for name.
• R4 – following part leads to h2; traverse all nodes in the index below h2 (corresponding to *), then continue with the path subpart.name.
 – Thus, explore the entire subtree dominated by h2.
 – Will be efficient if the index is small and fits in memory.
• R3 – the leading wild card forces us to consider all nodes in the index tree, resulting in less efficient computation than for R4.
 – Can index the index itself.
 – Retrieve all hash tables that contain a supplier entry, then continue a normal search from there.
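The evaluation strategy for R1 to R4 can be sketched as follows. This is a minimal illustration, not XSet's actual API: the class `IndexNode`, the helpers `add_path` and `evaluate`, and the pointer names used in the usage note are hypothetical. Each `IndexNode` plays the role of one hash table (h1, h2, ...), and `*` is handled by visiting the whole index subtree.

```python
from dataclasses import dataclass, field


@dataclass
class IndexNode:
    """One hash table of the index (hypothetical layout)."""
    children: dict = field(default_factory=dict)   # label -> IndexNode
    pointers: list = field(default_factory=list)   # ids of data nodes


def add_path(root, labels, pointer):
    """Register a data node reachable from the root by the given label path."""
    node = root
    for label in labels:
        node = node.children.setdefault(label, IndexNode())
    node.pointers.append(pointer)


def descendants_or_self(node):
    """Yield node and every index node below it (what '*' has to visit)."""
    yield node
    for child in node.children.values():
        yield from descendants_or_self(child)


def evaluate(root, path):
    """Evaluate a dotted path such as 'part.*.subpart.name' on the index."""
    frontier = [root]
    for step in path.split("."):
        if step == "*":
            # '*' matches any label sequence: expand to whole subtrees,
            # de-duplicating by object identity.
            seen = {}
            for node in frontier:
                for d in descendants_or_self(node):
                    seen[id(d)] = d
            frontier = list(seen.values())
        else:
            # A constant label is a single hash-table lookup per node.
            frontier = [n.children[step] for n in frontier if step in n.children]
    result = []
    for node in frontier:
        result.extend(node.pointers)
    return result
```

For example, after `add_path(root, ["part", "supplier", "name"], "n2")`, both `evaluate(root, "part.supplier.name")` and `evaluate(root, "*.supplier.name")` return `["n2"]`, but the second visits every table in the index before doing the `supplier` lookup.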

Region Algebras
• structured text = text with tags (like XML)
• powerful indexing techniques [Baeza-Yates, Gonnet, Navarro, Salminen, Tompa, etc.]
• New Oxford English Dictionary
• critical limitation: ordered data only (like text)
 – Assume: data given as an XML text file, with the implicit ordering in the file.
• less critical limitation: restricted regular expressions

Region Algebras: Definitions
• data = sequence of characters $[c_1 c_2 c_3 \ldots]$
• region = segment of the text in a file
 – representation $(x, y) = [c_x, c_{x+1}, \ldots, c_y]$, x – start position, y – end position of the region
 – example: <section> … </section>
• region set = a set of regions s.t. any two regions are either disjoint or one is included in the other
 – example: all <section> regions (may be nested)
 – Tree data – each node defines a region and each set of nodes defines a region set.
 – example: region p2 consisting of text under p2, set {p 2, s 1} is a region set with three regions

Representation of a region set
• Example: the <subpart> region set:
• region algebra = operators on region sets, s1 op s2 defines a new region set

Region algebra: some operators
• s1 intersect s2 = $\{r \mid r \in s_1,\ r \in s_2\}$
• s1 included s2 = $\{r \mid r \in s_1,\ \exists r' \in s_2,\ r \subseteq r'\}$
• s1 including s2 = $\{r \mid r \in s_1,\ \exists r' \in s_2,\ r \supseteq r'\}$
• s1 parent s2 = $\{r \mid r \in s_1,\ \exists r' \in s_2,\ r \text{ is a parent of } r'\}$
• s1 child s2 = $\{r \mid r \in s_1,\ \exists r' \in s_2,\ r \text{ is a child of } r'\}$
Examples:
<subpart> included <part> = {s1, s2, s3, s5}
<part> including <subpart> = {p2, p3}
<name> child <part> = {n1, n3, n12}
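The operator definitions can be written down almost literally as set comprehensions. The sketch below assumes regions are `(start, end)` tuples and region sets are Python sets; the `universe` argument (the set of all regions in the document, needed to decide what counts as a direct parent) and all function names are assumptions made for illustration. It favors clarity over speed.

```python
def includes(outer, inner):
    """True if region `inner` lies inside region `outer` (non-strict)."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def intersect(s1, s2):
    return {r for r in s1 if r in s2}


def included(s1, s2):
    return {r for r in s1 if any(includes(r2, r) for r2 in s2)}


def including(s1, s2):
    return {r for r in s1 if any(includes(r, r2) for r2 in s2)}


def is_parent(r, r2, universe):
    """r is the parent of r2: r strictly contains r2 and no other region
    of the document sits between them."""
    if r == r2 or not includes(r, r2):
        return False
    return not any(
        m not in (r, r2) and includes(r, m) and includes(m, r2)
        for m in universe
    )


def parent(s1, s2, universe):
    return {r for r in s1 if any(is_parent(r, r2, universe) for r2 in s2)}


def child(s1, s2, universe):
    return {r for r in s1 if any(is_parent(r2, r, universe) for r2 in s2)}
```

With a made-up document in which a part region `(0, 100)` contains a name `(1, 10)` and a subpart `(11, 60)` that itself contains a name `(12, 20)`, `child(names, parts, universe)` returns only `(1, 10)`, while `included(names, parts)` returns both names.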

Efficient computation of Region Algebra Operators
Example: s1 included s2
$s_1 = \{(x_1, x_1'), (x_2, x_2'), \ldots\}$
$s_2 = \{(y_1, y_1'), (y_2, y_2'), \ldots\}$
(i.e. assume each consists of disjoint regions, listed in increasing order of position)
Algorithm:
if $x_i < y_j$ then i := i + 1
else if $x_i' > y_j'$ then j := j + 1
otherwise: print $(x_i, x_i')$, do i := i + 1
Can do in sub-linear time when one region set is very small
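The three-way case analysis above is a merge over two sorted lists. A minimal sketch, assuming each region set is a Python list of disjoint `(start, end)` tuples sorted by start position (the function name `included_merge` is hypothetical):

```python
def included_merge(s1, s2):
    """Return the regions of s1 that lie inside some region of s2.

    Both inputs must be lists of disjoint (start, end) regions sorted by
    start position; the scan is linear in len(s1) + len(s2).
    """
    i = j = 0
    out = []
    while i < len(s1) and j < len(s2):
        x, x_end = s1[i]
        y, y_end = s2[j]
        if x < y:
            # s1[i] starts before s2[j], so it cannot be inside s2[j]
            # or inside any later region of s2.
            i += 1
        elif x_end > y_end:
            # s1[i] reaches past the end of s2[j]; try the next s2 region.
            j += 1
        else:
            # y <= x and x_end <= y_end: s1[i] is included in s2[j].
            out.append(s1[i])
            i += 1
    return out
```

This version is linear. The sub-linear behaviour mentioned on the slide would need random access into the larger list, for instance a binary search with Python's `bisect` module for each region of the smaller list, which is not shown here.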

From path expressions to region expressions
• Use region algebra operators to answer regular path expressions:
 part.name → name child (part child root)
 part.supplier.name → name child (supplier child (part child root))
 *.supplier.name → name child supplier
 part.*.subpart.name → name child (subpart included (part child root))
• Only restricted forms of regular path expressions can be translated into region algebra operators
 – expressions of the form R1.R2…Rn, where each Ri is either a label constant or the Kleene closure *.
• Region expressions correspond to simple XPath expressions
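The four translations follow one pattern: a label that directly follows the previous step becomes `child`, and a label that follows `*` becomes `included` (or stands alone when the `*` is leading, since every region lies under the root). The sketch below mirrors only those four cases; the function names are hypothetical and a trailing `*` is deliberately rejected.

```python
def wrap(expr):
    """Parenthesize compound expressions, leave single names bare."""
    return f"({expr})" if " " in expr else expr


def to_region_expr(path):
    """Translate R1.R2...Rn (labels or '*') into a region expression string."""
    labels = path.split(".")
    if labels[-1] == "*":
        raise ValueError("a trailing '*' is not handled in this sketch")
    expr = "root"
    after_star = False
    for label in labels:
        if label == "*":
            after_star = True
            continue
        if after_star:
            # Below the root means anywhere, so a leading '*' adds no constraint.
            expr = label if expr == "root" else f"{label} included {wrap(expr)}"
        else:
            expr = f"{label} child {wrap(expr)}"
        after_star = False
    return expr
```

For instance, `to_region_expr("part.*.subpart.name")` yields `name child (subpart included (part child root))`.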

From path expressions to region expressions
• Answering more complex queries:
 Select X From *.subpart: {name: X, *.supplier.address: “Philadelphia”}
• Translates into the following region algebra expression:
 name child (subpart including (supplier parent (address intersect “Philadelphia”)))
• “Philadelphia” denotes a region set consisting of all regions corresponding to the word “Philadelphia” in the text.
• Such a region set can be computed dynamically using a full text index.
• Region expressions correspond to simple XPath expressions

Indexes for Arbitrary Semistructured Data
• A semistructured data instance that is a DAG

Indexes for Arbitrary Semistructured Data
• The data represents employees and projects in a company.
• Two kinds of employees – programmers and statisticians
• Three kinds of links to projects – leads, workson, consults
• Index graph – reduced graph that summarizes all paths from root in the data graph
• Example: node p1 – paths from root to p1 are labeled with the following five sequences:
 project
 employee.leads
 employee.workson
 programmer.employee.leads
 programmer.employee.workson
• Node p2 – paths from root to p2 are labeled by the same five sequences
• p1 and p2 are language-equivalent

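Indexes for Arbitrary Semistructured Data
• For each node x in the data graph, $L_x = \{w \mid \exists \text{ a path from the root to } x \text{ labeled } w\}$
• $\forall x, y:\ x \equiv y \iff L_x = L_y$
• $[x] = \{y \mid x \equiv y\}$
• Index I:
 $\mathrm{Nodes}(I) = \{[x] \mid x \in \mathrm{nodes}(G)\}$
 $\mathrm{Edges}(I) = \{[x] \to [y] \mid x' \in [x],\ y' \in [y],\ x' \to y'\}$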
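For a DAG every $L_x$ is finite, so the equivalence classes $[x]$ can be computed by brute force. The sketch below assumes the data graph is given as a list of `(source, label, target)` triples, which is an illustrative encoding rather than anything prescribed by the lecture; the function name is hypothetical. It enumerates whole languages, so it is only practical on small graphs.

```python
from collections import defaultdict
from functools import lru_cache


def language_equivalence_classes(edges, root):
    """Group the nodes of a DAG by L_x, the set of label paths from the root."""
    incoming = defaultdict(list)
    nodes = {root}
    for src, label, dst in edges:
        incoming[dst].append((src, label))
        nodes.update((src, dst))

    @lru_cache(maxsize=None)
    def language(x):
        # The root is reached by the empty word; every other word extends
        # a word of a predecessor by the label of the connecting edge.
        words = {()} if x == root else set()
        for src, label in incoming[x]:
            words |= {w + (label,) for w in language(src)}
        return frozenset(words)

    classes = defaultdict(set)
    for x in nodes:
        classes[language(x)].add(x)
    return list(classes.values())
```

On a toy graph where `e1` and `e2` are both reached by `employee`, and `p1` and `p2` are both reached by `project` and by `employee.leads`, the function returns the classes `{e1, e2}` and `{p1, p2}` alongside the singleton class of the root.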

Indexes for Arbitrary Semistructured Data
• We have the following equivalences: e1 e2 e3 e4 e5 p1 p2 p3 p4 p5 p6 p7

Indexes for Arbitrary Semistructured Data
• Computing path expression queries
 – Compute query on I and obtain set of index nodes
 – Compute union of all extents
 Select X From statistician.employee.(leads|consults): X
• Returns nodes h8, h9. Their extents are [p5, p6, p7] and [p8], respectively; result set = [p5, p6, p7, p8]
• Always: size(I) $\le$ size(G)
• Efficient when I can be stored in main memory
• Checking $x \equiv y$ is expensive.

Indexes for Arbitrary Semistructured Data
• Use bisimulation $\approx_b$ instead of $\equiv$
• Fact: $\forall x, y:\ x \approx_b y \Rightarrow x \equiv y$
• Use the same construction, but [x] now refers to $\approx_b$ instead of $\equiv$.
Bisimulation: Let DB be a data graph. A relation $\approx$ is a bisimulation on the reversed graph (i.e. all edges have their direction reversed) if the following conditions hold:
1. If $x \approx y$ and x is a root, then so is y.
2. Conversely, if $x \approx y$ and y is a root, then so is x.
3. If $x \approx y$, then for any edge $x' \to x$ there exists an edge $y' \to y$ with the same label, s.t. $x' \approx y'$.
4. Conversely, if $x \approx y$, then for any edge $y' \to y$ there exists an edge $x' \to x$ with the same label, s.t. $x' \approx y'$.
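One simple way to obtain the coarsest such relation is naive partition refinement: start from the split root / non-root (conditions 1 and 2) and keep splitting blocks whose members disagree on the labelled blocks of their predecessors (conditions 3 and 4). The sketch below assumes the same `(source, label, target)` edge-list encoding as an illustration; the function name is hypothetical, and faster refinement algorithms exist but are not shown.

```python
from collections import defaultdict


def backward_bisimulation_index(edges, root):
    """Return (extents, index_edges) of the index built from bisimulation
    on the reversed graph, computed by naive partition refinement."""
    incoming = defaultdict(set)
    nodes = {root}
    for src, label, dst in edges:
        incoming[dst].add((src, label))
        nodes.update((src, dst))

    # Conditions 1 and 2: the root may only be equivalent to a root.
    block = {x: int(x == root) for x in nodes}
    while True:
        ids = {}
        new_block = {}
        for x in nodes:
            # Conditions 3 and 4: compare the (label, block) pairs of the
            # predecessors; keeping block[x] makes each round a refinement.
            sig = (block[x],
                   frozenset((label, block[src]) for src, label in incoming[x]))
            new_block[x] = ids.setdefault(sig, len(ids))
        if len(ids) == len(set(block.values())):
            break  # no block was split: the partition is stable
        block = new_block

    extents = defaultdict(set)
    for x, b in block.items():
        extents[b].add(x)
    index_edges = {(block[s], label, block[d]) for s, label, d in edges}
    return dict(extents), index_edges
```

A path query is then evaluated on `index_edges` and answered with the union of the extents of the index nodes it reaches. Each refinement round costs time proportional to the number of edges, in contrast to comparing the full languages $L_x$.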

DataGuides
• Goldman & Widom [VLDB 97]
 – graph data
 – arbitrary regular expressions

DataGuides
Definition: given a semistructured data instance DB, a DataGuide for DB is a graph G s.t.:
 - every path in DB also occurs in G
 - every path in G occurs in DB
 - every path in G is unique

DataGuides
Example:

DataGuides
• Multiple DataGuides for the same data:

DataGuides
Definition: Let w, w′ be two words (i.e. word queries) and G a graph. $w \equiv_G w'$ if w(G) = w′(G)
Definition: G is a strong dataguide for a database DB if $\equiv_G$ is the same as $\equiv_{DB}$

DataGuides
Example:
• G1 is a strong dataguide
• G2 is not strong
 person.project $\equiv_{DB}$ dept.project
 person.project $\not\equiv_{G2}$ dept.project

DataGuides
• Constructing the strong DataGuide G:
 Nodes(G) = {{root}}
 Edges(G) = ∅
 while changes do
  choose s in Nodes(G), a in Labels
  s' = {y | x in s, (x -a-> y) in Edges(DB)}
  if s' is not empty:
   add s' to Nodes(G)
   add (s -a-> s') to Edges(G)
• Use hash table for Nodes(G)
• This is precisely the powerset automaton construction.
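The construction can be sketched with a worklist in place of the "while changes" loop. As an illustration, the database is assumed to be a list of `(source, label, target)` triples; each DataGuide node is a `frozenset` of data nodes (its extent), and a Python `set` of frozensets plays the role of the hash table for Nodes(G). The function name is hypothetical.

```python
from collections import defaultdict, deque


def strong_dataguide(edges, root):
    """Build the strong DataGuide by the powerset (subset) construction."""
    out = defaultdict(lambda: defaultdict(set))   # out[x][label] -> targets
    for src, label, dst in edges:
        out[src][label].add(dst)

    start = frozenset([root])
    nodes = {start}            # hash table of extents seen so far
    guide_edges = set()
    queue = deque([start])
    while queue:
        s = queue.popleft()
        # Collect, per label a, the set s' of all a-successors of nodes in s.
        targets = defaultdict(set)
        for x in s:
            for label, ys in out[x].items():
                targets[label] |= ys
        for label, ys in targets.items():
            s2 = frozenset(ys)
            guide_edges.add((s, label, s2))
            if s2 not in nodes:
                nodes.add(s2)
                queue.append(s2)
    return nodes, guide_edges
```

Because every DataGuide node is a subset of the data nodes, the loop always terminates, including on cyclic data, but the number of subsets reached can in the worst case be exponential in the size of DB.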

DataGuides
• How large are the dataguides?
 – if DB is a tree, then size(G) <= size(DB)
  • why? answer: every node is in exactly one extent of G
  • here: dataguide = XSet
 – How many nodes does the strong dataguide have for this DB? 20 nodes (least common multiple of 4 and 5)
• Dataguides usually fail on data with cyclic schemas, like: