Efficient Processing of XML Path Queries Using the Disk-based F&B Index
Wei Wang, University of New South Wales, Australia
With Hongzhi Wang (HIT), Hongjun Lu (HKUST), Haifeng Jiang (IBM), Xuemin Lin (UNSW), Jianzhong Li (HIT)
VLDB 2005

XML Query Processing
• XML
  - Modeled as a labeled tree
• Query by structural constraint
  - Simple Path Queries, e.g., //Customer//Name
  - Branching/Twig Queries, e.g., //Customer[//Zipcode]//Name
1/12/2022 VLDB 2005 2

Index or Join?
Q1: /a/b
• Index-based approaches
  - DataGuide, 1-index
  - F&B Index
  - and a few approximate indexes
• Join-based approaches
  - Structural join
  - Twig join
Join-based approaches appear to be more actively researched!
[Figure: a/b node diagrams illustrating Q1]

Outline
• Introduction
• Disk-based F&B Index
• Experiment
• Conclusions

XML Structural Indexes
• “Exact” Indexes
  - 1-index
    · Based on backward bisimilarity
    · Covers all simple path queries
  - F&B Index
    · Based on backward and forward bisimilarity
    · Covers all branching queries (optimally)
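
The two partitions named on this slide can be illustrated with naive partition refinement. This is only a sketch under stated assumptions, not the construction used in the paper: the tree `NODES`, its node IDs, and the helper names `refine` and `extents` are hypothetical. Nodes start out grouped by label and are split repeatedly by the block of their parent (backward) and, optionally, the set of blocks of their children (forward) until nothing changes.

```python
from collections import defaultdict

# Hypothetical document tree: node id -> (label, parent id or None).
NODES = {
    0: ("a", None),
    1: ("b", 0), 2: ("b", 0), 3: ("b", 0),
    4: ("c", 1), 5: ("d", 1),   # first b has children c and d
    6: ("d", 2),                # second b has child d only
    7: ("c", 3),                # third b has child c only
}

def children_of(nodes):
    """Map each node id to the list of its child ids."""
    kids = defaultdict(list)
    for n, (_, parent) in nodes.items():
        if parent is not None:
            kids[parent].append(n)
    return kids

def refine(nodes, backward=True, forward=False):
    """Naive partition refinement: start from labels, split until stable."""
    kids = children_of(nodes)
    block = {n: label for n, (label, _) in nodes.items()}
    while True:
        sig = {}
        for n, (_, parent) in nodes.items():
            up = block.get(parent) if backward else None
            down = frozenset(block[k] for k in kids[n]) if forward else None
            sig[n] = (block[n], up, down)
        # Renumber distinct signatures to small integers.
        ids = {}
        new_block = {n: ids.setdefault(s, len(ids)) for n, s in sig.items()}
        # Each round only splits blocks, so an equal count means a fixpoint.
        if len(set(new_block.values())) == len(set(block.values())):
            return new_block
        block = new_block

def extents(block):
    """Group node ids by block, i.e. the extent of each index node."""
    groups = defaultdict(list)
    for n, b in block.items():
        groups[b].append(n)
    return sorted(groups.values())

one_index = extents(refine(NODES, backward=True, forward=False))
fb_index = extents(refine(NODES, backward=True, forward=True))
```

On this toy tree the backward-only partition keeps the three b nodes in one extent, while the backward-and-forward partition separates every node, which hints at why an F&B index can grow large.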

A Running Example
Q1: /a/b
Q2: /a/b[d]
Q3: /a/b[c][d]
extent {b, b, b}

Problems with F&B Index?
• Lack of scalability
  - Usually large in practice
  - No immediate solution when it cannot be accommodated in memory
    · Unbalanced, all-leaf-nodes tree
    · Naïve solutions (e.g., B+-tree, pre-order clustering in Lore, subtree clustering in Natix) do not work well
• Lack of efficiency
  - Non-deterministic searching
  - //-axis requires traversing the whole subtrees
  - Much more costly when the index is not in memory

Disk-based F&B Index
• Overcome the memory limit by storing the F&B index on disk
• Naïve method does not work well
Q1: /a/b

Basic Idea
• Moral: Clustering is important
  1. Cluster by tag → tape
  2. Cluster by parent → segment & block
  3. Cluster by 1-index ID → chunk
  - Benefits:
    · Optimized tree traversals
    · Enable other intelligent algorithms
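
One way to picture the three clustering criteria is as a composite sort key. The sketch below is an assumption-heavy illustration: it supposes the criteria nest in the order listed (tag, then parent, then 1-index ID), which the slide does not confirm, and the records in `INDEX_NODES` and the names `cluster_key` and `layout` are hypothetical.

```python
from itertools import groupby

# Hypothetical index nodes: (node_id, tag, parent_id, one_index_id).
INDEX_NODES = [
    (5, "c", 2, 3),
    (1, "b", 0, 1),
    (4, "c", 1, 2),
    (2, "b", 0, 1),
    (0, "a", None, 0),
    (3, "d", 1, 4),
]

def cluster_key(node):
    """Sort by tag first, then parent, then 1-index ID (assumed nesting)."""
    _, tag, parent, one_index_id = node
    # The root has no parent; map None to -1 so it sorts first.
    return (tag, -1 if parent is None else parent, one_index_id)

def layout(nodes):
    """Return {tag: [node ids in clustered order]}, one list per tape."""
    ordered = sorted(nodes, key=cluster_key)
    return {tag: [n[0] for n in group]
            for tag, group in groupby(ordered, key=lambda n: n[1])}

tapes = layout(INDEX_NODES)
```

With this ordering, nodes that share a tag sit together, and within a tag the children of one parent are adjacent, so a child-axis step touches a contiguous run rather than scattered positions.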

Q1: /a/b

Query Processing (Q.P.) by Tree Traversal
• Dim 1: DFS/BFS
• Dim 2: Path/Branching Path
• Dim 3: / or //
Q5: /a/b/c
Q2: /a/b[d]
Q4: /a//c
Problem: Still have to traverse the entire subtrees to process //
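
The cost difference between / and // can be seen in a small DFS-based evaluator for simple paths. This is a minimal in-memory sketch, not the disk-based algorithm from the talk; the tree `TREE` and the functions `descendants` and `eval_path` are hypothetical, and branching predicates such as [d] are left out.

```python
# Hypothetical index tree: node id -> (label, list of child ids).
TREE = {
    0: ("a", [1, 2]),
    1: ("b", [3, 4]),
    2: ("b", [5]),
    3: ("c", []),
    4: ("d", []),
    5: ("d", [6]),
    6: ("c", []),
}

def descendants(tree, node):
    """Yield every proper descendant of node in DFS pre-order."""
    stack = list(reversed(tree[node][1]))
    while stack:
        n = stack.pop()
        yield n
        stack.extend(reversed(tree[n][1]))

def eval_path(tree, root, steps):
    """Evaluate a simple path given as [(axis, label), ...], axis '/' or '//'."""
    context = None  # None stands for the virtual document node above root
    for axis, label in steps:
        matched, seen = [], set()
        sources = [None] if context is None else context
        for src in sources:
            if src is None:
                cands = [root] if axis == "/" else [root, *descendants(tree, root)]
            elif axis == "/":
                cands = tree[src][1]            # children only
            else:
                cands = descendants(tree, src)  # walks the whole subtree
            for n in cands:
                if tree[n][0] == label and n not in seen:
                    seen.add(n)
                    matched.append(n)
        context = matched
    return context

q5 = eval_path(TREE, 0, [("/", "a"), ("/", "b"), ("/", "c")])  # /a/b/c
q4 = eval_path(TREE, 0, [("/", "a"), ("//", "c")])             # /a//c
```

The `/` branch inspects only the child list of each context node, whereas the `//` branch enumerates the full subtree, which is the problem the slide points out.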

Q.P. by RangeFetch
• H(1, c) = [3, 6], where H is keyed by (chunkID, tagName)
Q4: /a//c
Restriction: Can only answer /p//q, where p is a simple path.
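
The lookup H(1, c) = [3, 6] suggests a table from (chunkID, tagName) to a contiguous range. The sketch below assumes that the range is inclusive and that it addresses positions on the tape for that tag; neither detail is stated on the slide, and `TAPE`, `H`, and `range_fetch` are hypothetical names with made-up contents.

```python
# Hypothetical tape of index nodes for tag "c", addressed by position.
TAPE = {"c": ["c0", "c1", "c2", "c3", "c4", "c5", "c6", "c7"]}

# H maps (chunkID, tagName) to an inclusive position range (assumed),
# mirroring the slide's example H(1, c) = [3, 6].
H = {(1, "c"): (3, 6)}

def range_fetch(chunk_id, tag):
    """Return the tape entries for tag that belong to the given chunk."""
    if (chunk_id, tag) not in H:
        return []
    lo, hi = H[(chunk_id, tag)]
    return TAPE[tag][lo:hi + 1]

hits = range_fetch(1, "c")
```

Under these assumptions, a query such as /a//c needs one table lookup and one sequential read instead of a subtree traversal, at the price of only supporting the /p//q shape.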

More Data Structures
• 3 more tapes:
  - Add region code for each d-node in the extents (Extents Tape)
    · Use physical (start, end) codes
    · Sort d-nodes according to (start, end)
  - Add Doc Tape
  - Add Value Tape
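
Region codes can be produced by a single depth-first pass. The sketch below uses a running counter in place of the physical (start, end) positions mentioned on the slide, so it is a stand-in rather than the authors' encoding; `DOC` and `assign_regions` are hypothetical.

```python
# Hypothetical document tree: node name -> list of child names.
DOC = {"a": ["b1", "b2"], "b1": ["c", "d"], "b2": [], "c": [], "d": []}

def assign_regions(tree, root):
    """Assign (start, end) so that x is an ancestor of y exactly when
    x.start < y.start and y.end < x.end."""
    regions = {}
    counter = 0

    def visit(n):
        nonlocal counter
        counter += 1          # entering n
        start = counter
        for child in tree[n]:
            visit(child)
        counter += 1          # leaving n
        regions[n] = (start, counter)

    visit(root)
    return regions

regions = assign_regions(DOC, "a")
# Sorting by (start, end) puts the d-nodes in document order.
in_doc_order = sorted(regions, key=regions.get)
```

Because containment of intervals mirrors ancestry, two nodes can be compared structurally without touching the tree itself.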

Example

SegSJ
• Key observation:
  - Structural relationship between two segments can be inferred from the relationship between their first d-nodes in their extent.
  b1: (10, 78), (210, 297), …
  d1: (19, 25), (54, 66), …
• SegSJ(/p//q)
  - R(s, e) ← A = /p
  - S(s, e) ← D = //q
    (Take the (s, e) of the first d-node in each segment)
  - Structural join R and S
    · Using partition-based or sorting-based SJ algorithm
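
The final step, a structural join of R and S, can be sketched with a generic stack-based merge over (start, end) codes. This is a standard sort-based variant chosen for illustration, not necessarily the algorithm used in the paper; `structural_join` is a hypothetical name, and the inputs simply reuse the (start, end) pairs shown on the slide as sample data.

```python
def structural_join(ancestors, descendants):
    """Stack-based merge join over (start, end) region codes.

    Returns every (a, d) pair where region a strictly contains region d.
    Assumes all regions come from one properly nested document.
    """
    ancestors = sorted(ancestors)
    descendants = sorted(descendants)
    result, stack, i = [], [], 0
    for d in descendants:
        # Push every ancestor candidate that starts before d.
        while i < len(ancestors) and ancestors[i][0] < d[0]:
            a = ancestors[i]
            # Discard candidates that closed before a opened.
            while stack and stack[-1][1] < a[0]:
                stack.pop()
            stack.append(a)
            i += 1
        # Discard candidates that closed before d opened.
        while stack and stack[-1][1] < d[0]:
            stack.pop()
        # Everything left on the stack is nested around d.
        result.extend((a, d) for a in stack)
    return result

R = [(10, 78), (210, 297)]   # sample (s, e) pairs for A = /p
S = [(19, 25), (54, 66)]     # sample (s, e) pairs for D = //q
pairs = structural_join(R, S)
```

Each input is scanned once after sorting, so the join cost is driven by the number of segments rather than the number of d-nodes when only the first d-node of each segment is fed in.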

Experiments
• Setup
  - DBLP/XMark/TreeBank
  - 8 representative queries
    · Dim 1: PC/AD
    · Dim 2: Path/Twig
    · Dim 3: Large/Small
  - DFS, BFS, RangeFetch, SegSJ
  - NoK, TwigStack, Kaushik’s algorithm in [SIGMOD 04]
  - Metric: time/PIO/LIO

Varying Buffer Size (PC-Path)

Varying Buffer Size (PC-Twig)

Varying Buffer Size (AD-Path)

Varying Buffer Size (AD-Twig)

Buffer Hit Ratio

Scalability

Comparing with Other Systems

Conclusions
• Disk-based F&B Index
  - Store and cluster the index on the disk
  - More efficient and intelligent query processing algorithms
• Demonstrated good scalability and query efficiency
• Expecting new query processing algorithms based on index probing (in addition to join-based approaches)

Q&A
Thank You!

Related Work
• Indexes
  - Exact: DataGuide, 1-index, F&B Index
  - Approx: Approx. DataGuide, A(k)-index, D(k)-index, M*(k)-index
• Join-based approaches
• Hybrid approach: “mixed-mode” in [VLDB 03] Niagara
  - [VLDB 03] combines tree traversals + joins
  - [SIGMOD 04] uses 1-index to accelerate joins
• Clustering
  - Lore: pre-order
  - Natix: subtree