

Indexing Multidimensional Data
Rui Zhang
http://www.csse.unimelb.edu.au/~rui
The University of Melbourne
Aug 2006

Outline
  • Backgrounds
      • Multidimensional data and queries
  • Approaches
      • Mapping based indexing
          • Z-curve
          • iDistance
      • Hierarchical-tree based indexing
          • R-tree
          • k-d-tree
          • Quad-tree
      • Compression based indexing
          • VA-file

Multidimensional Data
  • Spatial data (low-dimensionality)
      • Computer Aided Design: width and height (40, 50)
        Any part that has a width of 40 and height of 50?
      • Geographic Information: Melbourne (37, 145)
        Which city is at (30, 140)?
  • Records with multiple attributes (medium-dimensionality)
      • Employee (ID, age, score, salary, …)
        Is there any employee whose age is under 25 and performance score is greater than 80 and salary is between 3000 and 5000?
  • Multimedia data (high-dimensionality)
      • Multimedia features: color, shape, texture
      • Color histograms of images
        Give me the most similar image to [query image]

Multidimensional Queries
  • Point query
      • Return the objects located at Q(x1, x2, …, xd).
      • E.g. Q = (3.4, 6.6).
  • Window query
      • Return all the objects enclosed or intersected by the hyper-rectangle W{[L1, U1], [L2, U2], …, [Ld, Ud]}.
      • E.g. W = {[0, 4], [2, 5]}
  • K-Nearest Neighbor Query (KNN Query)
      • Return k objects whose distances to Q are no larger than any other object's distance to Q.
      • E.g. 3NN of Q = (4, 1)
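The three query types can be sketched as brute-force scans, which is the baseline that every index in this deck tries to beat. The coordinates below are the ten sample buildings from the mapping slides; the function names are illustrative assumptions, not part of the original slides.

```python
import math

# Sample buildings (name -> (x, y)) taken from the deck's building table.
BUILDINGS = {
    "A": (0.7, 1.2), "B": (5.8, 1.2), "C": (2.7, 2.3), "D": (5.5, 2.4),
    "E": (6.6, 2.5), "F": (1.7, 3.8), "G": (2.8, 4.7), "H": (0.6, 5.8),
    "I": (1.6, 6.7), "J": (3.4, 6.6),
}

def point_query(points, q):
    """Return the names of all objects located exactly at q."""
    return [name for name, p in points.items() if p == q]

def window_query(points, window):
    """Return the names of all objects inside the hyper-rectangle.

    window is a list of (low, high) pairs, one per dimension.
    """
    return [
        name for name, p in points.items()
        if all(lo <= c <= hi for c, (lo, hi) in zip(p, window))
    ]

def knn_query(points, q, k):
    """Return the names of the k objects closest to q (Euclidean distance)."""
    return sorted(points, key=lambda name: math.dist(points[name], q))[:k]
```

With the slide's example queries, `point_query(BUILDINGS, (3.4, 6.6))`, `window_query(BUILDINGS, [(0, 4), (2, 5)])` and `knn_query(BUILDINGS, (4, 1), 3)` each touch all ten objects; the indexes that follow aim to answer the same queries while reading far fewer.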

Mapping Based Multidimensional Indexing

Buildings sorted by block number:

Name   x     y     Block   Height
A      0.7   1.2   2       100
F      1.7   3.8   11      120
C      2.7   2.3   12      80
B      5.8   1.2   19      50
D      5.5   2.4   25      90
E      6.6   2.5   28      40
H      0.6   5.8   34      50
G      2.8   4.7   36      100
I      1.6   6.7   41      60
J      3.4   6.6   45      40

  • Story
      • The CBD: [0, 4][2, 5]
      • Blocks in the CBD are: [8, 15], [32, 33] and [36, 37]
  • General strategy: three steps
      • Data mapping and indexing
      • Query mapping and data retrieval
      • Filtering out false positives
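The three steps can be sketched with a sorted list standing in for the one-dimensional index (a B+-tree in a real system). The block numbers and the CBD block ranges are the ones given on the slide; the sorted list, the `bisect` lookups and the function names are assumptions made for illustration.

```python
from bisect import bisect_left, bisect_right

# Step 1: data mapping and indexing.
# (block, name, x, y) rows from the slide, kept sorted by block number.
ROWS = sorted([
    (2, "A", 0.7, 1.2), (19, "B", 5.8, 1.2), (12, "C", 2.7, 2.3),
    (25, "D", 5.5, 2.4), (28, "E", 6.6, 2.5), (11, "F", 1.7, 3.8),
    (36, "G", 2.8, 4.7), (34, "H", 0.6, 5.8), (41, "I", 1.6, 6.7),
    (45, "J", 3.4, 6.6),
])
BLOCKS = [row[0] for row in ROWS]

def range_scan(lo, hi):
    """Return all rows whose block number lies in [lo, hi]."""
    return ROWS[bisect_left(BLOCKS, lo):bisect_right(BLOCKS, hi)]

def window_query(block_ranges, window):
    """Answer a window query given the block ranges the window maps to."""
    (x_lo, x_hi), (y_lo, y_hi) = window
    hits = []
    # Step 2: query mapping and data retrieval (one range scan per interval).
    for lo, hi in block_ranges:
        for _block, name, x, y in range_scan(lo, hi):
            # Step 3: filter out false positives with the exact coordinates.
            if x_lo <= x <= x_hi and y_lo <= y <= y_hi:
                hits.append(name)
    return hits

CBD_RANGES = [(8, 15), (32, 33), (36, 37)]
print(window_query(CBD_RANGES, ((0, 4), (2, 5))))
```

The filtering step matters when the window does not line up with block boundaries: a narrower window such as [0, 2.5][2, 5] maps to the same blocks, but only some of the retrieved buildings actually lie inside it.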

The Z-curve and Other Space-Filling Curves
  • The Z-curve
      • Z-value calculation: bit-interleaving
      • Support efficient window queries
      • Disadvantage: Jumps
  • Other space-filling curves
      • Hilbert-curves
      • Gray-code
      • Column-wise scan
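Bit-interleaving can be sketched as follows. The conventions here are assumptions inferred from the block numbers in the building table rather than stated on the slides: unit-sized grid cells, 3 bits per dimension, and the y bit placed above the x bit in each pair. Under these assumptions the function reproduces the table's block numbers (for example, (5.8, 1.2) falls in cell (5, 1) and gets block 19).

```python
def z_value(cx, cy, bits=3):
    """Interleave the bits of cell coordinates (cx, cy) into one Z-value.

    Bit i of cx goes to position 2*i, bit i of cy to position 2*i + 1.
    """
    z = 0
    for i in range(bits):
        z |= ((cx >> i) & 1) << (2 * i)
        z |= ((cy >> i) & 1) << (2 * i + 1)
    return z

def block_of(x, y, bits=3):
    """Block number of a point, assuming unit grid cells starting at 0."""
    return z_value(int(x), int(y), bits)

def window_ranges(x_cells, y_cells, bits=3):
    """Map a window, given as the cells it covers, to merged Z-value ranges."""
    zs = sorted(z_value(cx, cy, bits) for cx in x_cells for cy in y_cells)
    ranges = []
    for z in zs:
        if ranges and z == ranges[-1][1] + 1:
            ranges[-1][1] = z          # extend the current run
        else:
            ranges.append([z, z])      # start a new run
    return [tuple(r) for r in ranges]

# The CBD window [0, 4][2, 5] covers cells x = 0..3 and y = 2..4.
print(window_ranges(range(0, 4), range(2, 5)))
```

The "jumps" disadvantage shows up in the output: a single rectangular window breaks into several disjoint Z-value ranges, each of which needs its own range scan.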

Mapping for KNN Queries

Buildings sorted by street number:

Name   x     y     Street   Height
C      2.7   2.3   12       80
F      1.7   3.8   13       120
A      0.7   1.2   14       100
I      1.6   6.7   22       60
H      0.6   5.8   23       50
G      2.8   4.7   24       100
J      3.4   6.6   24       40
D      5.5   2.4   31       90
B      5.8   1.2   32       50
E      6.6   2.5   32       40

[Figure: streets numbered 11 to 14, 21 to 24 and 31 to 32 around reference points 1, 2 and 3; search circle around Q with radii 0.35, 0.70, 1.05, 1.40, 1.75 and R = 2.10]

  • Story continued
      • New factory at Q[4, 1]
      • Find 3 nearest buildings to Q
  • Distances to Q: ||AQ|| = 3.31, ||BQ|| = 1.81, ||CQ|| = 1.84, ||DQ|| = 2.05, ||EQ|| = 3.00, ||FQ|| = 3.62
  • Termination condition
      • K candidates
      • All in the current search circle

Final candidates:

Rank   Candidate   Distance to Q
1      B           1.81
2      C           1.84
3      D           2.05
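The termination condition can be sketched as a search circle that grows in fixed steps until the k best candidates all lie inside it. The 0.35 step matches the radii shown in the figure. The candidate retrieval below is a hypothetical stand-in for the index: it returns everything in the bounding square of the circle, so it may return false positives outside the circle, much as the street lookup does.

```python
import math

BUILDINGS = {
    "A": (0.7, 1.2), "B": (5.8, 1.2), "C": (2.7, 2.3), "D": (5.5, 2.4),
    "E": (6.6, 2.5), "F": (1.7, 3.8), "G": (2.8, 4.7), "H": (0.6, 5.8),
    "I": (1.6, 6.7), "J": (3.4, 6.6),
}

def fetch_candidates(points, q, r):
    """Stand-in for the index: names of all points in the square around q."""
    return [
        name for name, p in points.items()
        if all(abs(c - qc) <= r for c, qc in zip(p, q))
    ]

def knn_expanding(points, q, k, step=0.35, max_radius=100.0):
    """Grow the search radius until the k nearest candidates are confirmed.

    Returns (names of the k nearest, final radius). Stops early only when
    there are k candidates and the k-th best lies inside the search circle.
    """
    i, best = 0, []
    while step * i < max_radius:
        i += 1
        r = step * i
        candidates = fetch_candidates(points, q, r)
        best = sorted((math.dist(points[n], q), n) for n in candidates)[:k]
        if len(best) == k and best[-1][0] <= r:
            break
    return [name for _dist, name in best], step * i

names, radius = knn_expanding(BUILDINGS, (4, 1), 3)
print(names, round(radius, 2))
```

Stopping is safe because any object not yet retrieved lies outside the current circle, so it is farther from Q than every confirmed candidate.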

The iDistance
  • Data partitioned into a number of clusters
  • Data mapping
      • Streets are concentric circles
      • Objects mapped to street numbers
  • Query mapping
      • Search circle mapped to streets intersected
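A minimal sketch of the two mappings, assuming the usual iDistance key of the form i * c + dist(p, O_i), where O_i is the reference point of partition i and c is a constant large enough to keep partitions apart. The reference points and the value of c below are made up for illustration (the slides do not give them), and a sorted list stands in for the B+-tree.

```python
import math
from bisect import bisect_left, bisect_right

BUILDINGS = {
    "A": (0.7, 1.2), "B": (5.8, 1.2), "C": (2.7, 2.3), "D": (5.5, 2.4),
    "E": (6.6, 2.5), "F": (1.7, 3.8), "G": (2.8, 4.7), "H": (0.6, 5.8),
    "I": (1.6, 6.7), "J": (3.4, 6.6),
}
REFS = [(2.0, 2.0), (2.0, 6.0), (6.0, 2.0)]  # hypothetical reference points
C = 10.0                                      # hypothetical partition stride

def partition_and_dist(p, refs):
    """Assign p to its nearest reference point; return (index, distance)."""
    i = min(range(len(refs)), key=lambda j: math.dist(p, refs[j]))
    return i, math.dist(p, refs[i])

def build(points, refs, c):
    """Data mapping: sorted (key, name) entries plus each partition's radius."""
    entries, radii = [], [0.0] * len(refs)
    for name, p in points.items():
        i, d = partition_and_dist(p, refs)
        entries.append((i * c + d, name))
        radii[i] = max(radii[i], d)
    return sorted(entries), radii

def query_key_ranges(q, r, refs, radii, c):
    """Query mapping: one key interval per partition the circle intersects."""
    ranges = []
    for i, (o, max_r) in enumerate(zip(refs, radii)):
        d = math.dist(q, o)
        if d - r > max_r:
            continue  # the search circle does not reach this partition
        ranges.append((i * c + max(0.0, d - r), i * c + min(max_r, d + r)))
    return ranges

def range_search(q, r, entries, radii, points, refs, c):
    """Names of all points within distance r of q, found via the key ranges."""
    keys = [k for k, _ in entries]
    hits = []
    for lo, hi in query_key_ranges(q, r, refs, radii, c):
        for _key, name in entries[bisect_left(keys, lo):bisect_right(keys, hi)]:
            if math.dist(points[name], q) <= r:  # drop false positives
                hits.append(name)
    return sorted(hits)

ENTRIES, RADII = build(BUILDINGS, REFS, C)
print(range_search((4, 1), 2.1, ENTRIES, RADII, BUILDINGS, REFS, C))
```

In the slide's analogy, the distance part of the key plays the role of the street (a ring around the reference point) and the partition index plays the role of the leading digit of the street number.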

Hierarchical Tree Structures
  • R-tree
      • Minimum bounding rectangle (MBR)
      • Incomplete and overlapping partitioning
      • Disk-based; Balanced
  • K-d-tree
      • Space division recursively
      • Complete and disjoint partitioning
      • In-memory; Unbalanced
      • There are algorithms to page and balance the tree, but with more complex manipulations

[Figures: an example R-tree and an example k-d-tree over objects A to G, annotated "Problem: Overlap" and "Problem: Empty space"]
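Recursive space division in a k-d-tree can be sketched as below. This is a plain in-memory version that inserts points in arrival order and alternates the split axis by depth, so the tree shape depends on insertion order (hence "unbalanced"). The class and function names, and the rule that ties go to the right subtree, are illustrative assumptions.

```python
class KdNode:
    def __init__(self, name, point, axis):
        self.name, self.point, self.axis = name, point, axis
        self.left = None    # points with a smaller coordinate on self.axis
        self.right = None   # points with a greater or equal coordinate

def insert(root, name, point, dims=2):
    """Insert a point; each level splits space on the next axis in turn."""
    if root is None:
        return KdNode(name, point, 0)
    node = root
    while True:
        side = "left" if point[node.axis] < node.point[node.axis] else "right"
        child = getattr(node, side)
        if child is None:
            setattr(node, side, KdNode(name, point, (node.axis + 1) % dims))
            return root
        node = child

def window_query(node, lo, hi, out=None):
    """Collect names of points p with lo[d] <= p[d] <= hi[d] in every dimension."""
    if out is None:
        out = []
    if node is None:
        return out
    a, p = node.axis, node.point
    if all(l <= c <= h for c, l, h in zip(p, lo, hi)):
        out.append(node.name)
    if lo[a] < p[a]:        # window reaches into the left half-space
        window_query(node.left, lo, hi, out)
    if hi[a] >= p[a]:       # window reaches into the right half-space
        window_query(node.right, lo, hi, out)
    return out

BUILDINGS = {
    "A": (0.7, 1.2), "B": (5.8, 1.2), "C": (2.7, 2.3), "D": (5.5, 2.4),
    "E": (6.6, 2.5), "F": (1.7, 3.8), "G": (2.8, 4.7), "H": (0.6, 5.8),
    "I": (1.6, 6.7), "J": (3.4, 6.6),
}
root = None
for name, point in BUILDINGS.items():
    root = insert(root, name, point)
print(sorted(window_query(root, (0, 2), (4, 5))))
```

Because the half-spaces at each node are disjoint and cover all of space, a window query only descends into a subtree when the window crosses that side of the split.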

Hierarchical Tree Structures (continued)
  • Quad-tree
      • Space divided into 4 rectangles recursively
      • Complete and disjoint partitioning
      • In-memory; Unbalanced
      • There are algorithms to page and balance the tree, but with more complex manipulations

[Figure: the point quad-tree over points A to G, with NW, NE, SW and SE quadrants at each node]
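A point quad-tree can be sketched as follows: every stored point splits its region into NW, NE, SW and SE quadrants, and new points descend until they reach an empty quadrant. The names and the tie-breaking rule (points on a dividing line go north or east) are assumptions for illustration.

```python
class QuadNode:
    def __init__(self, name, x, y):
        self.name, self.x, self.y = name, x, y
        self.children = {"NW": None, "NE": None, "SW": None, "SE": None}

def quadrant(node, x, y):
    """Quadrant of (x, y) relative to the point stored at node."""
    ns = "N" if y >= node.y else "S"
    ew = "E" if x >= node.x else "W"
    return ns + ew

def insert(root, name, x, y):
    """Insert a point by descending to the first empty quadrant."""
    if root is None:
        return QuadNode(name, x, y)
    node = root
    while True:
        q = quadrant(node, x, y)
        if node.children[q] is None:
            node.children[q] = QuadNode(name, x, y)
            return root
        node = node.children[q]

def find(root, x, y):
    """Point query: name stored at (x, y), or None if no such point exists."""
    node = root
    while node is not None:
        if (node.x, node.y) == (x, y):
            return node.name
        node = node.children[quadrant(node, x, y)]
    return None

def depth(node):
    """Number of nodes on the longest root-to-leaf path."""
    if node is None:
        return 0
    return 1 + max(depth(child) for child in node.children.values())

BUILDINGS = {
    "A": (0.7, 1.2), "B": (5.8, 1.2), "C": (2.7, 2.3), "D": (5.5, 2.4),
    "E": (6.6, 2.5), "F": (1.7, 3.8), "G": (2.8, 4.7), "H": (0.6, 5.8),
    "I": (1.6, 6.7), "J": (3.4, 6.6),
}
root = None
for name, (x, y) in BUILDINGS.items():
    root = insert(root, name, x, y)
print(find(root, 3.4, 6.6), depth(root))
```

Inserting the ten sample buildings in alphabetical order gives a tree much deeper than a balanced four-way tree would need, which illustrates the "Unbalanced" bullet.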

Compression Based Indexing
  • The dimensionality curse
  • The Vector Approximation File (VA-File)

[Figures: "VA File" and "Skewed data"]
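The idea behind the VA-File can be sketched as a two-level scan: every vector is replaced by a few bits per dimension (the grid cell it falls in), the small approximation file is scanned sequentially, and the full vector is fetched only when the lower bound on its distance cannot rule it out. The sketch below is a simplified single-pass variant with a uniform grid over an assumed domain [0, 8); all names and parameters are illustrative assumptions.

```python
import math

def approximate(v, lo, hi, bits):
    """Quantize each coordinate of v to a cell index using `bits` bits."""
    cells = 1 << bits
    return tuple(min(cells - 1, int((x - lo) / (hi - lo) * cells)) for x in v)

def lower_bound(q, approx, lo, hi, bits):
    """Smallest possible distance from q to any point in the approximated cell."""
    width = (hi - lo) / (1 << bits)
    total = 0.0
    for qc, cell in zip(q, approx):
        cell_lo = lo + cell * width
        cell_hi = cell_lo + width
        if qc < cell_lo:
            total += (cell_lo - qc) ** 2
        elif qc > cell_hi:
            total += (qc - cell_hi) ** 2
    return math.sqrt(total)

def va_knn(q, names, vectors, k, lo=0.0, hi=8.0, bits=2):
    """kNN via the approximations; returns (names, number of vectors fetched)."""
    approxs = [approximate(v, lo, hi, bits) for v in vectors]  # the "VA file"
    best, fetched = [], 0
    for idx, a in enumerate(approxs):
        if len(best) == k and lower_bound(q, a, lo, hi, bits) > best[-1][0]:
            continue  # the cell is too far away: skip the full vector
        fetched += 1
        best.append((math.dist(q, vectors[idx]), names[idx]))
        best = sorted(best)[:k]
    return [name for _dist, name in best], fetched

NAMES = list("ABCDEFGHIJ")
VECTORS = [(0.7, 1.2), (5.8, 1.2), (2.7, 2.3), (5.5, 2.4), (6.6, 2.5),
           (1.7, 3.8), (2.8, 4.7), (0.6, 5.8), (1.6, 6.7), (3.4, 6.6)]
print(va_knn((4, 1), NAMES, VECTORS, 3))
```

Skipping is safe because the lower bound never exceeds the true distance. With skewed data many vectors share the same few cells, so the bounds become loose and more full vectors must be fetched, which is in line with the "Skewed data" label on this slide.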

Summary of the Indexing Techniques

Index               | Disk-based / In-memory | Balanced | Efficient query type  | Dimensionality | Comments
R-tree              | Disk-based             | Yes      | Point, window, kNN    | Low            | Disadvantage is overlap
K-d-tree            | In-memory              | No       | Point, window, kNN(?) | Low            | Inefficient for skewed data
Quad-tree           | In-memory              | No       | Point, window, kNN(?) | Low            | Inefficient for skewed data
Z-curve + B+-tree   | Disk-based             | Yes      | Point, window         | Low            | Order of the Z-curve affects performance
iDistance           | Disk-based             | Yes      | Point, kNN            | High           | Not good for uniform data in very high-D
VA-File             | Disk-based             |          | Point, window, kNN    | High           | Not good for skewed data

Index Implementations in Major DBMS
  • SQL Server
      • B+-Tree data structure
      • Clustered indexes are sparse
      • Indexes maintained as updates/insertions/deletes are performed
  • Oracle
      • B+-tree, hash, bitmap, spatial extender for R-Tree
      • Clustered index
          • Index organized table (unique/clustered)
          • Clusters used when creating tables
  • DB2
      • B+-Tree data structure, spatial extender for R-tree
      • Clustered indexes are dense
      • Explicit command for index reorganization

Recommended Readings and References
  • Survey on multidimensional indexing techniques
      • Christian Böhm, Stefan Berchtold, Daniel A. Keim. Searching in high-dimensional spaces: Index structures for improving the performance of multimedia databases. ACM Computing Surveys 2001.
      • Volker Gaede, Oliver Günther. Multidimensional Access Methods. ACM Computing Surveys 1998.
  • Mapping based indexing
      • Rui Zhang, Panos Kalnis, Beng Chin Ooi, Kian-Lee Tan. Generalized Multi-dimensional Data Mapping and Query Processing. ACM Transactions on Database Systems (TODS), 30(3), 2005.
  • Space-filling curves
      • H. V. Jagadish. Linear Clustering of Objects with Multiple Attributes. ACM SIGMOD Conference (SIGMOD) 1990.
  • iDistance
      • H. V. Jagadish, Beng Chin Ooi, Kian-Lee Tan, Cui Yu, Rui Zhang. iDistance: An Adaptive B+-tree Based Indexing Method for Nearest Neighbor Search. ACM Transactions on Database Systems (TODS), 30(2), 2005.
  • R-tree
      • Antonin Guttman. R-Trees: A Dynamic Index Structure for Spatial Searching. ACM SIGMOD Conference (SIGMOD) 1984.
  • Quad-tree
      • Hanan Samet. The Quadtree and Related Hierarchical Data Structures. ACM Computing Surveys 1984.
  • VA-File
      • Roger Weber, Hans-Jörg Schek, Stephen Blott. A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces. International Conference on Very Large Data Bases (VLDB) 1998.