Multidimensional Data

• Many applications of databases are "geographic", i.e. they involve 2-dimensional data. Others involve large numbers of dimensions.
• Example: data about sales.
- A sale is described by (store, day, item, color, size, etc.).
• Sale = point in a 5-dimensional space.
- A customer is described by (age, salary, pcode, marital status, etc.).

Typical Queries
• Range queries: "How many customers for gold jewelry have age between 45 and 55, and salary less than 100K?"
• Nearest neighbor: "If I am at coordinates (a, b), what is the nearest McDonalds?"
• They are expressible in SQL. Do you see how?

SQL
• Range queries: "How many customers for gold jewelry have age between 45 and 55, and salary less than 100K?"
    SELECT COUNT(*)
    FROM Customers
    WHERE age >= 45 AND age <= 55 AND sal < 100;
• Nearest neighbor: "If I am at coordinates (a, b), what is the nearest McDonalds?" Suppose we have a relation Points(x, y, name).
    SELECT *
    FROM Points p
    WHERE p.name = 'McDonalds' AND NOT EXISTS (
        SELECT *
        FROM Points q
        WHERE (q.x - a)*(q.x - a) + (q.y - b)*(q.y - b) < (p.x - a)*(p.x - a) + (p.y - b)*(p.y - b)
          AND q.name = 'McDonalds'
    );

Big Impediment
• For these types of queries, there is no clean way to eliminate lots of records that don't meet the condition of the WHERE clause.

An Approach for range queries
• Index on each attribute independently.
- Intersect the pointer sets in main memory to save disk I/O.

Attempt at using B-trees for MD queries
• Database = 1,000,000 points evenly distributed in a 1000×1000 square, stored in 10,000 blocks (100 records per block).
• B-tree secondary indexes on x and on y.
• Range query: {(x, y) : 450 ≤ x ≤ 550, 450 ≤ y ≤ 550}
• 100,000 pointers (i.e. 1,000,000/10) for the x range, and the same for y.
• 10,000 pointers for the answer (found by pointer intersection).
• Retrieve 10,000 records. If they are stored randomly, we need to do 10,000 I/O's.
Add here the cost of the B-trees:
• Root of each B-tree is in main memory.
• Suppose leaves hold 200 keys on average: 100,000/200 = 500 disk I/O's in each B-tree to get the pointer lists.
• 1000 + 2 (for the intermediate B-tree level) disk I/O's for the two indexes together.
Total
• 11,002 disk I/O's, more than a sequential scan of the file = 10,000 I/O's.
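The arithmetic on this slide can be replayed in a few lines. This is only a back-of-the-envelope sketch: the figures are the ones stated on the slide, while the variable names are illustrative and not part of the original material.

```python
# I/O estimate for the range query 450 <= x <= 550, 450 <= y <= 550.
# Figures are taken from the slide; names are illustrative.
NUM_POINTS = 1_000_000      # points in the 1000 x 1000 square
SIDE = 1000                 # side length of the square
RECORDS_PER_BLOCK = 100     # data records per disk block
KEYS_PER_LEAF = 200         # average keys per B-tree leaf
RANGE_WIDTH = 100           # width of the query range in each dimension

# A sequential scan reads every data block once.
data_blocks = NUM_POINTS // RECORDS_PER_BLOCK

# Points are evenly distributed, so a stripe of width 100 holds 1/10 of them.
pointers_per_index = NUM_POINTS * RANGE_WIDTH // SIDE

# Leaf reads per index, plus one intermediate node (the root is in memory).
leaf_ios_per_index = pointers_per_index // KEYS_PER_LEAF
index_ios = 2 * (leaf_ios_per_index + 1)

# Points in the 100 x 100 query square; one random I/O each in the worst case.
answer_records = NUM_POINTS * RANGE_WIDTH * RANGE_WIDTH // (SIDE * SIDE)

total_ios = index_ios + answer_records
print(f"index I/O's: {index_ios}, record I/O's: {answer_records}, total: {total_ios}")
print(f"sequential scan: {data_blocks} I/O's")
```

The total exceeds the cost of the sequential scan because each of the 10,000 answer records may sit in a different block.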

Nearest Neighbor query using B-trees
• Turn NN to (10, 20) into a range query {(x, y) : 10-d ≤ x ≤ 10+d, 20-d ≤ y ≤ 20+d}
• Possible problems:
1. No point in the selected range
2. The closest point inside the range may not be the answer: a point just outside the square can be closer than a point near one of its corners
• Solution: re-execute the range query with a slightly larger d
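A minimal sketch of this retry loop, under some assumptions: the range query is simulated by a linear filter over an in-memory list (standing in for the two B-tree indexes), the function names are hypothetical, and d is doubled on each retry, whereas the slide only says "slightly larger". A candidate is accepted only if its distance is at most d, because then the circle of radius d around the query point lies entirely inside the searched square and no outside point can be closer.

```python
import math

def range_query(points, cx, cy, d):
    """Points inside the square of half-width d centred on (cx, cy).
    A linear filter stands in for the two B-tree range scans."""
    return [p for p in points if abs(p[0] - cx) <= d and abs(p[1] - cy) <= d]

def nearest_neighbor(points, cx, cy, d=1.0, growth=2.0):
    """Nearest neighbour of (cx, cy) by repeated, widening range queries."""
    if not points:
        return None
    while True:
        candidates = range_query(points, cx, cy, d)
        if candidates:
            best = min(candidates, key=lambda p: math.dist(p, (cx, cy)))
            # Safe to stop only if the circle through `best` fits in the square.
            if math.dist(best, (cx, cy)) <= d:
                return best
        d *= growth  # problem 1 or 2 occurred: widen the range and retry

# (10.9, 20.9) lies in the first square but is about 1.27 away from (10, 20);
# (10, 21.2) lies outside the first square but is only 1.2 away.
sample = [(10.9, 20.9), (10, 21.2), (500, 500)]
print(nearest_neighbor(sample, 10, 20))
```

With d = 1 the first query finds only the corner point, whose distance exceeds d, so the loop widens the range and then finds the true nearest neighbor.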

NN queries, example
• Same relation Points and its indexes on x and y as before. Query: NN to (10, 20).
• Choose d = 1: range query = {(x, y) : 9 ≤ x ≤ 11, 19 ≤ y ≤ 21}
• 2000 points in [9, 11], 2000 points in [19, 21]
• For each dimension, we pay 10+1 I/O's to get the pointers from the B-tree leaves (2000 pointers at 200 keys per leaf = 10 leaves)
- The +1 is because the points with x=9 may not start right at the beginning of a leaf
• Add an extra I/O for the intermediate node when finding the start of the range in each index
• Total: 24 + 1 disk I/O's to get the answer,
- assuming 1 of the 4 points in the square is the answer, which we can determine from their coordinates, prior to getting the data blocks holding the points
• However, if d is too small, we have to run another range query with a larger d

Grid files (hash-like structure)
• Divide data into stripes in each dimension
• Each rectangle is a bucket
• Example: database records (age, salary) for people who buy gold jewelry.
Data: (25, 60) (45, 60) (50, 75) (50, 100) (50, 120) (70, 110) (85, 140) (30, 260) (25, 400) (45, 350) (50, 275) (60, 260)

Grid file

Operations
Lookup
• Finding the stripe of the point in each dimension gives you a bucket to search.
Nearest Neighbor
• Lookup point P. Consider points in its bucket.
• Problem: there could be points in adjacent buckets that are closer.
• Problem: there could be no points at all in the bucket: widen search?
Range Queries
• Ranges define a region of buckets.
• Buckets on the border may contain points not in the range.
• Example: 35 < age <= 45; 50 < salary <= 100.
Queries Specifying Only One Attribute
• Must search a whole row or column of buckets.
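Lookup and range queries on a grid file can be sketched as follows. Several things here are assumptions rather than part of the slides: the grid lines (age 40 and 55, salary 90 and 225) are chosen for illustration because the original figure is not reproduced, a point lying exactly on a grid line is assigned to the higher stripe, and the class and method names are hypothetical. Buckets are kept in a dictionary instead of disk blocks.

```python
from bisect import bisect_right
from itertools import product

class GridFile:
    """Toy 2-D grid file; buckets live in a dict keyed by stripe numbers."""

    def __init__(self, age_lines, salary_lines):
        self.lines = (age_lines, salary_lines)  # sorted stripe boundaries
        self.buckets = {}

    def bucket_of(self, point):
        # Stripe number per dimension = number of grid lines <= the value.
        return tuple(bisect_right(lines, v) for lines, v in zip(self.lines, point))

    def insert(self, point):
        self.buckets.setdefault(self.bucket_of(point), []).append(point)

    def lookup(self, point):
        return point in self.buckets.get(self.bucket_of(point), [])

    def range_query(self, age_lo, age_hi, sal_lo, sal_hi):
        """Points with age_lo < age <= age_hi and sal_lo < salary <= sal_hi.
        Returns (matching points, number of buckets examined)."""
        lo = self.bucket_of((age_lo, sal_lo))
        hi = self.bucket_of((age_hi, sal_hi))
        cells = list(product(range(lo[0], hi[0] + 1), range(lo[1], hi[1] + 1)))
        # Border buckets may hold points outside the range, so filter them.
        hits = [p for cell in cells for p in self.buckets.get(cell, [])
                if age_lo < p[0] <= age_hi and sal_lo < p[1] <= sal_hi]
        return hits, len(cells)

POINTS = [(25, 60), (45, 60), (50, 75), (50, 100), (50, 120), (70, 110),
          (85, 140), (30, 260), (25, 400), (45, 350), (50, 275), (60, 260)]

grid = GridFile(age_lines=[40, 55], salary_lines=[90, 225])  # assumed grid lines
for p in POINTS:
    grid.insert(p)

print(grid.bucket_of((50, 100)))
print(grid.range_query(35, 45, 50, 100))
```

With these assumed grid lines, the example range from the slide touches four buckets, and most of the points read from them are filtered out again, which illustrates the remark about border buckets.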

Insertion • Use overflow buckets, or split stripes in one or more dimensions • Insert (52, 200).

Insertion
• Insert (52, 200). Split the central bucket, for instance by splitting the central salary stripe (one possibility).
• Splitting a stripe affects every bucket in it: here 3 buckets have to be processed.
• In general, all n buckets of a stripe have to be processed during a split.

Grid files
Advantages
• Good for multiple-key search
• Supports partial match, range, and NN queries
Disadvantages
• Space management overhead
• Need partitioning ranges that split the keys evenly
• Possibility of overflow buckets on insertion

Partitioned hashing I
• If we hash the concatenation of several keys, then the hash table cannot be used for queries specifying only one dimension (key).
• Instead, build the hash function h from n hash functions, one for each dimensional attribute:
• h = (h1, …, hn)
• The bucket in which to put a tuple (v1, …, vn) is computed by concatenating the bit sequences h1(v1)…hn(vn).

Partitioned hashing II
• Example: gold jewelry with
- first bit: age mod 2
- bits 2 and 3: salary mod 4
• Partial match?
• Range?
• NN?

Partitioned hashing III
• Partial match query
- specifying only the value of age:
  • compute h_age(a), which could be, say, 1.
  • Then locate all the relevant buckets, which are 100 to 111.
- specifying only the value of salary:
  • compute h_salary(s), which could be, say, 10.
  • Then locate the relevant buckets, which are 010 and 110.
• Bad for:
- range queries
- nearest neighbor queries
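The bucket computation and the partial match lookup can be sketched as below, using the hash functions from the gold-jewelry example (age mod 2 for the first bit, salary mod 4 for bits 2 and 3). The function names are hypothetical, and bucket numbers are kept as bit strings for readability.

```python
def h_age(age):
    """First bit of the bucket number: age mod 2."""
    return format(age % 2, "01b")

def h_salary(salary):
    """Bits 2 and 3 of the bucket number: salary mod 4."""
    return format(salary % 4, "02b")

def bucket(age, salary):
    """Bucket number of a tuple: concatenation of the per-attribute hashes."""
    return h_age(age) + h_salary(salary)

def partial_match_buckets(age=None, salary=None):
    """Buckets to visit when only some of the attributes are specified.
    An unspecified attribute contributes every possible bit pattern."""
    age_bits = [h_age(age)] if age is not None else ["0", "1"]
    salary_bits = ([h_salary(salary)] if salary is not None
                   else ["00", "01", "10", "11"])
    return [a + s for a in age_bits for s in salary_bits]

print(bucket(50, 75))                      # age 50 -> 0, salary 75 -> 11
print(partial_match_buckets(age=25))       # age bit is 1
print(partial_match_buckets(salary=110))   # salary bits are 10
```

Nearby values such as salaries 110 and 111 hash to unrelated buckets, which is why the structure helps neither range nor nearest neighbor queries.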

Grid files vs. partitioned hashing
• With many dimensions, a grid has many empty cells, while partitioned hashing is OK.
• Both support exact and partial match queries.
• Grid files are good for range and NN queries, while partitioned hashing is not good for them at all.

Multiple key indexes
• An index on one attribute provides pointers to indexes on the other.
• Let V be a value of the first attribute.
• Then the index we reach by following the pointer for V is an index into the set of points that have V as their value in the first attribute and any value for the second attribute.

Example
• "Who buys gold jewelry" (age and salary only).
Raw data in age-salary pairs: (25; 60) (45; 60) (50; 75) (50; 100) (50; 120) (70; 110) (85; 140) (30; 260) (25; 400) (45; 350) (50; 275) (60; 260)
• Question: For what kinds of queries will a multiple key index (age first) significantly reduce the number of disk I/O's?
• The indexes can be organized as B-trees.

Operations
Partial match queries
• If the first attribute is specified, then the access is quite efficient.
• If the first attribute isn't specified, then we have to search every sub-index.
Range queries
• Work quite well, provided the individual indexes themselves support range queries on their attribute (e.g. they are B-trees).
- Example: the range query is 35 ≤ age ≤ 55 AND 100 ≤ sal ≤ 200.
NN queries
• Similar to range queries.
• Also, the indexes should be "primary" ones if we want to support range queries efficiently.
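A small sketch of a multiple key index with age first may help here. It is an in-memory approximation under stated assumptions: sorted Python lists searched with `bisect` stand in for the B-trees, and the class and method names are hypothetical.

```python
from bisect import bisect_left, bisect_right

class MultiKeyIndex:
    """Index on age whose entries point to per-age indexes on salary.
    Sorted lists stand in for the B-trees of the slides."""

    def __init__(self, points):
        by_age = {}
        for age, salary in points:
            by_age.setdefault(age, []).append(salary)
        self.ages = sorted(by_age)                              # root index
        self.sub = {a: sorted(s) for a, s in by_age.items()}    # sub-indexes

    def range_query(self, age_lo, age_hi, sal_lo, sal_hi):
        """Points with age_lo <= age <= age_hi and sal_lo <= salary <= sal_hi."""
        result = []
        i = bisect_left(self.ages, age_lo)
        j = bisect_right(self.ages, age_hi)
        for age in self.ages[i:j]:          # only sub-indexes in the age range
            salaries = self.sub[age]
            k = bisect_left(salaries, sal_lo)
            m = bisect_right(salaries, sal_hi)
            result.extend((age, s) for s in salaries[k:m])
        return result

    def match_salary(self, salary):
        """Partial match on the second attribute: every sub-index is searched."""
        result = []
        for age in self.ages:
            salaries = self.sub[age]
            k = bisect_left(salaries, salary)
            if k < len(salaries) and salaries[k] == salary:
                result.append((age, salary))
        return result

POINTS = [(25, 60), (45, 60), (50, 75), (50, 100), (50, 120), (70, 110),
          (85, 140), (30, 260), (25, 400), (45, 350), (50, 275), (60, 260)]

index = MultiKeyIndex(POINTS)
print(index.range_query(35, 55, 100, 200))
print(index.match_salary(260))
```

The range query opens only the sub-indexes for ages 45 and 50, while the salary-only query has to visit all seven sub-indexes, which mirrors the two partial match cases above.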

KD Trees • Levels rotate among the dimensions, partitioning the points by comparison with a value for that dimension. • Leaves are blocks holding the data records.

Geometrically… • Remember we didn’t want the stripes in grid files to continue all along the vertical or horizontal direction. • Here they don’t.

Operations
Lookup in KD Trees
• Find the appropriate leaf by binary search. Is the record there?
Insert Into KD Trees
• Lookup the record to be inserted, reaching the appropriate leaf.
• If there is room, put the record in that block.
• If not, find a suitable value for the appropriate dimension and split the leaf block on that dimension.
Example
• Someone 35 years old with a salary of $500K buys gold jewelry.
• Belongs in the leaf with (25; 400) and (45; 350).
• Too full: split on age. See figure next.

Someone 35 years old with a salary of $500K buys gold jewelry. It is the turn of "age" to be used for the split. Split at 35; it's the median.

Queries
Partial match queries
• When the query does not specify a value for the attribute tested at a node, we must explore both of its children.
- E.g. find points with age=50
Range Queries
• Sometimes a range will allow us to move to only one child of a node.
• But if the range straddles the splitting value, then we must explore both children.
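The KD tree operations described on the last few slides can be put together in a short sketch. The assumptions are mine: leaf blocks hold 2 records, the split value is the median of the leaf on the dimension whose turn it is, records equal to the split value go right, and the class name `KDNode` is hypothetical. The small tree `leaf_demo` replays the insertion example in isolation, with a fresh root playing the role of the leaf holding (25, 400) and (45, 350).

```python
BLOCK_CAPACITY = 2   # assumed number of records per leaf block
DIMS = 2             # (age, salary)
INF = float("inf")

class KDNode:
    def __init__(self, depth=0):
        self.depth = depth          # level decides the dimension to compare
        self.records = []           # used while the node is a leaf
        self.split_value = None     # set once the node becomes interior
        self.left = self.right = None

    def insert(self, rec):
        dim = self.depth % DIMS
        if self.left is not None:   # interior node: descend to one child
            child = self.left if rec[dim] < self.split_value else self.right
            child.insert(rec)
            return
        self.records.append(rec)
        if len(self.records) > BLOCK_CAPACITY:
            self._split(dim)

    def _split(self, dim):
        values = sorted(r[dim] for r in self.records)
        self.split_value = values[len(values) // 2]   # median on this dimension
        self.left, self.right = KDNode(self.depth + 1), KDNode(self.depth + 1)
        for r in self.records:      # ties on the split value go right
            (self.left if r[dim] < self.split_value else self.right).records.append(r)
        self.records = []

    def range_query(self, lo, hi):
        """Records r with lo[d] <= r[d] <= hi[d] in every dimension d."""
        if self.left is None:
            return [r for r in self.records
                    if all(lo[d] <= r[d] <= hi[d] for d in range(DIMS))]
        dim, out = self.depth % DIMS, []
        if lo[dim] < self.split_value:      # range reaches below the split
            out += self.left.range_query(lo, hi)
        if hi[dim] >= self.split_value:     # range reaches the split or above
            out += self.right.range_query(lo, hi)
        return out

    def lookup(self, rec):
        return rec in self.range_query(rec, rec)

POINTS = [(25, 60), (45, 60), (50, 75), (50, 100), (50, 120), (70, 110),
          (85, 140), (30, 260), (25, 400), (45, 350), (50, 275), (60, 260)]
tree = KDNode()
for p in POINTS:
    tree.insert(p)

leaf_demo = KDNode()                      # stands in for the full leaf
for p in [(25, 400), (45, 350), (35, 500)]:
    leaf_demo.insert(p)

print(leaf_demo.split_value, leaf_demo.left.records, leaf_demo.right.records)
print(sorted(tree.range_query((50, -INF), (50, INF))))   # partial match: age=50
```

A partial match such as age=50 is expressed as a range that is unbounded in salary, so at every level that compares salaries both children are explored.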

KD trees in secondary storage • If the internal nodes don't fit in main memory, group them into blocks.

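Quad trees
• Nodes split on all dimensions at once
• For a quad tree of k dimensions, each interior node has 2^k children.
[Figure: the gold-jewelry points, labeled a to l, plotted with Age (0 to 100) and Sal (0 to 400) as axes, next to the corresponding quad tree. The root splits at (Age 50, Sal 200); two of its children are interior nodes that split at (Age 75, Sal 100) and (Age 25, Sal 300).]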

Why quad trees?
• With k dimensions a node has 2^k children, e.g. k=7 gives 128 children.
• If 128, or 2^7, pointers can fit in a block, then k=7 is a convenient number of dimensions.

Quad Tree Insert and Queries
Insert
• Find the leaf node in which the new point belongs.
• If there is room, put it there.
• If not, make the leaf an interior node and give it leaves for each quadrant. Split the points among the new leaves.
• Problem: may make lots of null pointers, especially in high dimensions.
Quad Tree Queries
• Single point queries: easy; just go down the tree to the proper leaf.
• Range queries: varies by position of range.
- Example: a range like 45<age<55; 180<salary<220 requires search of four leaves.
• Nearest neighbor: problems and strategies similar to grid files.
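The insert and range query steps can be sketched for the 2-dimensional case as follows. This is a sketch under assumptions of my own: each region is split at its center, leaves hold 2 points, a point on a dividing line goes to the higher quadrant, all points are distinct, and the class name `QuadNode` is hypothetical. Because of the boundary rule, the tree it builds need not match the one drawn on the slides.

```python
CAPACITY = 2  # assumed number of points per leaf

class QuadNode:
    def __init__(self, x_lo, x_hi, y_lo, y_hi):
        self.x_lo, self.x_hi, self.y_lo, self.y_hi = x_lo, x_hi, y_lo, y_hi
        self.points = []        # used while the node is a leaf
        self.children = None    # [SW, SE, NW, NE] once the node is interior

    def _child_for(self, x, y):
        xm = (self.x_lo + self.x_hi) / 2
        ym = (self.y_lo + self.y_hi) / 2
        return self.children[(1 if x >= xm else 0) + (2 if y >= ym else 0)]

    def insert(self, x, y):
        if self.children is not None:       # interior: go down one quadrant
            self._child_for(x, y).insert(x, y)
            return
        self.points.append((x, y))
        if len(self.points) > CAPACITY:     # no room: leaf becomes interior
            self._split()

    def _split(self):
        xm = (self.x_lo + self.x_hi) / 2
        ym = (self.y_lo + self.y_hi) / 2
        self.children = [QuadNode(self.x_lo, xm, self.y_lo, ym),   # SW
                         QuadNode(xm, self.x_hi, self.y_lo, ym),   # SE
                         QuadNode(self.x_lo, xm, ym, self.y_hi),   # NW
                         QuadNode(xm, self.x_hi, ym, self.y_hi)]   # NE
        old, self.points = self.points, []
        for x, y in old:                    # redistribute among new leaves
            self._child_for(x, y).insert(x, y)

    def range_query(self, x1, x2, y1, y2):
        """Points with x1 < x < x2 and y1 < y < y2.
        Returns (matching points, number of leaves searched)."""
        if not (x1 < self.x_hi and x2 > self.x_lo and
                y1 < self.y_hi and y2 > self.y_lo):
            return [], 0                    # region misses the range
        if self.children is None:
            hits = [(x, y) for x, y in self.points if x1 < x < x2 and y1 < y < y2]
            return hits, 1
        hits, leaves = [], 0
        for child in self.children:
            h, n = child.range_query(x1, x2, y1, y2)
            hits += h
            leaves += n
        return hits, leaves

POINTS = [(25, 60), (45, 60), (50, 75), (50, 100), (50, 120), (70, 110),
          (85, 140), (30, 260), (25, 400), (45, 350), (50, 275), (60, 260)]
root = QuadNode(0, 100, 0, 400)   # Age 0..100, Sal 0..400; center is (50, 200)
for age, sal in POINTS:
    root.insert(age, sal)

print(root.range_query(45, 55, 180, 220))
print(root.range_query(40, 60, 50, 130))
```

The first query straddles the root's split point at (50, 200), so all four quadrants of the root are entered even though no point qualifies, which is the situation described in the range query example above.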