Semantic Data Compression Techniques for NASA and Mobile

Semantic Data Compression Techniques for NASA and Mobile Computing Databases
Principal Investigators: G. Ozsoyoglu, Z. M. Ozsoyoglu
Case Western Reserve University
Nov 7, 2002

Semantic Data Compression: Relevance and Impact
Relevance: Table data occurs frequently in computer networks, distributed mobile networks, and telecommunication networks such as NASA's Earth Science Enterprise, Space Science Enterprise, Mars Network, and Space-Based Internets. Compression and querying of stream data are directly applicable to NASA projects.
Impact: Databases will be compressed on a “query-need” basis. Query engines will be aware of the compression employed and will perform efficient querying.

Current State of the Art
Syntactic compression: A large number of techniques exist; they compress byte strings.
Semantic compression (new): Employ data semantics in approximating data. Answer queries with a guaranteed upper bound on the error of approximation.
* Representative tuples and outliers (row-wise relationships)
* Classification and regression trees (column-wise relationships)
* Employ attribute domain information

Project Goals
• Semantic-based relational database compression
• High data compression ratios
• Efficient query processing techniques
• User-specified query error bounds
• Suitable for real-time computing (when needed)
• Suitable for time-constrained query processing

Details #1: Lossy compression

Relation R
Rid  Age  Salary
r1   20   50K
r2   70   65K
r3   30   40K
r4   40   90K
r5   50   120K
r6   50   145K

Representative Relation P1, with error tolerances t.Age = 10 and t.Salary = 15K
Pid  Age  Salary
p1   20   50K
p2   50   105K

Compressed Relation Rc
Rid  P1.pid  P1.Signature  P1.Outlier  P2.Pid  P2.Signature  P2.Outlier
r1   p1      YY            -           p3      YY            -
r2   p1      NY            70          p4      NN            65K
r3   p1      YY            -           p4      YY            -
r4   p2      YY            -           p5      YY            -
r5   p2      YY            -           p6      YY            -
r6   p2      YN            145K        p6      YN            -

Tuple p1 = (20, 50K) represents a rectangle of tuples around it.
[Figure: rectangle around the point (20, 50K); axis labels 10 and 30 (Age), 20K and 70K (Salary).]
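The signatures on this slide can be reproduced with a short sketch. It reads "Y" as "this attribute lies within the error tolerance of the representative" and "N" as "this attribute is an outlier stored explicitly", which matches the slide's data but is an interpretation. The helper names `signature` and `compress`, and the rule of picking the representative with the most "Y" flags, are assumptions made for illustration, not the investigators' algorithm. Salaries are in thousands.

```python
# Relation R and representative relation P1 from the slide: (Age, Salary in K).
R = {
    "r1": (20, 50), "r2": (70, 65), "r3": (30, 40),
    "r4": (40, 90), "r5": (50, 120), "r6": (50, 145),
}
P1 = {"p1": (20, 50), "p2": (50, 105)}
TOLERANCE = (10, 15)  # (t.Age, t.Salary)


def signature(tup, rep, tol):
    """One flag per attribute: 'Y' if within tolerance of the representative."""
    return "".join(
        "Y" if abs(value - rep_value) <= t else "N"
        for value, rep_value, t in zip(tup, rep, tol)
    )


def compress(relation, reps, tol):
    """Hypothetical helper: map each tuple to (pid, signature, outlier values).

    Assumption: each tuple is assigned to the representative that covers
    the most attributes; the slide does not state the assignment rule.
    """
    compressed = {}
    for rid, tup in relation.items():
        pid = max(reps, key=lambda p: signature(tup, reps[p], tol).count("Y"))
        sig = signature(tup, reps[pid], tol)
        outliers = tuple(v for v, flag in zip(tup, sig) if flag == "N")
        compressed[rid] = (pid, sig, outliers)
    return compressed


if __name__ == "__main__":
    for rid, entry in compress(R, P1, TOLERANCE).items():
        print(rid, *entry)
```

Only the out-of-tolerance values (age 70 for r2, salary 145K for r6) need to be stored next to the representative id and signature, which is where the space saving comes from.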

Details #2: Multi-level lossy compression

Compressed Relation Rc
Rid  P1.pid  P1.Signature  P1.Outlier  P2.Pid  P2.Signature  P2.Outlier
r1   p1      YY            -           p3      YY            -
r2   p1      NY            70          p4      NN            65K
r3   p1      YY            -           p4      YY            -
r4   p2      YY            -           p5      YY            -
r5   p2      YY            -           p6      YY            -
r6   p2      YN            145K        p6      YN            -

Representative Relation P2
Pid  Age  Salary
p3   20   50K
p4   30   40K

Representative Relation P3, with error tolerances t.Age = 0 and t.Salary = 0
Pid  Age  Salary
p4   40   90K
p5   50   120K
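A minimal sketch of how the two levels differ when tuples are reconstructed. It assumes a tuple is rebuilt from its representative, with each "N" position replaced by a stored outlier value. The function names `reconstruct` and `max_error` are hypothetical. Two further assumptions: the second-level representatives (40, 90K) and (50, 120K) are labeled p5 and p6 to match the P2.Pid column of the compressed relation, and r2's second-level outliers are taken to be 70 and 65K, with the age value reused from the first level. Salaries are in thousands.

```python
# Original relation: (Age, Salary in K).
R = {
    "r1": (20, 50), "r2": (70, 65), "r3": (30, 40),
    "r4": (40, 90), "r5": (50, 120), "r6": (50, 145),
}

# Level 1: tolerances t.Age = 10, t.Salary = 15.
LEVEL1_REPS = {"p1": (20, 50), "p2": (50, 105)}
LEVEL1 = {
    "r1": ("p1", "YY", ()), "r2": ("p1", "NY", (70,)),
    "r3": ("p1", "YY", ()), "r4": ("p2", "YY", ()),
    "r5": ("p2", "YY", ()), "r6": ("p2", "YN", (145,)),
}

# Level 2: tolerances t.Age = 0, t.Salary = 0 (pids p5, p6 are assumed labels).
LEVEL2_REPS = {"p3": (20, 50), "p4": (30, 40), "p5": (40, 90), "p6": (50, 120)}
LEVEL2 = {
    "r1": ("p3", "YY", ()), "r2": ("p4", "NN", (70, 65)),
    "r3": ("p4", "YY", ()), "r4": ("p5", "YY", ()),
    "r5": ("p6", "YY", ()), "r6": ("p6", "YN", (145,)),
}


def reconstruct(reps, entry):
    """Rebuild a tuple: representative values, with 'N' slots taken from outliers."""
    pid, sig, outliers = entry
    stored = iter(outliers)
    return tuple(
        next(stored) if flag == "N" else value
        for value, flag in zip(reps[pid], sig)
    )


def max_error(reps, level):
    """Largest absolute error per attribute over all reconstructed tuples."""
    errors = [0, 0]
    for rid, entry in level.items():
        approx = reconstruct(reps, entry)
        for i, (a, b) in enumerate(zip(approx, R[rid])):
            errors[i] = max(errors[i], abs(a - b))
    return tuple(errors)


if __name__ == "__main__":
    print("level 1:", max_error(LEVEL1_REPS, LEVEL1))
    print("level 2:", max_error(LEVEL2_REPS, LEVEL2))
```

Under these assumptions the first level stays within its tolerances of 10 and 15K, and the second level reproduces R exactly.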

Details #3
* Monotonically decreasing error bounds: from error tolerances t.Age = 10 and t.Salary = 15K down to error tolerances t.Age = 0 and t.Salary = 0
* Guaranteed query error bounds
* User-specified guaranteed error bounds in queries:
  SELECT … FROM … WHERE … ERROR BOUND Age = <20, 5> and Salary = <15,000, 5,000>
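One way to picture how monotonically decreasing tolerances could serve a user-specified bound is to pick the coarsest compression level whose tolerances already satisfy the request. The sketch below is an illustration only: the `LEVELS` list, the `coarsest_level` function, and the use of a single maximum absolute error per attribute are assumptions. It does not interpret the paired values such as <20, 5> in the slide's ERROR BOUND clause. Salary tolerances are in dollars here.

```python
# Compression levels ordered from coarsest to finest, with per-attribute
# tolerances taken from the slides (15K written out as 15_000).
LEVELS = [
    {"name": "level 1", "tolerance": {"Age": 10, "Salary": 15_000}},
    {"name": "level 2", "tolerance": {"Age": 0, "Salary": 0}},
]


def coarsest_level(levels, requested):
    """Return the first (coarsest) level meeting every requested bound.

    `requested` maps attribute name -> maximum acceptable absolute error.
    """
    for level in levels:
        if all(level["tolerance"][attr] <= bound
               for attr, bound in requested.items()):
            return level["name"]
    raise ValueError("no compression level satisfies the requested bounds")


if __name__ == "__main__":
    print(coarsest_level(LEVELS, {"Age": 20, "Salary": 15_000}))
    print(coarsest_level(LEVELS, {"Age": 5, "Salary": 5_000}))
```

A loose request can be answered from the coarse level alone, while a tight request forces the finer level.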

Details #4
Compromise between query processing efficiency and guaranteed error bounds:
* One extreme: main-memory-only query processing; large error bounds; small query processing times.
* Other extreme: disk-based query processing; small error bounds; large query processing times.