A survey “Off the Record” – Using Alternative Data Models to Increase Data Density in Data Warehouse Environments
Presented by: Victor Gonzalez-Castro, Lachlan MacKinnon

Agenda
§ Introduction
§ Data Sparsity
§ State of the art
  § Relational Model
  § The Triple Store
  § The Binary Model
  § The Associative Model
  § The Transrelational Model
§ Our proposal
§ Questions

Introduction
• In Data Warehouse environments, data sparsity is a common issue that remains unresolved.
• Alternative data models that abandon the traditional record storage/manipulation structure have been researched.
• We are investigating the use of these alternative data models to increase data density, with the aim of decreasing data sparsity.

Origin of Data Sparsity
• Data sparsity originates from the aim of answering all possible user queries from the information stored in a Data Warehouse that contains Nulls.
Fig. 1. A three-level Time dimension (Year, Month, Day) and Nulls. After [6].

Origin of Data Sparsity (cont.)
• Data sparsity is the result of the Cartesian product of all dimensions and all aggregation levels.
Fig. 2. Data sparsity (sparse) and data density (dense). From [6].
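
As a rough illustration of the Cartesian-product argument, the sketch below counts the cells a fully materialised cube would have and compares that with the cells that actually hold a value. The dimension members and facts are made-up example data, and `cube_density` is a hypothetical helper name; neither comes from the presentation.

```python
from math import prod


def cube_density(dimensions, facts):
    """Return (possible_cells, populated_cells, density) for a cube.

    dimensions: dict mapping dimension name -> list of member values
    facts: set of coordinate tuples, one member per dimension
    """
    # The number of possible cells is the size of the Cartesian product.
    possible = prod(len(members) for members in dimensions.values())
    populated = len(facts)
    return possible, populated, populated / possible


# Hypothetical example data (not taken from the slides).
dimensions = {
    "month": ["Jan", "Feb", "Mar", "Apr"],
    "product": ["Nut", "Bolt", "Screw"],
    "city": ["London", "Paris", "Oslo"],
}
facts = {
    ("Jan", "Nut", "London"),
    ("Jan", "Bolt", "Paris"),
    ("Feb", "Screw", "Oslo"),
    ("Mar", "Nut", "London"),
}

possible, populated, density = cube_density(dimensions, facts)
print(f"{populated} of {possible} cells populated, density = {density:.3f}")
```

Every cell that is possible but not populated would be a Null in a fully materialised structure, which is the sparsity the slide describes.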

State of the art (Relational)
• The Relational Model [7] uses the traditional record storage/manipulation structure, for example the record (1234, Nut, Red, London).
• It is the base model against which the other models will be compared.
• All RDBMSs manage sparsity (missing information) poorly.
• Codd [7] suggested a fundamental change in the Relational Model Version 2: the use of a four-valued logic.
• No one has implemented this fundamental change.

State of the art (Relational)
• Major players in the relational market, e.g. SQL Server.

State of the art (Triple Store)
• The Triple Store [1], [2] uses a structure called the Name Store to keep all the names.

  Name Store
  Identifier  Name
  1           Nut
  2           Red
  3           London
  …           …

• To construct the processing structure, it uses triples.

  Triples
  1  2  3
  4  5  6
  …  …  …
  l  m  n
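
The two structures on this slide can be sketched in a few lines, assuming that every name is stored once and that triples hold only identifiers. The class name `TripleStore`, its methods, and the predicate names `colour` and `city` are illustrative assumptions; they are not the TriStarp interface.

```python
class TripleStore:
    """Toy triple store: names live once in a name store, triples hold identifiers."""

    def __init__(self):
        self.name_to_id = {}  # name store, lookup by name
        self.id_to_name = {}  # name store, lookup by identifier
        self.triples = set()  # each triple is a tuple of three identifiers

    def intern(self, name):
        """Return the identifier for name, allocating one on first use."""
        if name not in self.name_to_id:
            ident = len(self.name_to_id) + 1
            self.name_to_id[name] = ident
            self.id_to_name[ident] = name
        return self.name_to_id[name]

    def add(self, subject, predicate, obj):
        """Store one triple as three identifiers."""
        self.triples.add(
            (self.intern(subject), self.intern(predicate), self.intern(obj))
        )

    def match(self, subject=None, predicate=None, obj=None):
        """Return matching triples as names, sorted; None acts as a wildcard."""
        # Unknown names map to -1, which never equals a real identifier.
        pattern = [
            None if name is None else self.name_to_id.get(name, -1)
            for name in (subject, predicate, obj)
        ]
        hits = [
            t for t in self.triples
            if all(p is None or p == v for p, v in zip(pattern, t))
        ]
        return sorted(tuple(self.id_to_name[i] for i in t) for t in hits)


store = TripleStore()
# Intern the slide's three names first so they get identifiers 1, 2 and 3.
for name in ("Nut", "Red", "London"):
    store.intern(name)
# Hypothetical predicates; the slide does not name any.
store.add("Nut", "colour", "Red")
store.add("Nut", "city", "London")

print(store.name_to_id)
print(store.match(subject="Nut"))
```

A name that appears in many triples is stored only once in the name store; the triples themselves repeat only its identifier.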

State of the art (Triple Store)
• The major project in Triple Store is TriStarp.
• TriStarp was established in 1984, led by Peter King with support from IBM Hursley labs.
• Dr. Sharman from IBM Hursley [1] is visiting the TriStarp team.
• Current directions:
  • Further development of the persistent Triple Store repository.
  • Continuing research on the graph-based model.
  • Extending the technology to manage partially structured data.

State of the art (Binary)
• The Binary Model [4] considers that all tables are binary tables.

  Relation
  Sur  Pname  Color  City
  s1   Nut    Red    London
  s2   Bolt   Green  Paris
  s3   Screw  Blue   Oslo

  Binary tables
  Sur  Pname      Sur  Color      Sur  City
  s1   Nut        s1   Red        s1   London
  s2   Bolt       s2   Green      s2   Paris
  s3   Screw      s3   Blue       s3   Oslo
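
The decomposition shown on this slide can be sketched as follows, using the slide's own three rows. The function names `to_binary_tables` and `from_binary_tables` are hypothetical, and the sketch assumes every key appears in every binary table.

```python
def to_binary_tables(rows, columns, key):
    """Split an n-ary relation into one (key, attribute) table per non-key column."""
    key_idx = columns.index(key)
    tables = {}
    for idx, col in enumerate(columns):
        if idx == key_idx:
            continue
        tables[col] = [(row[key_idx], row[idx]) for row in rows]
    return tables


def from_binary_tables(tables, key):
    """Rebuild the n-ary rows by joining the binary tables on the key."""
    columns = [key] + list(tables)
    lookups = {col: dict(pairs) for col, pairs in tables.items()}
    # Take the key order from the first binary table.
    keys = [k for k, _ in next(iter(tables.values()))]
    rows = [tuple([k] + [lookups[col][k] for col in tables]) for k in keys]
    return columns, rows


columns = ["Sur", "Pname", "Color", "City"]
rows = [
    ("s1", "Nut", "Red", "London"),
    ("s2", "Bolt", "Green", "Paris"),
    ("s3", "Screw", "Blue", "Oslo"),
]

tables = to_binary_tables(rows, columns, key="Sur")
for name, pairs in tables.items():
    print(name, pairs)

# Joining on the surrogate key gives the original relation back.
print(from_binary_tables(tables, key="Sur"))
```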

State of the art (Binary)
• A major project in the Binary Model [4] is MonetDB.
• It is a DBMS designed to provide high performance on complex queries against real-world-sized databases.
• It achieves this goal using innovations at all layers of a DBMS: a storage model based on vertical fragmentation, processing speed through self-tuning relational operators, algorithms designed to exploit modern hardware, self-managing indexing structures, a modular and extensible software architecture, etc.
• It is developed at the Institute for Mathematics and Computer Science Research of the Netherlands (CWI).

State of the art (Associative)
• The Associative Model [3] comprises two types of data structures: Items and Links.

  Items
  Identifier  Name
  77          Nut
  08          Red
  32          London
  12          That is
  67          Is located in

  Links
  Identifier  Source  Verb  Target
  74          77      12    08
  03          74      67    32

• It differs from the Binary Model and the Triple Store in one fundamental way: associations themselves may be either the source or the target of other associations.
• It uses quadruplets.
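
To show what it means for an association to be the source of another association, the sketch below loads the identifiers from this slide and resolves a link recursively. The dictionary layout and the `render` helper are assumptions made for illustration, not the storage format of any associative product.

```python
# Items: identifier -> name (identifiers kept as strings to preserve "08" and "03").
items = {
    "77": "Nut",
    "08": "Red",
    "32": "London",
    "12": "That is",
    "67": "Is located in",
}

# Links: identifier -> (source, verb, target). Together with its own
# identifier, each link is a quadruplet.
links = {
    "74": ("77", "12", "08"),
    "03": ("74", "67", "32"),  # the source of this link is link 74 itself
}


def render(ident, items, links):
    """Resolve an identifier to text; a link is rendered recursively."""
    if ident in items:
        return items[ident]
    source, verb, target = links[ident]
    parts = [render(part, items, links) for part in (source, verb, target)]
    return "(" + " ".join(parts) + ")"


print(render("74", items, links))
print(render("03", items, links))
```

Link 03 does not repeat the content of link 74; it refers to it by identifier, which is the difference from the Binary Model and the Triple Store that the slide points out.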

State of the art (Associative)
• The major product in the Associative Model is SentencesDB.
• Instead of using a separate, unique table for every different type of data, it uses a single, generic structure to contain all types of data.
• Information about the logical structure of the data and the rules that govern it is stored alongside the data in the database.
• The programs are truly reusable, and no longer need to be amended when the data structures change.

State of the art (Transrelational)
• The TransRelational™ Model [5] keeps the Relational Model itself but abandons the record storage structure. It uses two structures:

  The Field Values Table          The Record Reconstruction Table
  P#  PNAME  COLOR  CITY
  P1  Bolt   Blue   London        4  3  2  1
  P2  Cam    Blue   London        1  1  4  4
  P3  Cog    Green  London        5  6
  P4  Nut    Red    Oslo          6  4  1  3
  P5  Screw  Red    Paris         2  2  3  2
  P6  Screw  Red    Paris         3  5  6  5

• Since there is currently no instantiation of the Transrelational Model available, we will build an implementation of the essential algorithms.

Transrelational Algorithms

Building the tables:
1. A file for the parts relation.
2. Sort each column in ascending order.

  Relation                        Field Values Table (FVT)        Record Reconst. Table (RRT)
  P#  PNAME  COLOR  CITY          P#  PNAME  COLOR  CITY
  P1  Nut    Red    London        P1  Bolt   Blue   London        4  3  2  1
  P2  Bolt   Green  Paris         P2  Cam    Blue   London        1  1  4  4
  P3  Screw  Blue   Oslo          P3  Cog    Green  London        5  6
  P4  Screw  Red    London        P4  Nut    Red    Oslo          6  4  1  3
  P5  Cam    Blue   Paris         P5  Screw  Red    Paris         2  2  3  2
  P6  Cog    Red    London        P6  Screw  Red    Paris         3  5  6  5

Reconstructing a record:
1. Go to cell [1, 1] of the FVT and fetch the value stored (P1).
2. Go to the same cell [1, 1] in the RRT and fetch the value (4). It is interpreted to mean that the next field value (PNAME) is in the 4th row of the FVT. Go to that cell and fetch the value (Nut).
3. Go to the corresponding RRT cell [4, 2] and fetch the row number (4). The next field value (3rd column, COLOR) is in the 4th row of the FVT (Red).
4. Go to the corresponding RRT cell [4, 3] and fetch the value (1). The next field value (4th column, CITY) is in the 1st row of the FVT (London).
5. Go to the corresponding RRT cell [1, 4] and fetch the value (1). A 5th column does not exist, so the walk wraps around to the 1st column, at the 1st row of the FVT, where it started.

Reconstructed record: P# P1, PNAME Nut, COLOR Red, CITY London.
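
The build and reconstruction steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: indices are 0-based (the slide counts from 1), equal values are ordered by their position in the input (Python's stable sort), so pointer values for duplicated values may differ from the table shown, and the names `build_fvt_rrt` and `reconstruct` are hypothetical.

```python
def build_fvt_rrt(records):
    """Build the Field Values Table and Record Reconstruction Table (0-based)."""
    n_rows, n_cols = len(records), len(records[0])
    # perm[j][i]: index of the record whose column-j value sits at sorted position i.
    perm = [
        sorted(range(n_rows), key=lambda r, j=j: records[r][j])
        for j in range(n_cols)
    ]
    # inv[j][r]: sorted position of record r within column j.
    inv = [{r: i for i, r in enumerate(p)} for p in perm]
    # FVT: every column sorted independently in ascending order.
    fvt = [[records[perm[j][i]][j] for j in range(n_cols)] for i in range(n_rows)]
    # RRT cell [i][j]: row to visit in the next column (wrapping to column 0).
    rrt = [
        [inv[(j + 1) % n_cols][perm[j][i]] for j in range(n_cols)]
        for i in range(n_rows)
    ]
    return fvt, rrt


def reconstruct(fvt, rrt, start_row):
    """Rebuild one record, starting from row start_row of the first column."""
    record, row = [], start_row
    for col in range(len(fvt[0])):
        record.append(fvt[row][col])
        row = rrt[row][col]  # follow the pointer into the next column
    return tuple(record)


# The relation from the slide.
parts = [
    ("P1", "Nut", "Red", "London"),
    ("P2", "Bolt", "Green", "Paris"),
    ("P3", "Screw", "Blue", "Oslo"),
    ("P4", "Screw", "Red", "London"),
    ("P5", "Cam", "Blue", "Paris"),
    ("P6", "Cog", "Red", "London"),
]

fvt, rrt = build_fvt_rrt(parts)
rebuilt = [reconstruct(fvt, rrt, i) for i in range(len(parts))]
for record in rebuilt:
    print(record)
```

Starting from each row of the first column in turn yields the records in P# order, because that column is sorted by P#.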

Alternative Data Models Comparison

  Model            Storage Structure   Linkage Structure
  Relational       Table (Relation)    By position
  Triple Store     Name Store          Triple Store
  Binary           Table               Joins
  Associative      Items               Links
  Transrelational  Field Values Table  Record Reconstruction Table

Our proposal (Our aims)
• To carry out an impartial survey of alternative data models.
• To determine whether the use of alternative data models can improve data density in Data Warehouse environments.
• To observe the effect that such an increase in data density has on data sparsity.

Our proposal (How…)
• We intend to use an implementation of each data model, including the TransRelational™ model.
• We will use the TPC-H data set to load each database.
• We will run a set of benchmark metrics where they are available; where they are not, we will develop our own metrics to determine relative performance, and then consider relative data density and sparsity.

Just Remember…
• Instead of storing data horizontally, store it vertically and eliminate duplicate values.

  Figure: four columns stored vertically, each holding only distinct values, labelled "Here are the savings":
  123, 456, 789, 234, 567
  Bolt, Screw, Nut, Nail
  Black, Blue, White
  Paris, London

• We are abandoning the traditional record structure; we are going “off the record”.
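
One way to picture the saving is a per-column dictionary encoding, sketched below. This is an assumed technique chosen for illustration, not necessarily how any of the surveyed models store data. The column values are the ones in the figure, but the way they are combined into rows is invented, as are the names `to_columns` and `to_rows`.

```python
def to_columns(rows, column_names):
    """Store each column as (distinct sorted values, per-row codes into them)."""
    store = {}
    for idx, name in enumerate(column_names):
        values = [row[idx] for row in rows]
        distinct = sorted(set(values))
        code_of = {value: code for code, value in enumerate(distinct)}
        store[name] = (distinct, [code_of[value] for value in values])
    return store


def to_rows(store, column_names):
    """Decode the column store back into row tuples."""
    decoded = [
        [store[name][0][code] for code in store[name][1]] for name in column_names
    ]
    return [tuple(col[i] for col in decoded) for i in range(len(decoded[0]))]


column_names = ["id", "name", "colour", "city"]
# Hypothetical row assignment; the figure only shows the distinct column values.
rows = [
    (123, "Bolt", "Black", "Paris"),
    (456, "Screw", "Blue", "London"),
    (789, "Nut", "White", "Paris"),
    (234, "Nail", "Black", "London"),
    (567, "Bolt", "Blue", "Paris"),
]

store = to_columns(rows, column_names)
horizontal_values = len(rows) * len(column_names)
distinct_values = sum(len(distinct) for distinct, _ in store.values())
print(f"values stored horizontally: {horizontal_values}")
print(f"distinct values stored vertically: {distinct_values}")
```

In this sketch the per-row codes still take space, so the actual saving depends on how compactly the links between columns are represented.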

Questions?

Thanks!
victor@macs.hw.ac.uk
Lachlan@macs.hw.ac.uk

References
1. G. C. H. Sharman and N. Winterbottom. The Universal Triple Machine: a Reduced Instruction Set Repository Manager. Proceedings of BNCOD 6, pp. 189-214, 1988.
2. TriStarp Web Site: http://www.dcs.bbk.ac.uk/~tristarp. Updated November 2000.
3. Simon Williams. The Associative Model of Data, Second Edition. Lazy Software Ltd. ISBN 1-903453-01-1. www.lazysoft.com
4. MonetDB. © 1994-2004 by CWI. http://monetdb.cwi.nl
5. Date, C. J. An Introduction to Database Systems, Eighth Edition. Appendix A: The Transrelational Model. Addison-Wesley, USA, 2004. ISBN 0-321-18956-6.
6. Pendse, Nigel. Database explosion. http://www.olapreport.com. Updated August 2003.
7. Codd, E. F. The Relational Model for Database Management: Version 2. Addison-Wesley, 1990. ISBN 0-201-14192-2.