CMU SCS Carnegie Mellon Univ Dept of Computer
- Slides: 46
CMU SCS Carnegie Mellon Univ. Dept. of Computer Science 15 -415/615 - DB Applications C. Faloutsos – A. Pavlo Lecture#25: Column Stores
CMU SCS Administrivia • • HW 7 Phase 1: Wed Nov 9 th HW 7 Phase 2: Mon Nov 28 th HW 8: Mon Dec 5 th Final Exam: Tue Dec 13 th @ 5: 30 pm – Exam will be held in two locations. We will send an email with your assigned room. Faloutsos/Pavlo CMU SCS 15 -415/615 2
CMU SCS Today’s Class • • • Storage Models System Architectures Vectorization Compression Data Modification Faloutsos/Pavlo CMU SCS 15 -415/615 3
CMU SCS Wikipedia Example CREATE TABLE useracct ( user. ID INT PRIMARY KEY, user. Name VARCHAR UNIQUE, ⋮ ); CREATE TABLE pages ( page. ID INT PRIMARY KEY, title VARCHAR UNIQUE, latest INT REFERENCES revisions (rev. ID), ); CREATE TABLE revisions ( rev. ID INT PRIMARY KEY, page. ID INT REFERENCES pages (page. ID), user. ID INT REFERENCES useracct (user. ID), content TEXT, updated DATETIME ); 4
CMU SCS OLTP • On-line Transaction Processing: – Short-lived txns. – Small footprint. – Repetitive operations. SELECT * FROM useracct WHERE user. Name = ? AND user. Pass = ? UPDATE useracct SET last. Login = NOW(), hostname = ? WHERE user. ID = ? Faloutsos/Pavlo SELECT FROM INNER ON WHERE P. *, R. * pages AS P JOIN revisions AS R P. latest = R. rev. ID P. page. ID = ? INSERT INTO revisions VALUES (? , ? …, ? ) 5
CMU SCS OLAP • On-line Analytical Processing: – Long running queries. – Complex joins. – Exploratory queries. SELECT COUNT(U. last. Login), EXTRACT(month FROM U. last. Login) AS month FROM useracct AS U WHERE U. hostname LIKE ‘%. gov’ GROUP BY EXTRACT(month FROM U. last. Login) Faloutsos/Pavlo CMU SCS 15 -415/615 6
CMU SCS Data Storage Models • There are different ways to store tuples. • We have been assuming the n-ary storage model this entire semester. Faloutsos/Pavlo CMU SCS 15 -415/615 7
CMU SCS n-ary Storage Model • The DBMS stores all attributes for a single tuple contiguously in a block. NSM Disk Page Faloutsos/Pavlo user. ID user. Name user. Pass hostname last. Login CMU SCS 15 -415/615 8
CMU SCS n-ary Storage Model SELECT * FROM useracct WHERE user. Name = ? AND user. Pass = ? B+Tree INSERT INTO useracct VALUES (? , …? ) NSM Disk Page Faloutsos/Pavlo user. ID user. Name user. Pass hostname last. Login CMU SCS 15 -415/615 9
CMU SCS n-ary Storage Model SELECT COUNT(U. last. Login), EXTRACT(month FROM U. last. Login) AS month FROM useracct AS U WHERE U. hostname LIKE ‘%. gov’ GROUP BY EXTRACT(month FROM U. last. Login) X NSM Disk Page Faloutsos/Pavlo user. ID user. Name user. Pass hostname last. Login CMU SCS 15 -415/615 10
CMU SCS n-ary Storage Model • Advantages – Fast inserts, updates, and deletes. – Good for queries that need the entire tuple. • Disadvantages – Not good for scanning large portions of the table and/or a subset of the attributes. Faloutsos/Pavlo CMU SCS 15 -415/615 11
CMU SCS Decomposition Storage Model • The DBMS stores a single attribute for all tuples contiguously in a block. DSM Disk Page user. ID last. Login user. Name hostname user. ID hostname user. Name hostname user. Pass hostname last. Login hostname hostname hostname user. Pass Faloutsos/Pavlo CMU SCS 15 -415/615 12
CMU SCS Decomposition Storage Model SELECT COUNT(U. last. Login), EXTRACT(month FROM U. last. Login) AS month FROM useracct AS U WHERE U. hostname LIKE ‘%. gov’ GROUP BY EXTRACT(month FROM U. last. Login) DSM Disk Page Faloutsos/Pavlo hostname hostname hostname hostname hostname CMU SCS 15 -415/615 13
CMU SCS Decomposition Storage Model • Advantages – Reduces the amount wasted I/O because the DBMS only reads the data that it needs. – Better query processing and data compression (more on this later). • Disadvantages – Slow for point queries, inserts, updates, and deletes because of tuple splitting/stitching. Faloutsos/Pavlo CMU SCS 15 -415/615 14
CMU SCS History • • • 1970 s: Cantor DBMS 1980 s: DSM Proposal 1990 s: Sybase. IQ (in-memory only) 2000 s: Vertica, Vector. Wise, Monet. DB 2010 s: Cloudera Impala, Amazon Redshift, “The Big Three” Faloutsos/Pavlo CMU SCS 15 -415/615 15
CMU SCS System Architectures • Fractured Mirrors • Partition Attributes Across (PAX) • Pure Columnar Storage Faloutsos/Pavlo CMU SCS 15 -415/615 16
CMU SCS Fractured Mirrors • Store a second copy of the database in a DSM layout that is automatically updated. – Examples: Oracle, IBM DB 2 BLU NSM Faloutsos/Pavlo DSM CMU SCS 15 -415/615 17
CMU SCS PAX • Data is still stored in NSM blocks, but each block is organized as mini columns. PAX Disk Page user. ID user. Name user. Pass hostname Faloutsos/Pavlo last. Login CMU SCS 15 -415/615 18
CMU SCS Column Stores • Entire system is designed for columnar data. – Query Processing, Storage, Operator Algorithms, Indexing, etc. – Examples: Vertica, Vector. Wise, Paraccel, Cloudera Impala, Amazon Redshift Faloutsos/Pavlo CMU SCS 15 -415/615 19
CMU SCS Today’s Class • • • Storage Models System Architectures Vectorization Compression Data Modification Faloutsos/Pavlo CMU SCS 15 -415/615 20
CMU SCS Query Processing Strategies • The DBMS needs to process queries differently when using columnar data. • We have already discussed the Iterator Model for processing tuples in the DBMS query operators. Faloutsos/Pavlo CMU SCS 15 -415/615 21
CMU SCS Iterator Model • Each operator calls next() on their child operator to process tuples one at a time. SELECT cname, amt FROM customer, account WHERE customer. acctno = account. acctno AND account. amt > 1000 next p � cname, amt next acctno=acctno s CUSTOMER Faloutsos/Pavlo CMU SCS 15 -415/615 next amt>1000 ACCOUNT 22
CMU SCS Materialization Model • Each operator consumes its entire input and generates the full output all at once. SELECT cname, amt FROM customer, account WHERE customer. acctno = account. acctno AND account. amt > 1000 p � cname, amt acctno=acctno s CUSTOMER Faloutsos/Pavlo CMU SCS 15 -415/615 amt>1000 ACCOUNT 23
CMU SCS Observations • The Iterator Model is bad with a DSM because it requires the DBMS to stitch tuples back together each time. • The Materialization Model is a bad because the intermediate results may be larger than the amount of memory in the system. Faloutsos/Pavlo CMU SCS 15 -415/615 24
CMU SCS Vectorized Model • Like the Iterator Model but each next() invocation returns a vector of tuples instead of a single tuple. • This vector does not have to contain the entire tuple, just the attributes that are needed for query processing. Faloutsos/Pavlo CMU SCS 15 -415/615 25
CMU SCS Vectorized Model • Each operator calls next() on their child operator to process vectors. SELECT cname, amt FROM customer, account WHERE customer. acctno = account. acctno AND account. amt > 1000 next M cname, amt � acctno CUSTOMER Faloutsos/Pavlo CMU SCS 15 -415/615 acctno=acctno next Macctno, amt next s amt>1000 amt ACCOUNT 26
CMU SCS Virtual IDs vs. Offsets • Need a way to stitch tuples back together. • Two approaches: – Fixed length offsets – Virtual ids embedded in columns user. ID user. Name user. Pass hostname Offsets last. Login user. ID user. Name user. Pass hostname Virtual Ids last. Login 27
CMU SCS Vectorized Model • Reduced interpretation overhead. • Better cache locality. • Compiler optimization opportunities. • AFAIK, Vector. Wise is still the only system that uses this model. Other systems use query compilation instead… Faloutsos/Pavlo CMU SCS 15 -415/615 28
CMU SCS Today’s Class • • • Storage Models System Architectures Vectorization Compression Data Modification Faloutsos/Pavlo CMU SCS 15 -415/615 29
CMU SCS Compression Overview • Compress the database to reduce the amount of I/O needed to process queries. • DSM databases compress much better than NSM databases. – Storing similar data together is ideal for compression algorithms. Faloutsos/Pavlo CMU SCS 15 -415/615 30
CMU SCS Naïve Compression • Use a general purpose algorithm to compress pages when they are stored on disk. – Example: 10 KB page in memory, 4 KB compressed page on disk. • Do we have to decompress the page when it is brought into memory? Why or why not? Faloutsos/Pavlo CMU SCS 15 -415/615 31
CMU SCS Fixed-width Compression • Sacrifice some compression in exchange for having uniform-length values Tuples per attribute. are no longer aligned at offsets user. ID user. Name user. Pass hostname last. Login Variable-Length Fixed-Length Compression Original Data Faloutsos/Pavlo user. ID CMU SCS 15 -415/615 32
CMU SCS Run-length Encoding • Compress runs of the same value into a compact triplet: All tuples are sorted start. Position, run. Length) Reduces the # of – (value, on this column. triplets user. ID sex user. ID M M F M F M (M, 0, 3) (M, 0, 6) (F, 3, 2) (F, 6, 2) (M, 5, 3) Original Sorted Data Faloutsos/Pavlo sex Unsorted Sorted RLE CMU SCS 15 -415/615 33
CMU SCS Delta Encoding • Record the difference between successive values in the same column. time temp 12: 00 99. 5 12: 01 12: 02 99. 4 99. 5 +1 +1 -1 0 (+1, 7) -1 0 12: 03 99. 6 +1 +1 +1 12: 04 99. 6 +1 +1 +1 12: 05 12: 06 99. 5 99. 4 +1 +1 0 -1 12: 07 99. 5 +1 0 0 Original Data Faloutsos/Pavlo Delta Encoding CMU SCS 15 -415/615 Delta+RLE 34
CMU SCS Bit-Vector Encoding • Store a separate bit-vector for each unique value for a particular attribute where an offset in the vector corresponds to a tuple. user. ID sex M → 1 1 1 0 0 1 1 1 F → 0 0 0 1 1 0 0 0 M M M F F M M M A ‘ 1’ means that the tuple at that offset has the bit-vector’s value Original Data Faloutsos/Pavlo sex Bit-Vector Compression CMU SCS 15 -415/615 35
CMU SCS Dictionary Compression • Replace frequent patterns with smaller integer codes. – Need to support fast encoding and decoding. – Need to also support range queries. Faloutsos/Pavlo CMU SCS 15 -415/615 36
CMU SCS Dictionary Compression • Construct a separate table of the unique values for an attribute sorted by value. SELECT * FROM users WHERE name LIKE ‘Tru%’ SELECT * FROM users WHERE name BETWEEN 70 AND 80 user. Id name value code 101 Truman 101 70 Bush 10 102 103 Obama Bush 102 103 50 10 Carter Ford 20 30 104 Reagan 104 60 Nixon 40 105 Trump 105 80 Obama 50 106 107 Nixon Carter 106 107 40 20 Reagan Truman 60 70 108 Ford 108 30 Trump 80 Original Data Compressed Data 37
CMU SCS Dictionary Compression • A dictionary needs to support two operations: – Encode: For a given uncompressed value, convert it into its compressed form. – Decode: For a given compressed value, convert it back into its original form. • We need two data structures to support operations in both directions. Faloutsos/Pavlo CMU SCS 15 -415/615 38
CMU SCS Summary • Some operator algorithms can operate directly on compressed data – Saves I/O without having to decompress! • Difficult to implement when the DBMS uses multiple compression schemes. • It’s generally good to wait as long as possible to materialize/decompress data when processing queries… Faloutsos/Pavlo CMU SCS 15 -415/615 39
CMU SCS Today’s Class • • • Storage Models System Architectures Vectorization Compression Data Modification Faloutsos/Pavlo CMU SCS 15 -415/615 40
CMU SCS Bifurcated Architecture • All txns are executed on OLTP database. • Periodically migrate changes to OLAP database. OLTP Extract Transform Load OLAP Data Warehouse OLTP Faloutsos/Pavlo CMU SCS 15 -415/615 41
CMU SCS Modifying a Column Store • Updating compressed data is expensive. • Updating sorted data is expensive. • The DBMS will store updates in an staging area and then apply them in batches. – Have to make sure that we execute queries on both the staging and main storage. Faloutsos/Pavlo CMU SCS 15 -415/615 42
CMU SCS Delta Store • Stage updates in delta store and periodically apply them in batches to the main storage. – Examples: Vertica, SAP HANA Delta Faloutsos/Pavlo Main CMU SCS 15 -415/615 43
CMU SCS HTAP • Hybrid Transaction-Analytical Processing • Single database instance that can handle both OLTP workloads and OLAP queries. – Row-store for OLTP – Column-store for OLAP – Examples: SAP HANA, Mem. SQL, Hy. Per, Splice. Machine, Peloton, Cloudera Kudu (? ? ? ) Faloutsos/Pavlo CMU SCS 15 -415/615 44
CMU SCS Conclusion • If you’re running OLAP queries, you need to be using a column store. • Don’t let anybody try to tell you otherwise. Faloutsos/Pavlo CMU SCS 15 -415/615 45
CMU SCS Rest of the Semester • • Mon Nov 28 th – Column Stores Wed Nov 30 th – Data Warehousing + Mining Mon Dec 5 th – Splice. Machine Guest Speaker Wed Dec 7 th –Review + Systems Potpourri http: //cmudb. io/f 16 -systems Faloutsos/Pavlo CMU SCS 15 -415/615 46
- Cmu comp bio
- Carnegie mellon interdisciplinary
- Carnegie mellon software architecture
- Bomb lab secret phase
- Carnegie mellon software architecture
- Cmu sparcs
- Cmu mism
- Randy pausch carnegie mellon
- Kevin thompson nsf
- Carnegie mellon
- Carnegie mellon
- Cmu vpn
- Carnegie mellon
- Carnegie mellon
- Carnegie mellon
- Frax
- Carnegie mellon fat letter
- 15-513 cmu
- Cmu bomb lab
- Mellon serbia iskustva
- Carneigh mellon
- Conclusion of breastfeeding
- Wageworks health equity
- Mellon tubes
- Water mellon
- Mellon elf
- Mellon elf
- Mellon elf
- Scs desco
- Applied hydrology
- Lluvia neta
- Lengkung scs
- Scs curve number
- Simbol scs
- Scs curve number
- Scs tiristor
- Color 9132005
- Scs.ryerson.ca harley
- Simbol scs
- Scs reasonable person principle
- Scs thyristor
- Scs carleton
- Scs archiver
- Diagram superelevasi scs
- Scs elogs
- Scs lulu
- Scs methode