Variations of the Star Schema Benchmark to Test
Variations of the Star Schema Benchmark to Test the Effects of Data Skew on Query Performance TILMANN RABL, MEIKEL POESS, HANS-ARNO JACOBSEN, PATRICK AND ELIZABETH O’NEIL ICPE 2013, PRAGUE, 24/04/2013 MIDDLEWARE SYSTEMS RESEARCH GROUP MSRG. ORG
Real Life Data is Distributed Uniformly… Well, Not Really ◦ Customers zip codes typically clustered around metropolitan areas ◦ Seasonal items (lawn mowers, snow shovels, …) sold mostly during specific periods ◦ US retail sales: ◦ peak during Holiday Season ◦ December sales are 2 x of January sales Source: US Census Data RABL, POESS, JACOBSEN, O'NEIL - SSB SKEW VARIATIONS 2
Student Seminar Signup Distribution RABL, POESS, JACOBSEN, O'NEIL - SSB SKEW VARIATIONS 3
How Can Skew Effect Database Systems? Data placement ◦ Partitioning ◦ Indexing Data structures ◦ Tree balance ◦ Bucket fill ratio ◦ Histograms Optimizer finding the optimal query plan ◦ Index vs. non-index driven plans ◦ Hash join vs. merge join ◦ Hash group by vs. sort group by RABL, POESS, JACOBSEN, O'NEIL - SSB SKEW VARIATIONS 4
Agenda Data Skew in Current Benchmarks Star Schema Benchmark (SSB) Parallel Data Generation Framework (PDGF) Introducing Skew in SSB RABL, POESS, JACOBSEN, O'NEIL - SSB SKEW VARIATIONS 5
Data Skew in Benchmarks TPC-D (1994 -1999): only uniform data ◦ SIGMOD 1997 - “Successor of TPC-D should include data skew” ◦ No effect until … TPC-DS (released 2012) ◦ Contains comparability zones ◦ Not fully utilized TPC-D/H variations ◦ Chaudhuri and Narayasa: Zipfian distribution on all columns ◦ Crolotte and Ghazal: comparability zones Still lots of open potential RABL, POESS, JACOBSEN, O'NEIL - SSB SKEW VARIATIONS 6
Star Schema Benchmark I Star schema version of TPC-H ◦ ◦ Merged Order and Lineitem Date dimension Dropped Partsupp Selectivity hierarchies ◦ C_City C_Nation C_Region ◦ … RABL, POESS, JACOBSEN, O'NEIL - SSB SKEW VARIATIONS 7
Star Schema Benchmark II Completely new set of queries 4 flights of 3 -4 queries ◦ Designed for functional coverage and selectivity coverage ◦ Drill down in dimension hierarchies ◦ Predefined selectivity Q 1. 1 select from where and and sum(lo_extendedprice*lo_discount ) as revenue lineorder, date lo_orderdate = d_datekey Dri d_year = 1993 lldo lo_discount between 1 and 3 wn lo_quantity < 25; Q 1. 2 select from where and and sum(lo_extendedprice*lo_discount ) as revenue lineorder, date lo_orderdate = d_datekey d_yearmonthnum = 199301 lo_discount between 1 and 3 lo_quantity between 26 and 35; RABL, POESS, JACOBSEN, O'NEIL - SSB SKEW VARIATIONS 8
Parallel Data Generation Framework Generic data generation framework Relational model ◦ Schema specified in configuration file ◦ Post-processing stage for alternative representations Repeatable computation ◦ Based on XORSHIFT random number generators ◦ Hierarchical seeding strategy Frank, Poess, and Rabl: Efficient Update Data Generation for DBMS Benchmarks. ICPE '12. Rabl and Poess: Parallel Data Generation for Performance Analysis of Large, Complex RDBMS. DBTest '11. Poess, Rabl, Frank, and Danisch: A PDGF Implementation for TPC-H. TPCTC '11. Rabl, Frank, Sergieh, and Kosch: A Data Generator for Cloud-Scale Benchmarking. TPCTC '10. RABL, POESS, JACOBSEN, O'NEIL - SSB SKEW VARIATIONS 9
XML PDGF DB Configuring PDGF Schema configuration Relational model ◦ Tables, fields Properties ◦ Table size, characters, … Generators ◦ Simple generators ◦ Metagenerators Update definition ◦ Insert, update, delete ◦ Generated as change data capture <table name="SUPPLIER"> <size>${S}</size> <field name="S_SUPPKEY" size="" type="NUMERIC“ primary="true" unique="true"> <gen_Id. Generator /> </field> <field name="S_NAME" size="25" type="VARCHAR"> <gen_Pre. Postfix. Generator> <gen_Padding. Generator> <gen_Other. Field. Value. Generator> <reference field="S_SUPPKEY" /> </gen_Other. Field. Value. Generator > <character>0</character> <pad. To. Left>true</pad. To. Left> <size>9</size> </gen_Padding. Generator > <prefix>Supplier </prefix> </gen_Pre. Postfix. Generator> </field> [. . ] RABL, POESS, JACOBSEN, O'NEIL - SSB SKEW VARIATIONS 10
Opportunities to Inject Data Skew in Foreign key relations ◦ E. g. , L_PARTKEY One fact table measures ◦ E. g. , L_Quantity Single dimension hierarchy ◦ E. g. , P_Brand → P_Category → P_Mfgr Multiple dimension hierarchies ◦ E. g. , City → Nation in Supplier and Customer Experimental methodology ◦ ◦ One experiment series for each of the above Comparison to original SSB Comparison of index-forced, non-index, and automatic optimizer mode SSB scale factor 100 (100 GB), x 86 server RABL, POESS, JACOBSEN, O'NEIL - SSB SKEW VARIATIONS 11
Skew in Foreign Key Relations Very realistic Easy to implement in PDGF ◦ Just add a distribution to the reference <distribution name="Exponential“ lambda="0. 26235" /> But! Dimension attributes uniformly distributed Dimension keys uncorrelated to dimension attributes Very limited effect on selectivity Focus on attributes in selectivity predicates RABL, POESS, JACOBSEN, O'NEIL - SSB SKEW VARIATIONS 12
Skew in Fact Table Measure – Lo_Quantity distribution ◦ Values range between 0 and 50 ◦ Originally uniform distribution with: ◦ P(X=x)=0. 02 ◦ Coefficient of variation of 0. 00000557 ◦ Proposed skewed distribution with: ◦ Query 1. 1 ◦ lo_quantity < x, x ∈ [2, 51] Results ◦ Switches too early to non-index plan ◦ Switches too late to non-index plan ◦ Optimizer agnostic to distribution RABL, POESS, JACOBSEN, O'NEIL - SSB SKEW VARIATIONS 13
Skew in Single Dimension Hierarchy - Part P_Category distribution ◦ Uniform P(X=x)=0. 04 ◦ Skewed P(X=x)= 0. 01 - 48. 36 ◦ Probabilities explicitly defined Query 2. 1 ◦ Restrictions on two dimensions Results uniform case ◦ Index driven superior ◦ Optimizer chooses non-index driven Results skewed case ◦ Switches too early to non-index plan RABL, POESS, JACOBSEN, O'NEIL - SSB SKEW VARIATIONS 14
Skew in Multiple Dimension Hierarchies – S_City & C_City Skewed S_City & C_City ◦ Probabilites exponentially distributed Query 3. 3 Join Cardinality Elapsed Time ◦ Restrictions on 3 dimensions ◦ Variation on Supplier and Customer city Results uniform and skewed cases ◦ Automatic plan performs best ◦ Cross over between automatic uniform and skewed too late RABL, POESS, JACOBSEN, O'NEIL - SSB SKEW VARIATIONS 15
Conclusion & Future Work PDGF implementation of SSB Introduction of skew in SSB Extensive performance analysis ◦ Several interesting optimizer effects ◦ Performance impact of skew Future Work Further analysis on impact of skew Skew in query generation Complete suite for testing skew effects RABL, POESS, JACOBSEN, O'NEIL - SSB SKEW VARIATIONS 16
Thanks Questions? Download and try PDGF: http: //www. paralleldatageneration. org (scripts used in the study available on website above) RABL, POESS, JACOBSEN, O'NEIL - SSB SKEW VARIATIONS 17
Back-up Slides RABL, POESS, JACOBSEN, O'NEIL - SSB SKEW VARIATIONS 18
Configuring PDGF Generation configuration Defines the output ◦ ◦ Scheduling Data format Sorting File name and location Post processing ◦ ◦ Filtering of values Merging of tables Splitting of tables Templates (e. g. XML / queries) <table name="QUERY_PARAMETERS" exclude="false" > <output name="Compiled. Template. Output" > [. . ] <template ><!-int y = (fields [0]. get. Plain. Value ()). int. Value (); int d = (fields [1]. get. Plain. Value ()). int. Value (); int q = (fields [2]. get. Plain. Value ()). int. Value (); String n = pdgf. util. Constants. DEFAULT_LINESEPARATOR; buffer. append("-- Q 1. 1" + n); buffer. append("select sum(lo_extendedprice *"); buffer. append(" lo_discount) as revenue" + n); buffer. append(“ from lineorder , date" + n); buffer. append(“ where lo_orderdate = d_datekey" + n); buffer. append(“ and d_year = " + y + n); buffer. append(“ and lo_disc between " + (d - 1)); buffer. append(“ and " + (d + 1) + n); buffer. append(“ and lo_quantity < " + q + "; " + n); --></template > </output > </table > RABL, POESS, JACOBSEN, O'NEIL - SSB SKEW VARIATIONS 19
- Slides: 19