Tampa Bay Relational Users Group IBM Silicon Valley

Tampa Bay Relational Users Group IBM Silicon Valley Lab, U. S. A. DB 2 for z/OS Query Optimization Ó IBM Corporation 2003 Ó IBM Corporation 2010

2 Filter factor issues • Filter factor accuracy important for. . . – Index matching • Accurately estimate index cost – Total index filtering • Estimate table access cost via index(es) • Choose how to use index (prefetch? ) – Total table level filtering • Efficient join order • Efficient join method • Appropriate sorts DB 2 for z/OS Query Optimization Ó IBM Corporation 2010

3 Terminology • Correlation – When data on two columns is not independent – Eg. CITY, STATE • Every city does not exist in every state. • Data Skew (or skew) – Describes situation where data is non-uniformly distributed – Data can be point-skewed on a value or skewed over a range – Eg. Gender • Domain (M, F) • 35% = M, 65% = F DB 2 for z/OS Query Optimization Ó IBM Corporation 2010

4 Terminology (cont. ) • MFREQ: Multi-column frequency – Frequency on concatenated column group – MFREQ(C 1, C 2, C 3) • MCARD: Multi-column cardinality – Multi-column cardinality on a column group – MCARD(C 1, C 2, C 3) DB 2 for z/OS Query Optimization Ó IBM Corporation 2010

5 Selectivity statistics • Single column – Cardinality – HIGH 2 KEY/LOW 2 KEY – Frequency • Multi-column – Cardinality – Frequency – Histogram DB 2 for z/OS Query Optimization Ó IBM Corporation 2010

6 Single column cardinality • Single column cardinality – Number of distinct values for a column – Assumes uniform distribution – Stored as • SYSCOLUMNS. COLCARDF • SYSINDEXES. FIRSTKEYCARDF – Used when better statistics can't be used. . . • Host variables, parameter markers, special registers • No other statistics available DB 2 for z/OS Query Optimization Ó IBM Corporation 2010

7 RUNSTATS column cardinality • How to collect – RUNSTATS command: RUNSTATS TABLESPACE (DBNAME. TSNAME) TABLE (ALL or PAT_TABLE) COLUMN(ALL or <list of columns>) – Leading column of index when RUNSTATS on index performed RUNSTATS INDEX (PAT_INDEX) RUNSTATS TABLESPACE (DBNAME. TSNAME) INDEX(ALL) DB 2 for z/OS Query Optimization Ó IBM Corporation 2010

8 Single column cardinality Select C 2 from T 1 Where C 1 = ? Index I 1: C 1, C 2, C 3 T 1 CARDF = 100, 000 C 1 COLCARDF = 5 I 1 NLEAF = 10, 000 I 1 NLEVELS = 3 • For equals predicate, filter factor = 1/COLCARDF – Index pages --> probe + matching FF * NLEAF • 3 + (1/5) * 10, 000 = 2003 index pages – Index record ids processed = CARDF * matching index filtering • 100, 000 * (1/5) = 20, 000 – Rows returned = CARDF * total filtering • 100, 000 * (1/5) = 20, 000 DB 2 for z/OS Query Optimization Ó IBM Corporation 2010

9 Single column cardinality Select C 3 from T 1 Where C 1 = ? AND C 2 = ? Index I 1: C 1, C 2, C 3 C 1 COLCARDF = 5 C 2 COLCARDF = 10 FULLKEYCARDF 65, 000 • Two matching predicates, multiply filter factors – Index pages --> probe + matching FF * NLEAF • 3 + [(1/5) * (1/10)] * 10, 000 = ~203 index pages – Index record ids processed = CARDF * matching index filtering • 100, 000 * [(1/5) * (1/10)] =~ 2, 000 – qualified rows = CARDF * total filtering • 100, 000 * [(1/5) * (1/10)] =~ 2, 000 rows DB 2 for z/OS Query Optimization Ó IBM Corporation 2010

10 Single column cardinality Select C 2 from T 1 Where C 1 > ? Index I 1: C 1, C 2, C 3 C 1 COLCARDF = 10, 121 • Range predicate with parameter marker – Use default interpolation filter factor chart • COLCARDF 10, 121 --> FF = 1/100 – In reality, could qualify anywhere from all to no rows • Here's another sample predicate: – BIRTH_DATE <= ? • How many people in room born before parameter marker? – What if value is '1930 -01 -01'? – What if value is '1980 -01 -01? – Cannot accurately estimate without literal value DB 2 for z/OS Query Optimization Ó IBM Corporation 2010

11 Range predicate interpolation Table 104. Default filter factors for interpolation COLCARDF Factor for Op >=100, 000 1/10, 000 >=10, 000 1/3, 000 >=1, 000 1/1, 000 >=100, 000 1/300 >=10, 000 1/100 >=1, 000 1/30 >=100 1/10 >=0 1/3 Note: Op is one of these operators: <, <=, >, >=. Factor for LIKE/BETWEEN 3/100, 000 1/10, 000 3/10, 000 1/1, 000 3/1, 000 1/100 3/100 1/10 COMMENT: This is DB 2’s documented guess for an impossible to estimate Filter factor. DB 2 for z/OS Query Optimization Ó IBM Corporation 2010

12 Single column cardinality Select C 4 from T 1 Where C 1 = ? AND C 3 > ? Index I 1: C 1, C 2, C 3 CARDF = 1 million NPAGES/F = 100, 000 NLEAF = 10, 000 C 1 COLCARDF 10 C 3 COLCARDF 10, 121 clusterratiof 50 • Matching cost – Index pages --> probe + matching FF * NLEAF • probe + (1/10) * 10, 000 =~ 1, 003 pages – Index rows processed = CARDF * Matching FF • 1, 000 * (1/10) = 100, 000 rows • Screening – Rows to access table for = CARDF * (Matching and screening FF) • 1, 000 * [(1/10) * (1/100)] =~ 1, 000 rows? ? ? DB 2 for z/OS Query Optimization Ó IBM Corporation 2010

13 HIGH 2 KEY/LOW 2 KEY • HIGH 2 KEY/LOW 2 KEY – Single column statistic • SYSCOLUMNS. HIGH 2 KEY • SYSCOLUMNS. LOW 2 KEY – When used? • • • Interpolation used to estimate range predicates Like, between, <, <=, >, >= Literal value must be known As domain statistics when COLCARDF = 1 or 2 Can be used in combination with single column frequencies for more accurate estimate. • DB 2 Interpolation: Technique to estimate the percentage of rows which qualify based on known high / low values. DB 2 for z/OS Query Optimization Ó IBM Corporation 2010

14 RUNSTATS HIGH 2 KEY / LOW 2 KEY • How to collect – Whenever single column cardinality collected, HIGH 2 KEY / LOW 2 KEY also collected. – Reference RUNSTATS COLUMN CARDINALITY slide DB 2 for z/OS Query Optimization Ó IBM Corporation 2010

15 Linear interpolation Select C 4 from T 1 Where C 1 = ? AND C 3 > 50 Index I 1: C 1, C 2, C 3 CARDF = 1 million NPAGES/F = 100, 000 C 3 COLCARDF 10, 241 LOW 2 KEY 0 HIGH 2 KEY 100 clusterratiof 50 • Matching cost - same as before • Screening – Rows to access table for = CARDF * (Matching + screening FF) – 1, 000 * [(1/10) * (screening FF)] =~ ? ? ? – Interpolation for C 3 > 50 • (HIGH 2 KEY - LITVALUE) / (HIGH 2 KEY - LOW 2 KEY) • (100 - 50) / (100 - 0) = 50/100 = 0. 5 – 1, 000 * [(1/10) * (0. 5)] =~ 50, 000 rows (vs 1000 with def FF) DB 2 for z/OS Query Optimization Ó IBM Corporation 2010

16 Single column frequencies • Single column frequencies – SYSCOLDIST. FREQUENCYF • TYPE = 'F', NUMCOLUMNS = 1 – Provides non-uniform distribution information • Data skew – When used? • • Literal value must be known Equals, is null, in Like, between, <, <=, >, >= Used in conjunction with other complementary statistics DB 2 for z/OS Query Optimization Ó IBM Corporation 2010

17 RUNSTATS Frequency • Leading indexed column – PAT_INDEX (C 1, C 2, C 3) – Top 10 values RUNSTATS INDEX (PAT_INDEX) – Top 20 values RUNSTATS INDEX (PAT_INDEX FREQVAL NUMCOLS(1) COUNT(20)) – Top 0 values (purge frequencies) (PAT_INDEX FREQVAL NUMCOLS(1) COUNT(0)) DB 2 for z/OS Query Optimization Ó IBM Corporation 2010

18 RUNSTATS Frequency • How to collect – RUNSTATS COLGROUP allows collection on almost any column RUNSTATS TABLESPACE DB 1. TS 1 TABLE (PAT_TABLE) COLUMNS(C 1, C 2, C 3) COLGROUP (C 1) FREQVAL COUNT(1) MOST COLGROUP (C 2) FREQVAL COUNT(10) LEAST COLGROUP (C 3) FREQVAL COUNT(20) BOTH – Eliminate existing frequencies: COLGROUP (C 3) FREQVAL COUNT(0) MOST DB 2 for z/OS Query Optimization Ó IBM Corporation 2010

19 Single column frequency Select C 4 from T 1 Where C 1 = 'A' Index I 1: C 1, C 2, C 3 T 1 CARDF = 100, 000 C 1 COLCARDF = 5 I 1 NLEAF = 10, 000 I 1 NLEVELS = 3 C 1 'A' 'B' 'C' 'D' 'E' FREQ 0. 75 0. 15 0. 03 0. 02 • Value which exists a lot clusterratiof 50 – Index pages --> probe + matching FF * NLEAF • 3 + (0. 75) * 10, 000 =~ 7, 503 index pages – Index record ids processed = CARDF * matching index filtering • 100, 000 * (0. 75) =~ 75, 000 – Rows returned = CARDF * total filtering • 100, 000 * (0. 75) =~ 75, 000 DB 2 for z/OS Query Optimization Ó IBM Corporation 2010

20 Single column frequency Select C 4 from T 1 Where C 1 = 'Z' Index I 1: C 1, C 2, C 3 T 1 CARDF = 100, 000 C 1 COLCARDF = 5 I 1 NLEAF = 10, 000 I 1 NLEVELS = 3 ? C 1 'A' 'B' 'C' 'D' 'E' FREQ 0. 75 0. 15 0. 03 0. 02 clusterratiof 0. 50 • Looking for value not in the domain. . – Index pages --> probe + matching FF * NLEAF • 3 + (~0) * 10, 000 =~ 1 index page – Filter factor without the frequencies • 1/5 = 0. 20 DB 2 for z/OS Query Optimization Ó IBM Corporation 2010

21 Single column frequency Select C 2 from T 1 Where C 1 in ('C', 'Z', 'E') Index I 1: C 1, C 2, C 3 T 1 CARDF = 100, 000 C 1 COLCARDF = 5 I 1 NLEAF = 10, 000 I 1 NLEVELS = 3 C 1 'A' 'B' 'C' 'D' 'E' FREQ 0. 75 0. 15 0. 03 0. 02 • Some in domain, some not. . . – Index pages --> probe + matching FF * NLEAF • 3 + (0. 05 + 0. 02) * 10, 000 =~ 700 index page – Rows returned = CARDF * total filtering • 100, 000 * (0. 05 + 0. 02) =~ 7000 – Without frequencies filter factor = 3/5 = 0. 60 versus 0. 07 DB 2 for z/OS Query Optimization Ó IBM Corporation 2010

22 Single column histograms • Single column frequencies – SYSCOLDIST. FREQUENCYF • TYPE = ‘H', NUMCOLUMNS = 1 – Provides non-uniform distribution information • Range skew – When used? • • Literal value must be known Equals, is null, in Like, between, <, <=, >, >= Used in conjunction with other complementary statistics DB 2 for z/OS Query Optimization Ó IBM Corporation 2010

23 Histograms • How to collect – RUNSTATS COLGROUP allows collection on almost any column RUNSTATS TABLESPACE DB 1. TS 1 TABLE (PAT_TABLE) COLUMNS(C 1, C 2, C 3) COLGROUP (C 2) HISTOGRAM NUMQUANTILES 20 INDEX (I 1 KEYCARD HISTOGRAM NUMCOLS 1 NUMQUANTILES 30 , I 2 KEYCARD HISTOGRAM NUMCOLS 1 NUMQUANTILES 50 ) - 20 quantiles for column C 2. - 30 quantiles for leading column of index I 1. - 50 quantiles for leading column of index I 3. – Eliminate existing histograms: COLGROUP (C 3) HISTOGRAM COUNT(0) MOST DB 2 for z/OS Query Optimization Ó IBM Corporation 2010

RUNSTATS Histogram Statistics 24 • RUNSTATS will produce equal-depth histogram – Each quantile (range) will have approx same number of rows • Not same number of values – Another term is range frequency • Example • 1, 3, 3, 4, 4, 6, 7, 8, 9, 10, 12, 15 (sequenced) – Lets cut that into 3 quantiles. • 1, 3, 3, 4 6, 7, 8, 9 10, 12, 15 Seq No Low Value 1 1 4 3 5/12 2 6 9 4 4/12 3 10 15 3 3/12 DB 2 for z/OS Query Optimization High Value Cardinality Frequency Ó IBM Corporation 2010

RUNSTATS Histogram Statistics Notes 25 • RUNSTATS – Maximum 100 quantiles for a column – Same value columns WILL be in the same quantile – Quantiles will be similar size but: • • • Will try to avoid big gaps inside quantiles Highvalue and lowvalue may have separate quantiles Null WILL have a separate quantile • Supports column groups as well as single columns • Think “frequencies” for high cardinality columns DB 2 for z/OS Query Optimization Ó IBM Corporation 2010

26 Histogram Statistics Example • Customer uses INTEGER (or VARCHAR) for YEAR-MONTH WHERE YEARMONTH BETWEEN 200601 AND 200612 • Assuming data for 2006 & 2007 – – FF = (high-value – low-value) / (high 2 key – low 2 key) FF = (200612 – 200601) / (200711 – 200602) – 10% of rows estimated to return Data assumed as evenly distributed between low and high range DB 2 for z/OS Query Optimization Ó IBM Corporation 2010

Histogram Statistics Example 27 • Example (cont. ) – Data only exists in ranges 200601 -12 & 200701 -12 • Collect via histograms – 45% of rows estimated to return No data between 200613 & 200700 DB 2 for z/OS Query Optimization WHERE YEARMONTH BETWEEN 200601 AND 200612 Ó IBM Corporation 2010

Single column recommendations • 28 Cardinality – Collect on all columns used in where clause – Used regardless of literal value known • Interpolation (HIGH 2 KEY/LOW 2 KEY) – Collected with column statistics – Consider REOPT(VARS) – Dynamic • V 8 - REOPT(ONCE) • DB 2 9 – REOPT(AUTO) • Frequency – – • Optimizer requires literal value to use Used for most predicate types Collect all values for low COLCARDF columns Useful for indexed and non-indexed columns Histograms – Optimizer needs literal value to use – Can be used in virtually all scenarios frequencies are used, also join predicates – Collect histograms rather than frequencies for RANGE predicates DB 2 for z/OS Query Optimization Ó IBM Corporation 2010

29 Filter factor issues • Filter factor accuracy important for. . . – Index matching • Estimate index cost – Total index filtering • Estimate table access cost via index(es) • Choose how to use index (prefetch? ) – Table filtering • Efficient join order • Efficient join method • Appropriate sorts DB 2 for z/OS Query Optimization Ó IBM Corporation 2010

30 Multi-column cardinalities • Multi-column cardinalities (MCARD) – Stored in a few places. . . • SYSINDEXES. FULLKEYCARDF • SYSCOLDIST. CARDF – TYPE = 'C', NUMCOLUMNS > 1 – Assumes uniform distribution – When used? • Primarily for indexes • Literal values not necessary • KEYCARD for partially matching indexes – Collect for all indexes with 3 or more columns • Collect to support multi-column frequencies • Collect for all multi-column join situations DB 2 for z/OS Query Optimization Ó IBM Corporation 2010

31 RUNSTATS KEYCARD • How to collect – V 7 RUNSTATS only collects KEYCARD on leading column of index – By default, RUNSTATS only collects FIRST/FULLKEYCARDF INDEX PAT_INDEX (C 1, C 2, C 3, C 4) RUNSTATS INDEX(PAT_INDEX KEYCARD) – MCARD on leading concatenated column groups: – MCARD(C 1, C 2), MCARD(C 1, C 2, C 3) DB 2 for z/OS Query Optimization Ó IBM Corporation 2010

32 RUNSTATS COLGROUP • DB 2 V 8 allows collection of MCARD on any column group RUNSTATS TABLESPACE DB 1. TS 1 TABLE(PAT_TABLE) COLUMN(C 1, C 2, C 3, C 4) COLGROUP(C 1, C 4) Specifying COLGROUP with multiple columns collects multi-column cardinality on the group. DB 2 for z/OS Query Optimization Ó IBM Corporation 2010

33 Multi-column cardinality • Useful for local predicates SELECT FROM WHERE AND name, address, . . . CUST City = ? State = ? Last_name = ? INDEX Index column 1 stkey KEYCARD FULLKEY IX 1 City, State, Zip IX 2 Last_name 10, 000 20, 000 12, 000 N/A 1, 000 20, 000 COLCARDF STATE IX 1 Filter Factor without KEYCARD IX 1 Filter Factor WITH KEYCARD Last_name DB 2 for z/OS Query Optimization 50 1 / 500, 000 1 / 12, 000 1 / 20, 000 keycard!!! Ó IBM Corporation 2010

34 Multi-column cardinality • Useful for join predicates SELECT FROM WHERE AND cols T 1, T 2 T 1. C 1 = T 2. C 1 T 1. C 2 = T 2. C 2 INDEX I 1 (C 1, C 2) on table T 1 INDEX I 2 (C 1, C 2, C 3) on table T 2 KEYCARD is necessary on index I 2 to accurately estimate T 1 T 2 join size. DB 2 for z/OS Query Optimization Ó IBM Corporation 2010

35 Multi-column cardinality • Matching + Screening SELECT FROM WHERE AND cols T 1. C 1 = ? T 1. C 3 = ? T 1. C 4 = ? INDEX I 2 (C 1, C 2, C 3) on table T 1 COLCARDF / FIRSTKEYCARDF for matching only COLGROUP (C 1, C 3) matching + screen COLGROUP (C 1, C 3, C 4) table level filtering DB 2 for z/OS Query Optimization Ó IBM Corporation 2010

36 Filter factor issues (reminder) • Filter factor accuracy important for. . . – Index matching • Estimate index cost – Total index filtering • Estimate table access cost via index(es) • Choose how to use index (prefetch? ) – Table filtering • Efficient join order • Efficient join method • Appropriate sorts DB 2 for z/OS Query Optimization Ó IBM Corporation 2010

37 Multi-column frequency • Multi-column frequencies – Very similar to single column frequencies • Distribution statistics concatenated column group values • Identifies multi-column skewed distributions – Stored in • SYSCOLDIST. FREQUENCYF • TYPE = ‘F’ • NUMCOLUMNS > 1 DB 2 for z/OS Query Optimization Ó IBM Corporation 2010

38 RUNSTATS INDEX • How to collect – V 7 RUNSTATS only collects on leading concatenated column of index – RUNSTATS does NOT collect multi-column frequencies by default. – Must be explicitly requested. INDEX (I 1) columns (C 1, C 2, C 3) – Collect top 15 values for column group (C 1, C 2) RUNSTATS INDEX (I 1 FREQVAL NUMCOLS(2) COUNT(15)) – Eliminate frequencies on column group (C 1, C 2): RUNSTATS INDEX (I 1 FREQVAL NUMCOLS(2) COUNT(0)) DB 2 for z/OS Query Optimization Ó IBM Corporation 2010

39 RUNSTATS COLGROUP • DB 2 V 8 allows collection of multi-column frequencies on almost any column group – Examples RUNSTATS TABLESPACE DB 1. TS 1 TABLE (T 1) COLUMN(C 1, C 2, C 3) COLGROUP(C 1, C 3) FREQVAL COUNT(10) MOST COLGROUP(C 2, C 3) FREQVAL COUNT(1) LEAST – Eliminate frequencies on column group (C 1, C 3): COLGROUP (C 1, C 3) FREQVAL COUNT(0) DB 2 for z/OS Query Optimization Ó IBM Corporation 2010

40 Multi-column frequency • Multi-column frequencies – Limited use • Boolean equal predicates only • Always collect supporting multi-column cardinality – Collect single column frequencies for • • Range predicates In-lists Single column predicates other non-equal predicates DB 2 for z/OS Query Optimization Ó IBM Corporation 2010

41 Multi-column frequency Gender COLCARDF = 2 Category COLCARDF = 4 (Gender, Category) MCARD = 8 Category Gender Women's Health F 0. 2375 0. 125 Women's Health M 0. 0125 0. 125 Men's Health M 0. 2375 0. 125 Men's Health F 0. 0125 0. 125 Hockey M 0. 15 0. 125 Hockey F 0. 10 0. 125 Soccer F 0. 15 0. 125 Soccer M 0. 10 0. 125 DB 2 for z/OS Query Optimization FREQUENCYF 1/MCARD Ó IBM Corporation 2010

42 Multi-column frequency trap • Assumptions – – Table T 1 cardinality 150, 000 Index I 1 (GENDER, ACCT_NUM) COLCARDF GENDER 2 COLCARDF ACCT_NUM 120, 000 • SQL SELECT * FROM T 1 WHERE GENDER = ‘M’ • Global RUNSTATS with FREQVAL used – RUNSTATS INDEX (ALL) FREQVAL NUMCOLS (6) COUNT(10) – (Don’t do this!) DB 2 for z/OS Query Optimization Ó IBM Corporation 2010

Useless multi-column frequency 43 FREQUENCYF GENDER ACCT_NUM M 1541235 0. 00008 F 81235135 0. 00008 M 423613246 0. 00008 F 32151234 0. 00008 M 76823414 0. 00008 M 43451235 0. 00008 F 1235 0. 00008 • Useless because… – Frequencies on GENDER alone either non-existant, or collected long ago and becoming stale – Multi-column frequency on (GENDER, ACCT_NUM) not used since only one column exists – Even if frequency could be used – ACCT_NUM is so selective, the frequencies are too diluted to tell us anything about GENDER alone DB 2 for z/OS Query Optimization Ó IBM Corporation 2010

44 Multi-column considerations • Cardinality – Collect KEYCARD for all indexes with 3 or more columns – Literal values not required • Frequencies – – – Not useful when literal values aren't known Dynamic SQL prepared with parameter markers Static SQL with hostvars (consider REOPT) Special registers (consider REOPT) Collect for specific cases • Pay special attention to. . . – Low cardinality column groups – Volatile data DB 2 for z/OS Query Optimization Ó IBM Corporation 2010

45 Join considerations Select T 1. C 4, T 2. C 6 from T 1, T 2 Where T 1. C 1 = T 2. C 1 AND T 1. C 2 = 50 AND T 2. C 3 = 'A' AND T 2. C 4 = 'B' T 1 & T 2 CARDF = 10 million T 1. C 2 COLCARDF = 30 T 2. C 3 COLCARDF = 10 T 2. C 4 COLCARDF = 15 Unique indexes T 1. C 1 T 2. C 1 • Which table should be outer? – Hmmm, low cardinality predicates. . – Significant opportunity for misestimation on both T 1 and T 2 – could lead to poor choice. • The more tables in the join and the more predicates involved - the more important this becomes DB 2 for z/OS Query Optimization Ó IBM Corporation 2010

46 One more time • Filter factor accuracy important for. . . – Index matching • Estimate index cost – Total index filtering • Estimate table access cost via index(es) • Choose how to use index (prefetch? ) – Table filtering • Efficient join order • Efficient join method • Appropriate sorts DB 2 for z/OS Query Optimization Ó IBM Corporation 2010