VERTICA DB Optimization Level 1 Storage Types Row

VERTICA: DB Optimization Level 1

Storage Types: Row vs Column Table: Computer memory is structured linearly, so there are two options: 2

Storage Types: Data Access Column Store • Only affected columns are read during selection o Affected: selected, used in where, used in join, etc. Row Store • Entire row must be read during execution select sum(revenue) from table where date >= 01/02/2016 3

Storage Types: Data Retrieval select * from table where date = 01/02/2016 Column Store • By using “selecting *” on entire tables, you eliminate most of the benefits of column store databases 4

Storage Types: Data Retrieval select date, sum(revenue) from table where date >= 01/02/2016 Column Store • In this example, column store read significantly less than row store would. 5

Create: Primary Key When do I need a PK? • Always… • Very rare occasions when this is not true, but not having them is a bad habit What is it that they do here? • Help validate your data for duplicates o Vertica allows you to load it, but won’t let you join to the table o Gives you the ability to choose which record is correct and remedy the underlying issue • Joins o Duplicated data on a correct join o Especially important on dimension tables • Segmentation o Helps with distribution across nodes o Helps avoid high data skew (ex: node 01 has way more data than node 02) Primary Key Compound Primary Key create table bi_depot. example_table ( event_dt date, id_1 int 8 primary key, id_2 int 8, id_3 int 8, imps int 8, rev numeric ) create table bi_depot. example_table ( event_dt date, id_1 int 8, id_2 int 8, id_3 int 8, imps int 8, rev numeric, primary key (event_dt, id_1) ) order by event_dt 6

Create: Projections Super Projection • Stores all columns in a table • Physical structures stored on “disk” • Structure of super projections and partial projections affect query and load speed • Ex: Order, Partitioning, Segmentation Partial Projection • Can contain subsets of data to speed querying • More projections will increase disk space requirements and load times • Remember – physical presence, so must be loaded • You probably shouldn’t do this. . . This is just an FYI But I have never built a projection… • You have! Super Projections are created with your table • If you do not specify the projection criteria in your create, Vertica will. Our focus today? • Ordering • Segmentation • Partitioning Why is it important? • Query performance! 7

Create: Ordering Projections What does it do? • HUGE effect on query performance • Also affects Vertica’s compression Best Practices • Order by lowest cardinality column that is included in most queries • Fact table ordering is an art, not a science • Dimension tables and oversimplified examples are easier… 8

Create: Ordering Projections Lowest Cardinality column that is most likely to be in the “where”… Order first by the “standing”. 9

Create: Ordering Projections What would you order by next? 10

Create: Ordering Projections Gender? It is next in line according to cardinality, but how often will it be queried? Note: If this were a table that a real college were using, I would order by this next… 11

Create: Ordering Projections Graduation year probably makes the most sense to go next if this is a smaller table. It will likely be queried FAR more often than gender, but this is why ordering is an art! 12

Create: Ordering Projections create table bi_depot. example_table ( event_dt date, id_1 int 8, id_2 int 8, id_3 int 8, imps int 8, rev numeric, primary key (event_dt, id_1) ) order by event_dt FACTS: Almost always makes most sense to order first by date. DIMS: Almost always makes most to order by the PK 13

Create: Order Example IRL Ordering is not a science. There is always going to be an argument for a different sort, but think it through because it makes a HUGE difference! Note: PK does not have to match the order. I just like to arrange the columns by my sort. Other Note: Notice that not all my columns are included in the sort. You don’t have to include them all. Date first, but then the MOST queried LOWEST cardinality dim. Cardinality is not the only consideration. This column has 4 values, but is rarely queried. When a column is uniquely derived form a lower cardinality column, always go low. EX: state_id is better than city_id 14

Segmentation: What is it? Nodes want to help each other out… Don’t stand in the way! 15

Segmentation: What is it? What does it do? • • Defines how data is spread among cluster nodes for fast data purges and query performance. The goal is to distribute data evenly across the database nodes so that all nodes can participate (somewhat evenly) in query execution. Best Practices • Segment by high cardinality columns • NO NULLS IMPORTANT • All columns with the same value end up on the same node. Example • You have a 10 Node DB • You segment on a column with 5 unique values Result = 5 nodes with data, 5 with nothing. You have effectively cut your processing power in half Pop Quiz • Why would it make sense to leave dimension tables “unsegmented” across all nodes? 16

Segmentation: Check your work Query select node_name, projection_name, row_count from projection_storage where anchor_table_name = 'mktplc_total_supply_day' and lower(anchor_table_schema) = 'bi_depot' and projection_name ilike '%_b 0' order by 2, 1 17

Partitioning: What is it? What does it do? • Organize data into storage containers • Improves query performance Best Practices • Integer based on the date • Match your retention/restate interval Note • Partitioning can get pretty complicate • For level 2, we are going to stop here 18

Level 3: You ready? Contents • Tuple Mover • ROS Containers • WOS Containers • Data load optimization • Query Optimization • Explain (Verbose) • Reading/Understand Query Plans • V_CATALOG Schema • V_MONITOR Schema 19

QUESTIONS