Introduction to Flexviews 1 5 Improving database performance

Introduction to Flexviews 1. 5 Improving database performance using incrementally refreshable materialized views http: //flexviews. sourceforge. net Flexviews authored by: Justin Swanhart Copyright 2007 -2009 Flexviews released under the LGPL version 3

Introduction v. Who am I? v. What do I do? v. Why did I write Flexviews?

Requirements v My. SQL 5. 1. 30 or greater v binlog_format=ROW v User with REPLICATION SLAVE and/or SUPER v PHP 5. 2 (optionally pcntl and mysqli) v Works best on a My. SQL slave v Basic SQL knowledge v log_slave_updates=true

What are views? v A view is a SQL statement that is saved in the database and executed on demand. Views have a “first class” name and can be referenced as if they were tables themselves. v Because they abstract SQL statements and can be used for security, views are considered very convenient. When used properly views can add tremendous value to the database, but when used incorrectly they have many well known performance problems: v Queries can take a lot of time. v Because a view is not an actual table it can’t be indexed. v Because the query is executed each time, the query plan may change unexpectedly. v A temporary table is almost always required.

What are materialized views? v In an RDBMS, a query is “materialized” when its contents are stored (cached) in a table. v A traditional view represents a SQL statement which is executed each time the view is accessed. In contrast, a materialized view is a view which contains the result of a query from some point in time in the past. It is essentially a table which contains the cached results of a query. v To be useful, a materialized view should be able to be updated incrementally. This means that the view should be updated in place to reflect changes to the base tables. v Base tables are the underlying tables accessed by a view.

Oracle Materialized Views v Uses special CREATE MATERIALIZED VIEW. . SELECT syntax. v Uses materialized view logs. v Built on extensible Oracle features and stored procedures (part of Oracle) v Supports SELECT/PROJECT/AGGREGATE/JOIN queries. v Supports three different refresh methods v IMMEDIATE refresh updates a view as soon as the underlying base table changes. Incurs high cost to each transaction. v FULL refresh. This is an asynchronous method where the table is rebuilt completely from scratch. The view is usually unavailable during rebuild. v FAST refresh. This is an asynchronous method where the table upated only with the changes to base tables since the view was last refreshed. v Was the first commercial database to implement materialized views.

DB 2 Materialized Query Tables (MQT) v Uses an extended CREATE TABLE AS SELECT syntax with extensions for specifying MQT options such as refresh method. v Uses MQT staging tables v Built on extensible DB 2 features (part of DB 2) v Supports SELECT/PROJECT/AGGREGATE/JOIN queries v Supports three different refresh methods v Immediate refresh updates a view as soon as the underlying base table changes. Incurs high cost to each transaction. v INCREMENTAL refresh. This is an asynchronous method where the MQT is updated from a staging table. v NON-INCREMENTAL refresh. This is an asynchronous method where the contents are rebuilt from scratch.

My. SQL v My. SQL features CREATE TABLE AS SELECT (CTAS) syntax which will build a table from the results of a SELECT statement. v A CTAS created table carries no link to the source tables and can not be “refreshed”. v Lacking any refresh capability, tables created with CTAS may be difficult to manage. v As far as I know, there are no free or open source tools available (except Flexviews) which support asynchronously maintaining materialized views, particularly those which contain joins.

Flexviews adds materialized views to My. SQL v Flexviews uses stored procedures to create and manage materialized views which can be incrementally refreshed. v It includes a “change data capture” tool. This tool uses My. SQL commands and invokes the ‘mysqlbinlog’ tool to collect and decode RBR binary logs and uses that information to insert information into table change logs. v The refresh process reads from change logs and base tables to update the view to some point in time after it was created or last refreshed.

Flexviews v My. SQL doesn’t expose the SQL parser. As such, Flexviews features an API for building queries. v The API is simple enough such that if you can write basic SQL statements you can create a similar view. v Supports SELECT/PROJECT/AGGREGATE/JOIN queries. v Built on extensible My. SQL features (the binary log) and stored procedures. v Supports two different refresh methods v INCREMENTAL refreshes asynchronously from change logs and delta tables. v COMPLETE refresh. The contents are rebuilt from scratch and atomically replaced with RENAME TABLE.

Demo of the binlog consumer in action My. SQL server writes binary logs in ROW format Flexviews binlog consumer v The binlog consumer is a PHP script which runs in the background. v All changes are logged as an INSERT or a DELETE. UPDATE is logged as DELETE followed by INSERT. v. Only tables selected for change logging will be processed. Change logs are written Into the database. “Unit of work” table is maintained v. The consumer must be running in order to refresh materialized views.

Why not use triggers instead of the binlog consumer? v Triggers can not capture discreet transaction order without external workload serialization. v They require accessing the information_schema for every row change which incurs extreme performance penalties. v The cost of every transaction is increased since table change logs are written synchronously. v Triggers impose serious performance overheard. v Triggers require the serializable isolation level to ensure valid results. v Table change logs can not be captured from a remote My. SQL server using triggers.

Flexviews Asynchronous Refresh v The refresh algorithm executes in two stages v Delta computation (COMPUTE phase) v Delta application (APPLY phase) v The COMPUTE phase is executed independently from the APPLY phase. v You can COMPUTE changes any number of times before applying them. v The work of maintaining the view is split up into many small COMPUTE jobs, each which process only the rows which have changed since the last computation. v You have the option of refreshing a group of views to the same exact transactional state.

Flexview Delta Application v Periodically, rows in the delta tables can be applied to the Flexview. v A combination of INSERT. . ON DUPLICATE KEY UPDATE and DELETE are used to bulk INSERT, UPDATE and DELETE rows from the Flexview. v You may optionally include a UOW_ID which represents a point in time to which to refresh the view. Refreshing multiple views to the same UOW_ID will make them transactionally consistent with each other. v Views may only be rolled forward to a higher UOW_ID. You may not roll a view backward in time (yet).

Aggregation v In order to achieve top query performance aggregate tables must be employed. v For example, in a data mart with sales fact data, there may be aggregates calculated by customer by day, or by salesrep by month. v Flexviews supports SUM and COUNT and experimentally supports AVG/MIN/MAX and COUNT(DISTINCT). v Additional support for GROUP_CONCAT and the VAR_SAMP and VAR_POP functions coming soon. v Using aggregate functions outside of SUM/COUNT/AVG incurs additional storage and processing costs, so use these functions sparingly. v Views with aggregation can also include joins.

Using an aggregate view Finding the top 10 most popular items by sales. Using the Flexview Using the fact table SELECT item_id, net_amt, CNT FROM item_sales_mv ORDER BY CNT DESC LIMIT 10; SELECT f. item_id, sum(f. qty * f. amt) net_amt, count(*) CNT FROM fact f WHERE f. status = ‘CLOSED’ GROUP BY f. item_id ORDER BY CNT desc LIMIT 10;

Simple aggregation example CALL flexviews. create('large', 'item_sales_mv', 'INCREMENTAL'); SET @mvid : = LAST_INSERT_ID(); -- Add the fact table CALL flexviews. add_table(@mvid, 'large', 'fact', 'f', NULL); -- Add a GROUP BY expression CALL flexviews. add_expr(@mvid, 'GROUP', 'f. item_id', 'status'); -- Add a SUM aggregate expression CALL flexviews. add_expr(@mvid, 'SUM', 'qty * amt', 'net_amt'); -- Add a COUNT aggregate expression CALL flexviews. add_expr(@mvid, 'COUNT', '*', 'CNT'); -- Add a WHERE clause CALL flexviews. add_expr(@mvid, 'WHERE', 'f. status="CLOSED"', 'where'); -- Create the view CALL flexviews. enable(@mvid);

JOINS v In a STAR schema, the most common aggregates include joins to dimension tables. This provides the means for slicing data by customer group, or by item type or date, etc. v Performance can be improved by pre-calculating the JOIN between tables and/or aggregating data. v Pre-compute master/detail relationships for improved report performance. v By including additional columns which are commonly filtered on performance can be improved significantly. v Maintaining such a table without Flexviews is difficult

Join with aggregation view Item sales by day and by item category. Using the Flexview SELECT * FROM item_category_month_mv Using the base tables SELECT i. item_category as category, d. year as year, d. month as month, SUM(qty * amt) as net_amt, COUNT(*) as CNT FROM large. fact as f JOIN large. dim_item as i ON i. item_id = f. item_id JOIN large. dim_date as d ON d. date_id = f. date_id WHERE f. status="CLOSED" GROUP BY i. item_category, d. year, d. month

Join with aggregation example -- Create the materialized view placeholder CALL flexviews. create('large', 'item_category_month_mv', 'INCREMENTAL'); SET @mvid : = LAST_INSERT_ID(); -- Add the fact table CALL flexviews. add_table(@mvid, 'large', 'fact', 'f', NULL); CALL flexviews. add_table(@mvid, 'large', 'dim_item', 'i', 'ON i. item_id = f. item_id'); CALL flexviews. add_table(@mvid, 'large', 'dim_date', 'd', 'ON d. date_id = f. date_id'); -- Add a GROUP BY expression CALL flexviews. add_expr(@mvid, 'GROUP', 'i. item_category', 'category'); CALL flexviews. add_expr(@mvid, 'GROUP', 'd. year', 'year'); CALL flexviews. add_expr(@mvid, 'GROUP', 'd. month', 'month'); CALL flexviews. add_expr(@mvid, 'SUM', 'qty * amt', 'net_amt'); CALL flexviews. add_expr(@mvid, 'COUNT', '*', 'CNT'); -- Add a WHERE clause CALL flexviews. add_expr(@mvid, 'WHERE', 'f. status="CLOSED"', 'where'); -- Create the view CALL flexviews. enable(@mvid);

Create functional and partial Indexes v DB 2 and Oracle and other databases have added support in recent years for “function based” or “partial” indexes. These databases have extensible features that essentially allow a materialized view to act as an index. v My. SQL lacks such extensibility features, but you can do it manually. This works best for batch processes, or processes which will tolerate some data staleness, since all Flexviews are refreshed asynchronously. v Eventually, a query rewrite proxy may allow such views to be exploited automatically. v When are they useful? v A column with a skewed cardinality. For instance, an order_line table where 99% of the ‘status’ values are something other than ‘CLOSED’. In such a situation it is valuable to index only the values other than ‘CLOSED’, since the ‘CLOSED’ records will inflate an index on the column as the ‘CLOSED’ keys would be of little use. v You want to index the output of any DETERMINISTIC function on one or more database columns.

Special Thanks To Salem, K. , Beyer, K. , Lindsay, B. , and Cochrane, R. 2000. How to roll a join: asynchronous incremental view maintenance. SIGMOD Rec. 29, 2 (Jun. 2000), 129 -140. DOI= http: //doi. acm. org/10. 1145/335191. 335393 Mumick, I. S. , Quass, D. , and Mumick, B. S. 1997. Maintenance of data cubes and summary tables in a warehouse. SIGMOD Rec. 26, 2 (Jun. 1997), 100 -111. DOI= http: //doi. acm. org/10. 1145/253262. 253277 My former employer Ad. Brite, Inc http: //www. adbrite. com who generously supported the development and release of Flexviews under the LGPL license. Gazillion, Inc http: //www. gazillion. com my current employer, who has allowed me to devote a good portion of my time to improving Flexviews. Source. Forge, who hosts the SVN repository, web site and support forums for Flexviews. Proven Scaling, LLC, and Jeremy Cole, in particular.

QUESTIONS
- Slides: 23