Vector and Matrix Operations Evaluated with UserDefined Functions

Outline • • Research overview Motivation SQL limitations UDFs Scalar and aggregate UDFs Experiments

Research overview: Relational database systems • Query optimization: query rewriting, linear recursive queries, indexing

Research overview: Machine learning • • Clustering: K-means, EM, streams Dimensionality reduction: PCA and

Motivation • Perform statistical and machine learning analysis inside a DBMS • Users extract

SQL limitations • • • Relational model Limited array support Run-time overhead Set-oriented Not

UDF advantages • • Programmed in C Called in a SELECT statement API provided

UDFs limitations • • • In general, simple data types OS and DBMS architecture

Alternatives to UDFs • Inside DBMS – Internal C code in the DBMS –

Scalar UDFs • • Vectorial sum Dot product Distance argmin

Scalar UDFs application • Regression equations – Linear – Binary logistic • Distance computation

Aggregate UDFs • Best for functions that require a table scan on an entire

Aggregate UDFs for linear models • Descriptive models • Predictive models • Statistical tests

Aggregate UDF optimizations • • Arrays Diagonal/triangular/full matrix Parameter passing on the stack Full

Experiments • Compare SQL and UDFs • SQL standard aggregations on pivoted version of

UDF: sum, argmin • Much faster than standard SQL aggregation • Slightly slower than

UDF: sufficient statistics alternatives in SQL • Simulate matrices with one entry per row

UDF aggregate • One std aggregation significantly slower • UDF is faster than arithmetic

UDF run-time execution • Scalar UDF bottleneck: disk I/O • Aggregate UDF bottleneck: memory

UDF scalability • Linear in n • Linear in d for scalar UDF •

Conclusions • • • Extend DBMS with mathematical operators UDF called in SQL queries

Slides: 22

Download presentation

Vector and Matrix Operations Evaluated with User-Defined Functions Data Warehousing

Outline • • Research overview Motivation SQL limitations UDFs Scalar and aggregate UDFs Experiments Conclusions

Research overview: Relational database systems • Query optimization: query rewriting, linear recursive queries, indexing • Integration of statistical and machine learning algorithms: linear models, statistical tests, decision trees, Bayesian classifiers • OLAP: cubes, multi-level aggregations, pivoting, compression • Database cleaning: detecting and repairing errors • Locking in a parallel DBMS

Research overview: Machine learning • • Clustering: K-means, EM, streams Dimensionality reduction: PCA and Max Llh FA Regression: linear, logistic, non-linear models Bayesian classifiers: Naïve, Nets, MDA Association rules for prediction Medical data analysis Other topics: image mining, neural nets

Motivation • Perform statistical and machine learning analysis inside a DBMS • Users extract data sets to specialized statistical or mathematical tools • SQL is the standard language • A DBMS is a complex software system • Focus: DBMS architecture and scalability

SQL limitations • • • Relational model Limited array support Run-time overhead Set-oriented Not an imperative programming language

UDF advantages • • Programmed in C Called in a SELECT statement API provided by DBMS User does not need modifying DBMS internal code

UDFs limitations • • • In general, simple data types OS and DBMS architecture dependent Limited stack memory space Heap memory only for aggregate UDFs No I/O capability Null handling Parallel execution Compile-time memory allocation Not uniform across different DBMSs

Alternatives to UDFs • Inside DBMS – Internal C code in the DBMS – SQL code generation: arithmetic expressions or SQL existing aggregations • Outside DBMS – Statistical or machine learning packages – Statistical libraries called in C++/Java

Scalar UDFs • • Vectorial sum Dot product Distance argmin

Scalar UDFs application • Regression equations – Linear – Binary logistic • Distance computation in clustering or nearest neighbor classifiers • Multivariate pdfs

Aggregate UDFs • Best for functions that require a table scan on an entire table • Partition induced on data set with groups • Sufficient statistics

Aggregate UDFs for linear models • Descriptive models • Predictive models • Statistical tests

Sufficient statistics: SQL and C

Aggregate UDF optimizations • • Arrays Diagonal/triangular/full matrix Parameter passing on the stack Full parallel execution

Experiments • Compare SQL and UDFs • SQL standard aggregations on pivoted version of input table • Teradata Parallel DBMS

UDF: sum, argmin • Much faster than standard SQL aggregation • Slightly slower than arithmetic expressions

UDF: sufficient statistics alternatives in SQL • Simulate matrices with one entry per row with pivoted table • Aggregation list with all matrix on one row • UDF

UDF aggregate • One std aggregation significantly slower • UDF is faster than arithmetic expressions

UDF run-time execution • Scalar UDF bottleneck: disk I/O • Aggregate UDF bottleneck: memory

UDF scalability • Linear in n • Linear in d for scalar UDF • almost “flat” in d for aggregate UDF

Conclusions • • • Extend DBMS with mathematical operators UDF called in SQL queries Group-by capability Great performance Several UDF limitations can be solved