Statistical Analysis with Oracle Marc Daniau Product group

  • Slides: 51
Download presentation
Statistical Analysis with Oracle Marc Daniau – Product group marc. daniau@sap. com

Statistical Analysis with Oracle Marc Daniau – Product group marc. daniau@sap. com

Introduction The ANSI SQL offers analytical capabilities useful for advanced data analysis like linear

Introduction The ANSI SQL offers analytical capabilities useful for advanced data analysis like linear regression and correlation. Beyond supporting the standard SQL functions Oracle provides additional vendor-specific functions that enable to perform statistical analysis such as descriptive statistics and hypothesis testing. In the following presentation we will put into practice the main statistical built-in functions from Oracle 10 g release 2. We will also show an example of a userdefined function. 2 COPYRIGHT © 2008 BUSINESS OBJECTS S. A. ALL RIGHTS RESERVED.

Topics Descriptive Statistics Linear Regression Correlation Performance measurement Distribution analysis Hypothesis Testing 3 COPYRIGHT

Topics Descriptive Statistics Linear Regression Correlation Performance measurement Distribution analysis Hypothesis Testing 3 COPYRIGHT © 2008 BUSINESS OBJECTS S. A. ALL RIGHTS RESERVED.

Summarizing a data set The built-in SQL functions involved in the following sample are

Summarizing a data set The built-in SQL functions involved in the following sample are avg, median, stats_mode, variance, stddev, min, max and percentile_cont. Note there is no built-in functions for skew and kurtosis; we had to use custom SQL expressions. 4 COPYRIGHT © 2008 BUSINESS OBJECTS S. A. ALL RIGHTS RESERVED.

The BOBJ Universe Here is a sample Universe exposing the stats_* functions. New functions

The BOBJ Universe Here is a sample Universe exposing the stats_* functions. New functions can be added to the list by updating the ‘oracle. prm’ file. 5 COPYRIGHT © 2008 BUSINESS OBJECTS S. A. ALL RIGHTS RESERVED.

Box Plot The quartiles are computed using the Oracle SQL function percentile_cont() within group

Box Plot The quartiles are computed using the Oracle SQL function percentile_cont() within group (order by ()). 6 COPYRIGHT © 2008 BUSINESS OBJECTS S. A. ALL RIGHTS RESERVED.

A real-life example: GE Delivery span The delivery span focuses on the time between

A real-life example: GE Delivery span The delivery span focuses on the time between when a customer requested the product and when it was delivered. The span is a measure of variation similar to the interquartile range but instead of looking at the middle 50% of the observations, it looks at the middle 90%. Span = 95 th Percentile – 5 th Percentile The Oracle SQL functions to compute the span are: percentile_cont(0. 95) 7 and percentile_cont(0. 05). COPYRIGHT © 2008 BUSINESS OBJECTS S. A. ALL RIGHTS RESERVED.

A real-life example: GE Delivery span The goal is to squeeze the two sides

A real-life example: GE Delivery span The goal is to squeeze the two sides of the delivery span, days early & days late, ever closer to the center: the exact day the customer desired. Reducing variation is what quality is about. Days Late Days Early 8 COPYRIGHT © 2008 BUSINESS OBJECTS S. A. ALL RIGHTS RESERVED.

Topics Descriptive Statistics Linear Regression Correlation Performance measurement Distribution analysis Hypothesis Testing 9 COPYRIGHT

Topics Descriptive Statistics Linear Regression Correlation Performance measurement Distribution analysis Hypothesis Testing 9 COPYRIGHT © 2008 BUSINESS OBJECTS S. A. ALL RIGHTS RESERVED.

Simple Linear Regression Studying the relationship of two time series The built-in SQL functions

Simple Linear Regression Studying the relationship of two time series The built-in SQL functions used are regr_* and stats_*. In this Desk. I document, it appears that X and Y increase together. 10 COPYRIGHT © 2008 BUSINESS OBJECTS S. A. ALL RIGHTS RESERVED.

Simple Linear Regression Checking the normality of the residuals The SQL function row_number() over()

Simple Linear Regression Checking the normality of the residuals The SQL function row_number() over() is used. A custom PL/SQL function inverse_phi() is employed for building the X axis; an alternative consists of using a lookup Z table. 11 COPYRIGHT © 2008 BUSINESS OBJECTS S. A. ALL RIGHTS RESERVED.

User defined functions The Oracle user can extend the list of built-in functions with

User defined functions The Oracle user can extend the list of built-in functions with his own PL/SQL functions. Following are examples of custom functions. The custom function inverse_phi(). 12 COPYRIGHT © 2008 BUSINESS OBJECTS S. A. ALL RIGHTS RESERVED.

User defined functions Custom functions can be exposed in the BOBJ SQL editor by

User defined functions Custom functions can be exposed in the BOBJ SQL editor by adding them in the ‘oracle. prm’ file. The custom function inverse_phi(). 13 COPYRIGHT © 2008 BUSINESS OBJECTS S. A. ALL RIGHTS RESERVED.

Simple Linear Regression Detecting Outliers (unusual values) The variables X and Y in this

Simple Linear Regression Detecting Outliers (unusual values) The variables X and Y in this example are not time series. Mortality (Y) tends to decrease as education (X) increases. The limits are computed with the regr_* SQL functions. Ouliers appear in red. 14 COPYRIGHT © 2008 BUSINESS OBJECTS S. A. ALL RIGHTS RESERVED.

Simple Linear Regression Detecting Outliers (unusual values) After changing the sigma factor from 2

Simple Linear Regression Detecting Outliers (unusual values) After changing the sigma factor from 2 to 2. 5, New Orleans is no longer beyond limits. 15 COPYRIGHT © 2008 BUSINESS OBJECTS S. A. ALL RIGHTS RESERVED.

Simple Non Linear Regression Some simple non-linear relationships can be transformed into linear relationships.

Simple Non Linear Regression Some simple non-linear relationships can be transformed into linear relationships. Every time you can transform a relationship into an equation of the form Y = a + b X , you can use the least squares method to fit the data. This technique is used for fitting curves such as exponential, logarithmic, power, hyperbola, logistic (Pearl) and Gompertz. 16 COPYRIGHT © 2008 BUSINESS OBJECTS S. A. ALL RIGHTS RESERVED.

Simple Non Linear Regression Fitting a S-shaped curve (Pearl) The sample below plots the

Simple Non Linear Regression Fitting a S-shaped curve (Pearl) The sample below plots the cumulative sales by month for a given product. The X and Y variables are computed using respectively row_number() over and sum()) over(). The ln() function is used for transformation; regr_slope() and regr_intercept() for getting the regression coefficients. 17 COPYRIGHT © 2008 BUSINESS OBJECTS S. A. ALL RIGHTS RESERVED.

Simple Non Linear Regression Fitting a S-shaped curve (Gompertz) The error values are computed

Simple Non Linear Regression Fitting a S-shaped curve (Gompertz) The error values are computed within the BOBJ report. The number of periods to forecast 18 COPYRIGHT © 2008 BUSINESS OBJECTS S. A. ALL RIGHTS RESERVED.

Weighted Least Squares (local regression) Smoothing a time series The moving slope and intercept

Weighted Least Squares (local regression) Smoothing a time series The moving slope and intercept are calculated on the RDBMS side. The user chooses the strength of the smoother (moving window size). 19 COPYRIGHT © 2008 BUSINESS OBJECTS S. A. ALL RIGHTS RESERVED.

Multiple Linear Regression Forecasting The regression coefficients are calculated on the fly by the

Multiple Linear Regression Forecasting The regression coefficients are calculated on the fly by the rdbms using regr_* functions. 20 COPYRIGHT © 2008 BUSINESS OBJECTS S. A. ALL RIGHTS RESERVED.

Topics Descriptive Statistics Linear Regression Correlation Performance measurement Distribution analysis Hypothesis Testing 21 COPYRIGHT

Topics Descriptive Statistics Linear Regression Correlation Performance measurement Distribution analysis Hypothesis Testing 21 COPYRIGHT © 2008 BUSINESS OBJECTS S. A. ALL RIGHTS RESERVED.

Checking for auto-correlation The Lag Plot The lag plot helps uncover seasonality and patterns

Checking for auto-correlation The Lag Plot The lag plot helps uncover seasonality and patterns from sequential data (e. g. time series) that may have been missed by looking only at a line chart. The lag plot is useful to check auto-correlation (lack of independence in the series). It consists of a scatter diagram plotting Yt on the vertical axis versus Yt-lag on the horizontal axis. The Oracle SQL function lag(, ) over() is illustrated next. 22 COPYRIGHT © 2008 BUSINESS OBJECTS S. A. ALL RIGHTS RESERVED.

Checking for auto-correlation The Lag Plot The sample line chart below tells us there

Checking for auto-correlation The Lag Plot The sample line chart below tells us there is no shift of average over time. The lag plot (lag 1) on the right shows a pattern that cannot be seen in the line chart. Note that outliers (unusual values) appear more clearly on the lag plot. 23 COPYRIGHT © 2008 BUSINESS OBJECTS S. A. ALL RIGHTS RESERVED.

Auto-correlation function (ACF) While the lag plot displays the individual data points for a

Auto-correlation function (ACF) While the lag plot displays the individual data points for a given lag, the ACF plot gives a summary picture of autocorrelation over multiple lags. The ACF plot is useful in identifying seasonal or cyclical patterns in a time series. It shows the correlation coefficient statistic on the vertical axis over the different lags on the horizontal axis. The Oracle SQL functions regr_* are used in the following example. 24 COPYRIGHT © 2008 BUSINESS OBJECTS S. A. ALL RIGHTS RESERVED.

ACF Plot The example below is the ACF plot of the monthly Australian beer

ACF Plot The example below is the ACF plot of the monthly Australian beer production. The thin bars represent the correlation coefficients. The dotted flat lines are the significance limits. The peak on lag 12 indicates seasonality. 25 COPYRIGHT © 2008 BUSINESS OBJECTS S. A. ALL RIGHTS RESERVED.

Topics Descriptive Statistics Linear Regression Correlation Performance measurement Distribution analysis Hypothesis Testing 26 COPYRIGHT

Topics Descriptive Statistics Linear Regression Correlation Performance measurement Distribution analysis Hypothesis Testing 26 COPYRIGHT © 2008 BUSINESS OBJECTS S. A. ALL RIGHTS RESERVED.

Comparison types Organizations measure performance of individual entities such as products, geographies, suppliers or

Comparison types Organizations measure performance of individual entities such as products, geographies, suppliers or employees by making different types of comparisons : Over time Against pre-determined goals Against a comparator group 27 COPYRIGHT © 2008 BUSINESS OBJECTS S. A. ALL RIGHTS RESERVED.

Evaluation against a group Most of the examples presented in this section illustrate how

Evaluation against a group Most of the examples presented in this section illustrate how to evaluate the performance of an individual entity against a comparator group. We will see various methods: Percent rank T-score Percent of leader Range score Percent of total Individual ratio versus group ratio 28 COPYRIGHT © 2008 BUSINESS OBJECTS S. A. ALL RIGHTS RESERVED.

Percent Rank The Web. I document below ranks countries on economic performance. It involves

Percent Rank The Web. I document below ranks countries on economic performance. It involves the following Oracle SQL functions: rank() over() 29 and percent_rank() over(). COPYRIGHT © 2008 BUSINESS OBJECTS S. A. ALL RIGHTS RESERVED.

T-score Comparing against the group average The Xcelsius radar chart here requires the avg()

T-score Comparing against the group average The Xcelsius radar chart here requires the avg() over() and stddev_pop() over() SQL functions. The data set consists of 7 indicators and 7 countries. The blue line represents the group average normalized at value 50. The amber line corresponds to the country T-score (here USA). The slicer allows the user to see the performance of the different countries in the G 7 group. 30 COPYRIGHT © 2008 BUSINESS OBJECTS S. A. ALL RIGHTS RESERVED.

Percent of Leader Comparing against the best in the group The sample document below

Percent of Leader Comparing against the best in the group The sample document below rates UK versus other nations on educational research. It requires the max() over() SQL functions. The underlying data set covers 35 nations and includes 7 indicators within a period from 1987 to 1998. 31 COPYRIGHT © 2008 BUSINESS OBJECTS S. A. ALL RIGHTS RESERVED.

Range score Positioning the individual in the group range The report below evaluates companies

Range score Positioning the individual in the group range The report below evaluates companies performance based on 3 indicators. The range score normalizes the results before computing a composite score. It involves the SQL functions min() over() and max() over(). We apply a weighted average to obtain a composite score. This is done within the BOBJ document. 32 COPYRIGHT © 2008 BUSINESS OBJECTS S. A. ALL RIGHTS RESERVED.

Percent of Total Even though it is limited to additive indicators such as revenue,

Percent of Total Even though it is limited to additive indicators such as revenue, cost or number of customers, the percent of group total is commonly used. The Pareto chart is an example of graph that displays the percentage of each entity (represented as bars) relative to the total value of the group. It helps identify the largest contributors by presenting them first on the left side of the graph. 33 COPYRIGHT © 2008 BUSINESS OBJECTS S. A. ALL RIGHTS RESERVED.

Percent of Total - Pareto chart The Oracle built-in function ratio_to_report() over() enable to

Percent of Total - Pareto chart The Oracle built-in function ratio_to_report() over() enable to compute the percent of total required by the Pareto chart. 34 COPYRIGHT © 2008 BUSINESS OBJECTS S. A. ALL RIGHTS RESERVED.

Cumulative Percent of Total Here the percent of total for the two indicators is

Cumulative Percent of Total Here the percent of total for the two indicators is obtained with the ratio_to_report() over() SQL function. The cumulative sum is done within the BOBJ crosstab. The report indicates that Callahan generates the largest revenue with the smallest number of customers. We also observe that roughly half of the Sales (48. 7%) is generated by a quarter of the customers (22. 4%) … and by a quarter of the sales force (Callahan and King). 35 COPYRIGHT © 2008 BUSINESS OBJECTS S. A. ALL RIGHTS RESERVED.

Individual ratio versus group ratio We just saw how we can apply a percent

Individual ratio versus group ratio We just saw how we can apply a percent of total to additive indicators like Sales and Profit. Let’s assume we want to track the performance of Gross Profit rate calculated as Gross Profit divided by Gross Sales. The percent of total analysis will not work in that case since ratios are not additive. We could use the average of the Gross Profit rates in the group as the baseline but instead we will compare to the group Gross Profit rate obtained as: group total Gross Profit / group total Gross Sales. 36 COPYRIGHT © 2008 BUSINESS OBJECTS S. A. ALL RIGHTS RESERVED.

Individual ratio versus group ratio We use the rollup() SQL function in order to

Individual ratio versus group ratio We use the rollup() SQL function in order to generate on the fly a summary row that aggregates values for Sales, Profit and Profit rate. The blue cursors on both sides of the table indicate the group row which serves as the baseline. The categories above the baseline exhibit a better profit rate than the group. 37 COPYRIGHT © 2008 BUSINESS OBJECTS S. A. ALL RIGHTS RESERVED.

Measuring process performance Companies that implement a quality improvement methodology measure performance of their

Measuring process performance Companies that implement a quality improvement methodology measure performance of their processes by analyzing the variation. A process is given upper and lower specification limits (USL & LSL) and a target. Variation over the target is measured to evaluate how capable the process is relative to the specifications. One process performance measurement method is presented next for illustration. 38 COPYRIGHT © 2008 BUSINESS OBJECTS S. A. ALL RIGHTS RESERVED.

Process capability/performance Measuring process variations The Oracle built-in SQL function stats_one_way_anova() returns the variations.

Process capability/performance Measuring process variations The Oracle built-in SQL function stats_one_way_anova() returns the variations. The custom PL/SQL function phi() is required to get the ppm/defect rates. 39 COPYRIGHT © 2008 BUSINESS OBJECTS S. A. ALL RIGHTS RESERVED.

Topics Descriptive Statistics Linear Regression Correlation Performance measurement Distribution analysis Hypothesis Testing 40 COPYRIGHT

Topics Descriptive Statistics Linear Regression Correlation Performance measurement Distribution analysis Hypothesis Testing 40 COPYRIGHT © 2008 BUSINESS OBJECTS S. A. ALL RIGHTS RESERVED.

Frequency Histogram – 1/3 The bucketing of individual values is performed with the width_bucket()

Frequency Histogram – 1/3 The bucketing of individual values is performed with the width_bucket() function. The count aggregation is done by the rdbms. Empty buckets are handled in the BOBJ report. The data set is from the Oracle HR demo database. Salary exibits a distribution stretched out towards the right. 41 COPYRIGHT © 2008 BUSINESS OBJECTS S. A. ALL RIGHTS RESERVED.

Frequency Histogram – 2/3 Running the query. The 3 arguments of the width_bucket() function

Frequency Histogram – 2/3 Running the query. The 3 arguments of the width_bucket() function are prompted The end-user can enter a custom value: e. g. 0. 277 42 COPYRIGHT © 2008 BUSINESS OBJECTS S. A. ALL RIGHTS RESERVED. Listing the main power-transforms.

Frequency Histogram – 3/3 The square root transform brings some symmetry to the salary

Frequency Histogram – 3/3 The square root transform brings some symmetry to the salary data. The power-transforms require the following SQL functions: power() and ln(). 43 COPYRIGHT © 2008 BUSINESS OBJECTS S. A. ALL RIGHTS RESERVED.

Automatic bucketing The BOBJ document below requires no inputs from the user with regards

Automatic bucketing The BOBJ document below requires no inputs from the user with regards to the histogram buckets. We use the Sturges rule to determine the number of bars automatically. The histogram count aggregation is performed by the rdbms. The normal plot complements the histogram in assessing the normality of the data set. 44 COPYRIGHT © 2008 BUSINESS OBJECTS S. A. ALL RIGHTS RESERVED.

User-defined intervals In the sample below the end-user enters the histogram interval width. The

User-defined intervals In the sample below the end-user enters the histogram interval width. The count aggregation is done by the rdbms. The Gaussian curve is estimated within the BOBJ document. The blue bars represent the differences between the observed frequencies (histogram bars) and the estimated frequencies (fitted curve). The Gaussian curve is turned into a horizontal line (zero baseline). 45 COPYRIGHT © 2008 BUSINESS OBJECTS S. A. ALL RIGHTS RESERVED.

Weibull distribution The Hazard plot below uses the following SQL Functions: row_number() over() and

Weibull distribution The Hazard plot below uses the following SQL Functions: row_number() over() and sum() over(). The regression line is computed within the BOBJ document. 46 COPYRIGHT © 2008 BUSINESS OBJECTS S. A. ALL RIGHTS RESERVED.

Topics Descriptive Statistics Linear Regression Correlation Performance measurement Distribution analysis Hypothesis Testing 47 COPYRIGHT

Topics Descriptive Statistics Linear Regression Correlation Performance measurement Distribution analysis Hypothesis Testing 47 COPYRIGHT © 2008 BUSINESS OBJECTS S. A. ALL RIGHTS RESERVED.

t-test on a single group Comparing the mean to a target In this sample

t-test on a single group Comparing the mean to a target In this sample report we use the Oracle SQL function stats_t_test_one(). 48 COPYRIGHT © 2008 BUSINESS OBJECTS S. A. ALL RIGHTS RESERVED.

t-test for two paired groups Before-After comparison of mean In this sample report we

t-test for two paired groups Before-After comparison of mean In this sample report we use the Oracle SQL function stats_t_test_paired(). 49 COPYRIGHT © 2008 BUSINESS OBJECTS S. A. ALL RIGHTS RESERVED.

t-test for two independent groups Comparing means In this sample report we use the

t-test for two independent groups Comparing means In this sample report we use the Oracle SQL function stats_t_test_indep(). 50 COPYRIGHT © 2008 BUSINESS OBJECTS S. A. ALL RIGHTS RESERVED.

Analysis of variance Comparing means across multiple groups The Oracle SQL function stats_one_way_anova() is

Analysis of variance Comparing means across multiple groups The Oracle SQL function stats_one_way_anova() is used here. The following sample involves six groups. 51 COPYRIGHT © 2008 BUSINESS OBJECTS S. A. ALL RIGHTS RESERVED.