Analytics on Microsoft Excel Overview of solutions around

  • Slides: 63
Download presentation
Analytics on Microsoft Excel Overview of solutions around the platform from a real world

Analytics on Microsoft Excel Overview of solutions around the platform from a real world perspective Alberto Guillén 03. June 2008

§ Capgemini is a leading company with long experience in technology services Alberto Guillén

§ Capgemini is a leading company with long experience in technology services Alberto Guillén § We are one of the biggest actors in Business Intelligence in Norway Risk Management & Complianc § A major demand from our clients is delivering solutions in Microsoft Excel, we have continuously updated our efforts to adapt clients’ needs Consultant MSc. Mathematics MSc. Statistics Phone: +47 46444721 E-Mail: alberto. guillen@capgemini. com 1

Excel is more than worksheet functions and tables Basic statistical environment EXCEL Data Visualization

Excel is more than worksheet functions and tables Basic statistical environment EXCEL Data Visualization and Data Mining Desktop for analytical solutions Front for Data Warehouse Collection (SQL Server) GUI for in-house coded programs (VBA) Reporting and Monitoring Tool (Performance Point) Excel is the Industry standard for end-user calculations, and also as front interface 2

Analytical solutions can be created on different complexity layers beyond basic Excel Statistical Programming

Analytical solutions can be created on different complexity layers beyond basic Excel Statistical Programming Languages Data Mining Add-in Analysis Tool pack Excel Thirdparty Add-ins Solver VBA 3

Analytical solutions can be created on different complexity layers beyond basic Excel Statistical Programming

Analytical solutions can be created on different complexity layers beyond basic Excel Statistical Programming Languages Data Mining Add-in Analysis Tool pack Excel Thirdparty Add-ins Solver VBA 4

Analytical solutions can be created on different complexity layers beyond basic Excel Statistical Programming

Analytical solutions can be created on different complexity layers beyond basic Excel Statistical Programming Languages Data Mining Add-in Analysis Tool pack Excel Thirdparty Add-ins Solver VBA 5

Analytical solutions can be created on different complexity layers beyond basic Excel Statistical Programming

Analytical solutions can be created on different complexity layers beyond basic Excel Statistical Programming Languages Data Mining Add-in Analysis Tool pack Excel Thirdparty Add-ins Solver VBA 6

Analytical solutions can be created on different complexity layers beyond basic Excel Statistical Programming

Analytical solutions can be created on different complexity layers beyond basic Excel Statistical Programming Languages Data Mining Add-in Analysis Tool pack Excel Thirdparty Add-ins Solver VBA 7

Analytical solutions can be created on different complexity layers beyond basic Excel Statistical Programming

Analytical solutions can be created on different complexity layers beyond basic Excel Statistical Programming Languages Data Mining Add-in Analysis Tool pack Excel Thirdparty Add-ins Solver VBA 8

The average user masters the standard Excel tools Tables and Filters BASIC EXCE Worksheet

The average user masters the standard Excel tools Tables and Filters BASIC EXCE Worksheet functions Charts Pivot Tables ”What If” Analysis Standard Excel allows direct interaction with the raw data 9

The Analysis Toolpack makes rigorous analysis possible ANALYSIS TOOLPA § Free Microsoft download §

The Analysis Toolpack makes rigorous analysis possible ANALYSIS TOOLPA § Free Microsoft download § Created for statisticians § ANOVA § Correlation § Covariance § Descriptive Statistics § Exponential Smoothing § F-Test Two-Sample for Variances § Fourier Analysis § Histogram § Moving Average § Random Number Generation § Rank and Percentile § Regression § Sampling § t-Test § z-Test Basic statistical analyses are available through Analysis Toolpack 10

Solver leverages computational abilities SOLVER § Free Microsoft download § Optimizing and root finding

Solver leverages computational abilities SOLVER § Free Microsoft download § Optimizing and root finding set of algorithms § Can be called on the background from VBA § Practical but slow in heavy calculations § Not exact convergence sometimes! § Can be tuned § Needs good seeds Solver implements standard algorithms for mathematical optimization problems 11

Solver leverages computational abilities SOLVER § Free Microsoft download § Optimizing and root finding

Solver leverages computational abilities SOLVER § Free Microsoft download § Optimizing and root finding set of algorithms § Can be called on the background from VBA § Practical but slow in heavy calculations § Not exact convergence sometimes! § Can be tuned § Needs good seeds Solver implements standard algorithms for mathematical optimization problems 12

Third party add-ins provide easily new functionalities ADD-INS § Cheap § Simple § Easy

Third party add-ins provide easily new functionalities ADD-INS § Cheap § Simple § Easy to use § No development efforts Many small software developers use Excel as their GUI 13

There are several third party add-ins offering solutions on quantitative analysis and Monte Carlo

There are several third party add-ins offering solutions on quantitative analysis and Monte Carlo simulation § Monte Carlo add-ins: • Crystal Ball • Quantech’s @Risk 5. 0 • Risk. AMP • Lumenaut • Simtools • … ADD-INS § Also free. xla on the web Hundreds of free or cheap add-ins offer various solutions on fields like Risk Management 14

The Table Analysis Tools add-in brings data mining capabilities DATA MININ § Included in

The Table Analysis Tools add-in brings data mining capabilities DATA MININ § Included in the data mining add-in § Scenario analysis differs from the standard tool § Brings the power of data mining to end-users Data Mining is embedded into table functionality 15

The Data Mining add-in easies data mining to business analysts DATA MININ § What

The Data Mining add-in easies data mining to business analysts DATA MININ § What is Data Mining? • Data mining is frequently described as "the process of extracting valid, authentic, and actionable information from large databases. “ § Microsoft’s approach to Data Mining: • Business Intelligence with a user-friendly interface, accessible to end-users and developers § Software • SQL Server 2005/2008 (Visual Studio BI) • Excel/Visio add-ins • DMX • ADOMD. Net / AMO Microsoft brings Data Mining to business users for the first time 16

Microsoft takes a different approach to Data Mining DATA MININ Donald Farmer - Principal

Microsoft takes a different approach to Data Mining DATA MININ Donald Farmer - Principal Program Manager for Microsoft’s Data Mining "We don't have all the functionality of something like a SAS or an SPSS, because that's just not our market. […] Our market just has to be a much larger market“ “We have a huge database marketing team who do classic customer analysis. These guys were all SAS users, but when they joined Microsoft, they started using our tools. […], they actually use the Excel data mining add-ins to do it. It's not that there's nothing they don't miss, it's that they are able to achieve the same business results using our tools. “ "For a function such as 'Detect Categories, ' what the add-in is doing is building a clustering model in the background […], but we don't expect the Excel user to understand that. We just call it 'Build Categories, '“ "We're seeing a lot of interest in the Excel-side data mining, for one thing, but we're also seeing interest in the embed-ability, too. The people who are actually pushing this are from the developer side. Microsoft will not compete with traditional DM vendors, Microsoft targets other users 17

Data Mining assists in various business processes DATA MININ Top Business Scenarios for DM

Data Mining assists in various business processes DATA MININ Top Business Scenarios for DM § Cross-sell and up-sell Main DM tasks § Campaign management § Customer acquisition § Budget and forecasting § Customer retention § New fields: manufacturing, retail and entertaiment § Classification § Estimation § Prediction § Association § Clustering Data Mining is being used in several business areas 18

Data Mining is performed in SQL Server 2005 / 2008 DATA MININ SQL Server

Data Mining is performed in SQL Server 2005 / 2008 DATA MININ SQL Server Business Development Studio and DMX code is the natural environment 19

Data Mining is also accessible through Excel 2007 DATA MININ §The Excel add-in acts

Data Mining is also accessible through Excel 2007 DATA MININ §The Excel add-in acts as a client to an instance of Analysis Services §Both Excel and SQL Server Analysis Services support the full DM Cycle: Data understandi ng Data preparation Modeling Validation Deployment Excel sends DM queries and data directly to SQL Server Analysis Services 20

Data Mining is an iterative process DATA MININ A mining model is part of

Data Mining is an iterative process DATA MININ A mining model is part of a larger process that includes everything Problem Working environment ? Deployme nt This process can be defined by using the following six basic steps: §Defining the problem §Preparing Data §Exploring Data §Building Models §Exploring and Validating Models §Deploying and Updating Although the process is illustrated as circular, creating a data mining model is a dynamic and iterative process 21

There are 9 available Data Mining algorithms on Excel DATA MININ § Decision/Regression Trees

There are 9 available Data Mining algorithms on Excel DATA MININ § Decision/Regression Trees § Clustering § Naïve Bayes § Association rules § Sequence clustering § Time series § Neural Networks § Logistic regression § Linear regression § Plug-in algorithms • Third-party or self programmed implementing a set of COM interfaces 9 built-in algorithms can be tuned to obtain new ones 22

Decision and Regression trees find natural splits DATA MININ § Decision trees classify and

Decision and Regression trees find natural splits DATA MININ § Decision trees classify and find associations § Regression trees build segmented regressions § Example: • Identify potential buyers Decision trees give decision rules that are suitable to business understanding 23

Clustering finds homogeneous groups DATA MININ age § Example: Find segments of similar clients

Clustering finds homogeneous groups DATA MININ age § Example: Find segments of similar clients income Clustering can find hidden classes and identify outliers 24

Clustering finds homogeneous groups age DATA MININ ”older” age 2 cars no children §

Clustering finds homogeneous groups age DATA MININ ”older” age 2 cars no children § Example: Find segments of similar clients Middle age Many cars and children Young people No children income Clustering can find hidden classes and identify outliers 25

Naïve Bayes provides probabilities of group membership DATA MININ ? ? § Example: marketing

Naïve Bayes provides probabilities of group membership DATA MININ ? ? § Example: marketing campaign Naîve Bayes is an efficient method to asses probability of classification 26

Association rules unveils hidden logic DATA MININ § Example: Shopping Basket Association rules visualizes

Association rules unveils hidden logic DATA MININ § Example: Shopping Basket Association rules visualizes logical rules that underly your business 27

Sequence clustering finds event patterns in time DATA MININ § Example: Web navigation Sequence

Sequence clustering finds event patterns in time DATA MININ § Example: Web navigation Sequence clustering identifies clusters of similarly ordered events in a sequence 28

Time series forecasts processes in time DATA MININ § ARTx Microsoft proprietary algorithm §

Time series forecasts processes in time DATA MININ § ARTx Microsoft proprietary algorithm § ARIMA available in SQL Server 2008 historical predicted § Example: forecast seasonal sales to keep suitable stock The past patterns that it discovers can be used to predict values for future time steps. 29

Time series forecasts processes in time DATA MININ § ARTx Microsoft proprietary algorithm §

Time series forecasts processes in time DATA MININ § ARTx Microsoft proprietary algorithm § ARIMA available in SQL Server 2008 historical predicted § Example: forecast seasonal sales to keep suitable stock The past patterns that it discovers can be used to predict values for future time steps. 30

Neural networks discovers predictive patterns by learning DATA MININ § Example: fraud detection Neural

Neural networks discovers predictive patterns by learning DATA MININ § Example: fraud detection Neural networks learns in an uncontrolled manner 31

Logistic regression predicts binary responses DATA MININ § Microsoft Logistic Regression is implemented as

Logistic regression predicts binary responses DATA MININ § Microsoft Logistic Regression is implemented as a trivial neural network § Example: Probability of credit default based on personal information Logistic regression gives probabilities of ”YES/NO” given some attributes 32

Linear regression is of course also available DATA MININ § It is however extended

Linear regression is of course also available DATA MININ § It is however extended by Regression Trees (Linear Regression is implemented as a particular case) § Example: extrapolate the influence of oil price on house prices The classical linear regression is also integrated in the add-in 33

Chosen examples vs. real life problems DATA MININ § So far, we have seen

Chosen examples vs. real life problems DATA MININ § So far, we have seen chosen examples: • Shopping basket • Web navigation • Market segmentation • … § Unfortunately, it is not that easy; data Mining is a creative and unclear process. Sometimes there is no answer with data mining. § Books don’t show examples on when not to use the algorithms • Time series: long forecasts • Classification trees: credit scoring • Sequence clustering: non-markovian processes • … Bottom line: understand the statistical models behind the icons 34

Data Mining: using the Data Mining add-in to forecast Credit Default § Logistic regression

Data Mining: using the Data Mining add-in to forecast Credit Default § Logistic regression Ranked classes Number of payments Income DATA MININ A score Probability of default Age Civil status B C D § The algorithm uses Z-scores transformations § Cut-offs should be calculated when deploying After training the algorithm, probabilities of default can be predicted for new applicants 35

Data Mining: using the Data Mining add-in to forecast Debt Recovery DATA MININ %

Data Mining: using the Data Mining add-in to forecast Debt Recovery DATA MININ % recovered t § Problem: need the algorithm k-nearest neighbours • Can be implemented as a plug-in algorithm Some problems require creative approaches 36

Data Mining: using the Data Mining add-in to forecast Debt Recovery DATA MININ §

Data Mining: using the Data Mining add-in to forecast Debt Recovery DATA MININ § Hybrid between time series and regression trees % recovered Training… period t § Problem: need the algorithm k-nearest neighbours • Can be implemented as a plug-in algorithm Some problems require creative approaches 37

Data Mining: using the Data Mining add-in to forecast Debt Recovery DATA MININ §

Data Mining: using the Data Mining add-in to forecast Debt Recovery DATA MININ § Hybrid between time series and regression trees % recovered Training… period t § Problem: need the algorithm k-nearest neighbours • Can be implemented as a plug-in algorithm Some problems require creative approaches 38

Data Mining: using the Data Mining add-in to forecast Debt Recovery DATA MININ §

Data Mining: using the Data Mining add-in to forecast Debt Recovery DATA MININ § Hybrid between time series and regression trees % recovered Training… period t § Problem: need the algorithm k-nearest neighbours • Can be implemented as a plug-in algorithm Some problems require creative approaches 39

Data Mining: using the Data Mining add-in to forecast Debt Recovery DATA MININ §

Data Mining: using the Data Mining add-in to forecast Debt Recovery DATA MININ § Hybrid between time series and regression trees % recovered Training… period t § Problem: need the algorithm k-nearest neighbours • Can be implemented as a plug-in algorithm Some problems require creative approaches 40

Data Mining: using the Data Mining add-in to forecast Debt Recovery DATA MININ §

Data Mining: using the Data Mining add-in to forecast Debt Recovery DATA MININ § Hybrid between time series and regression trees % recovered Training… period t § Problem: need the algorithm k-nearest neighbours • Can be implemented as a plug-in algorithm Some problems require creative approaches 41

Data Mining: using the Data Mining add-in to forecast Debt Recovery DATA MININ §

Data Mining: using the Data Mining add-in to forecast Debt Recovery DATA MININ § Hybrid between time series and regression trees % recovered Predicting… One period older period Period = Period +1 Age = Age + … t § Problem: need the algorithm k-nearest neighbours • Can be implemented as a plug-in algorithm Some problems require creative approaches 42

Data Mining: using the Data Mining add-in to forecast Debt Recovery DATA MININ §

Data Mining: using the Data Mining add-in to forecast Debt Recovery DATA MININ § Hybrid between time series and regression trees % recovered x Training… period t § Problem: need the algorithm k-nearest neighbours • Can be implemented as a plug-in algorithm Some problems require creative approaches 43

Data Mining: using the Data Mining add-in to forecast Debt Recovery DATA MININ §

Data Mining: using the Data Mining add-in to forecast Debt Recovery DATA MININ § Hybrid between time series and regression trees % recovered x Predicting… period t § Problem: need the algorithm k-nearest neighbours • Can be implemented as a plug-in algorithm Some problems require creative approaches 44

VBA for Excel is the main tool for automated solutions VBA § Communication with

VBA for Excel is the main tool for automated solutions VBA § Communication with other software (COM Server) § Build algorithms not available in Excel § Automation of processes (“Macro programming”) § Easy and quick interaction with the solution through Active. X Buttons and Userforms § Possibility to embed analytical solutions in a simple user-front to end users without the right competence VBA allows in-house built solutions 45

VBA: Building a statistical tool for analyzing and forecasting Debt Collection VBA § With

VBA: Building a statistical tool for analyzing and forecasting Debt Collection VBA § With VBA it is possible to deliver customized solutions to end users § Problems: a lot of work to implement statistical algorithms, Solver can get slow VBA is the tool to use to provide end-users with an interactive work station 46

There are no limits with statistical programming languages R Statistical Programming Languages (COM Server)

There are no limits with statistical programming languages R Statistical Programming Languages (COM Server) Data Mining Add-in Analysis Tool pack Excel Thirdparty Add-ins Solver VBA 47

There are no limits with statistical programming languages R Statistical Programming Languages (COM Server)

There are no limits with statistical programming languages R Statistical Programming Languages (COM Server) 1991 DDE OLE 1996 Active. X DCOM 1999 COM+ . Net WCF Data Mining Add-in Analysis Tool pack Excel Thirdparty Add-ins Solver VBA 48

R is becoming the standard in the scientific community R § R is a

R is becoming the standard in the scientific community R § R is a statistical programming language with syntax similar to S-plus • R is free (under GNU license) • R uses statistical libraries created by statisticians all over the world § R communicates with Excel through a COM server • ”COM” is a set of interfaces that covers OLE, Active. X, DCOM, . . . § R Excel add-in • Background mode • Small Ribbon toolbar • Fast • Code embedded in: - Worksheet functions - VBA - Cells R allows analysts to implement the most advanced mathematical models 49

Histograms are a dangerous tool to approximate empirical pdf’s R With standard Excel: 50

Histograms are a dangerous tool to approximate empirical pdf’s R With standard Excel: 50

Histograms are a dangerous tool to approximate empirical pdf’s R With standard Excel: 51

Histograms are a dangerous tool to approximate empirical pdf’s R With standard Excel: 51

Histograms are a dangerous tool to approximate empirical pdf’s R With standard Excel: 52

Histograms are a dangerous tool to approximate empirical pdf’s R With standard Excel: 52

With the R add-in, advanced semiparametric methods are available R With the R add-in

With the R add-in, advanced semiparametric methods are available R With the R add-in for Excel: Empirical probability distribution functions are easily approximated with the R add-in 53

With the R add-in, advanced semiparametric methods are available R With the R add-in

With the R add-in, advanced semiparametric methods are available R With the R add-in for Excel: Empirical probability distribution functions are easily approximated with the R add-in 54

With the R add-in, advanced semiparametric methods are available R With the R add-in

With the R add-in, advanced semiparametric methods are available R With the R add-in for Excel: Empirical probability distribution functions are easily approximated with the R add-in 55

With the R add-in, advanced semiparametric methods are available R With the R add-in

With the R add-in, advanced semiparametric methods are available R With the R add-in for Excel: Empirical probability distribution functions are easily approximated with the R add-in 56

With the R add-in, advanced semiparametric methods are available R With the R add-in

With the R add-in, advanced semiparametric methods are available R With the R add-in for Excel: Empirical probability distribution functions are easily approximated with the R add-in 57

§ 600 § 0 §'Jan 96' §'May 96' §'Sep 96' §'Jan 97' §'May 97'

§ 600 § 0 §'Jan 96' §'May 96' §'Sep 96' §'Jan 97' §'May 97' §'Sep 97' §'Jan 98' §'May 98' §'Sep 98' §'Jan 99' §'May 99' §'Sep 99' §'Jan 00' §'May 00' §'Sep 00' §'Jan 01' §'May 01' §'Sep 01' §'Jan 02' §'May 02' §'Sep 02' §'Jan 03' §'May 03' §'Sep 03' §'Jan 04' §'May 04' §'Sep 04' §'Jan 05' §'May 05' §'Sep 05' §'Jan 06' §'May 06' §'Sep 06' §'Jan 07' §'May 07' §'Sep 07' §'Jan 08' Statistical programming languages: problem case § 400 R § Forecasting Oslo Børs Hovedindeks §Oslo Børs § 500 ? § 300 § 200 § 100 … Some problems demand advanced statistical approaches 58

Statistical programming languages: problem case R § Using Monte Carlo simulation to predict default

Statistical programming languages: problem case R § Using Monte Carlo simulation to predict default in Specialized Lending NIBOR Oil price Complex multivariate Monte Carlo models are developed fast in R 59

Industrial solutions are another alternative Statistical Programming Languages (COM Server) Industry vendo Industrialized Vendors

Industrial solutions are another alternative Statistical Programming Languages (COM Server) Industry vendo Industrialized Vendors Analysis Tool pack Data Mining Add-in Excel Thirdparty Add-ins Solver VBA 60

Industrial solutions should be chosen only if the area requires it Industry vendo When

Industrial solutions should be chosen only if the area requires it Industry vendo When to consider Industrial Solutions: § Big companies (important deployment) § Special industrial subject area § Data warehouse integration § Highly competent staff within Analytics § Expensive investment: study worthiness § Whatever vendor, check Excel compatibilities (reporting, platform migrations, …) The previously presented alternatives for Excel can do their job at end-user level 61

Further references § Capgemini: • www. no. capgemini. com (alberto. guillen@capgemini. com) § Microsoft

Further references § Capgemini: • www. no. capgemini. com (alberto. guillen@capgemini. com) § Microsoft Data Mining: • http: //www. sqlserverdatamining. com • http: //www. microsoft. com/sqlserver/2008/en/us/data-mining- addins. aspx § R: • http: //www. r-project. org • http: //sunsite. univie. ac. at/rcom/ 62