An Automatic Intelligent Model Building System with Business Applications
Dr. Morgan C. Wang
Department of Statistics
University of Central Florida
Morgan C. Wang 2/5/2022

Outline
• Introduction
• System Description
• Case Study II
• Questions

Introduction

Accenture and GE Report
• 84% of companies believe that big data analytics will affect their industry in 2017
• 89% of companies believe that without adequate big data analytics capability their competitive advantage will erode and their market share will shrink

Accenture and GE Report
• Only 16% of all companies use big data analytics to perform predictive analytics
• Only 13% of all companies use big data analytics to optimize their working processes

Accenture and GE Report
• Analytics: summarizes facts, such as “Customer A's claim cost was $5,000 in 2016”
• Predictive analytics: accurately predicts what will happen in the future, such as “Customer A will file three claims costing us a total of $6,000 in 2017”

Accenture and GE Report
• By 2020, the internet market will grow to 350 billion, and 20% of it will be data analytics
• The analytics market will move from basic analytics to advanced analytics
• In China alone, the market size will be 70 billion

Shortage of Data Scientists
• Despite the activity around Big Data, there is still a significant shortage of skilled professionals who can truly be called Data Scientists: people who can evaluate business needs and impact, write the algorithms, and program platforms such as Hadoop.

System Description

Automatic-Intelligent Model Building System
• Yi. Ming Data has developed an automatic-intelligent model building system. The system has five components: a data exploration component, a data preparation component, a model building/validation/selection component, an automatic result generation/data scoring component, and a model understanding component. All components reside within the data warehouse and can be used by company personnel without extensive model building training.

Case Study I: Insurance Rate Making

Data

Data available for this study
• Training data: about 1,370,000 insurance policies from all 16 units in a big metropolitan area
• Validation and testing data: about 880,000 policies from the same metropolitan area in the following year
• 26 variables that this company has used for many years

Study Goal

Study Goal
• Build a more precise pricing model that helps this company target its customers more fairly
  ▪ Use price elasticity to bring in more low-risk customers through lower premiums
  ▪ Use price to discourage high-risk customers through higher premiums and, consequently, increase the profit margin
• Learn model building and data preparation techniques from Dr. Wang through

Challenges
1. Data Quality
2. Modeling Tool Selection

Challenges – Data Quality
• Data quality issues:
  ▪ Large number of missing values
  ▪ Significant number of categorical variables with high cardinality


Data Preparation – Missing Values
• Performance

Without MVP            74.04%
With MVP smoothing     74.98%
Improvement            1.27%

[Charts: “Improvement from 8 MI variables” and “Improvement from 8 MI variables + MVP”, each comparing actual loss with predicted loss across groups 1 to 10]

Modeling – GLM
• Before: without adequate data preparation
• After: with adequate data preparation

Model          Before    After     Improvement
Gini           18.25%    20.28%    11.12%
C statistic    86.50%    90.46%    4.58%

Conclusion: With adequate data preparation, model performance improved.
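The Improvement figures in the GLM table are consistent with a relative change, (After - Before) / Before, rather than a difference in percentage points. A minimal check using only the numbers on the slide; the function name is illustrative and not part of the presented system:

```python
def relative_improvement(before: float, after: float) -> float:
    """Relative change from `before` to `after`, expressed in percent."""
    return (after - before) / before * 100.0


# Values taken from the GLM slide (Before and After columns).
gini_gain = relative_improvement(18.25, 20.28)
c_gain = relative_improvement(86.50, 90.46)

print(f"Gini improvement:        {gini_gain:.2f}%")
print(f"C statistic improvement: {c_gain:.2f}%")
```

The same formula appears to reproduce the improvement figures reported on the missing-values and neural network slides.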

Modeling – Neural Network
• Before: without adequate data preparation
• After: with adequate data preparation

               Count     Average Claim    Combined
Before         75.10%    74.04%           86.50%
After          75.86%    74.26%           92.18%
Improvement    1.01%     0.30%            6.57%

Conclusion: With the neural network model, performance improved.

Pricing with Models
• A good model has a higher C statistic
• Higher premium for all customers whose predicted risk is higher than their current premium
• Lower premium for all customers whose predicted risk is lower than their current premium
• Higher Gini index after the new pricing

Case Study II: Customer Risk Score

Data

Data available for this study
• Data came from several different sources
• There are 4,800 default customers and 100,000 normal customers

Study Goal

Study Goal
• Utilize the hidden value in the data to
  ▪ Estimate the loan default risk for potential customers
  ▪ Assess the adequate loan amount for potential customers
  ▪ Identify the low-risk, high-value new customer cluster so that the marketing and sales departments can expand

Challenges
1. Data Quality
2. Modeling Tool Selection

Challenges
1. Data Quality
2. Modeling Tool Selection
3. Integration with Existing System

Challenges – Data Quality
• Data quality issues:
  ▪ Large number of missing values
  ▪ Data came from multiple sources at different times
  ▪ Some out-of-range values
  ▪ Significant number of categorical variables with high cardinality
  ▪ Non-linear relationships between the target variable “FLAG” and many numerical predictors

Consequence – Missing Values (cont.)
• A large amount of information will be lost
• Predictive power will be reduced
• Prediction results will be biased
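One common way to avoid discarding records with missing values is to impute them and keep a separate missing-indicator column. The sketch below assumes median imputation; the slides do not say which imputation or smoothing method the system uses, and the function name is hypothetical.

```python
from statistics import median


def impute_with_indicator(values):
    """Fill missing entries (None) with the median of the observed values.

    Returns the imputed list and a 0/1 indicator list marking which
    entries were originally missing. Median imputation is an assumption
    made for illustration only.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = median(observed)
    imputed = [fill if v is None else v for v in values]
    indicator = [1 if v is None else 0 for v in values]
    return imputed, indicator


income = [1.0, None, 3.0, 5.0]
imputed, was_missing = impute_with_indicator(income)
print(imputed)
print(was_missing)
```

Keeping the indicator lets a model learn from the fact that a value was missing, which is lost when rows are simply dropped.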

Consequence – Multiple Sources
• Ignoring the time dimension is the major reason that model performance is significantly lower on future scoring data than on the original modeling data
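One way to respect the time dimension is to hold out the most recent records for validation, as Case Study I does with policies from the following year. A minimal sketch; the record layout and the `as_of` field name are assumptions for illustration:

```python
from datetime import date


def split_by_time(records, cutoff):
    """Split records into a training set dated before `cutoff` and an
    out-of-time holdout dated on or after it."""
    train = [r for r in records if r["as_of"] < cutoff]
    holdout = [r for r in records if r["as_of"] >= cutoff]
    return train, holdout


# Hypothetical records; `as_of` is the date each record was observed.
records = [
    {"id": 1, "as_of": date(2020, 3, 1)},
    {"id": 2, "as_of": date(2020, 11, 15)},
    {"id": 3, "as_of": date(2021, 2, 10)},
]
train, holdout = split_by_time(records, cutoff=date(2021, 1, 1))
print(len(train), len(holdout))
```

Measuring performance on the holdout gives an estimate closer to what future scoring data will show than a random split of the modeling data would.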

Consequence – Out of Range Values
• Reduces the credibility of the study results
• Biases the model parameter estimates
• Decreases scoring accuracy on future cases
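A simple guard against out-of-range values is to check each variable against a valid range and treat violations as missing. This is a sketch under that assumption; the slides do not describe how the system handles such values, and the range below is made up for illustration.

```python
def clean_out_of_range(values, low, high):
    """Replace values outside [low, high] with None.

    Returns the cleaned list and the number of values replaced.
    Existing None entries are left untouched.
    """
    cleaned = []
    n_flagged = 0
    for v in values:
        if v is not None and (v < low or v > high):
            cleaned.append(None)
            n_flagged += 1
        else:
            cleaned.append(v)
    return cleaned, n_flagged


ages = [34, 51, -1, 27, 199]  # hypothetical customer ages
cleaned, n_flagged = clean_out_of_range(ages, low=18, high=100)
print(cleaned, n_flagged)
```

Converting to missing rather than clipping keeps an impossible value from being mistaken for a legitimate extreme one; which choice is appropriate depends on the variable.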

Consequence – High Cardinality
• Low-frequency categories produce unreliable results
• High cardinality increases model dimensionality
• Ignoring important high-cardinality variables can significantly reduce model reliability
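A common mitigation is to pool low-frequency levels into a single category before modeling. The sketch below illustrates that idea only; the threshold, the label, and the function name are assumptions, and the slides do not state which technique the system applies.

```python
from collections import Counter


def group_rare_levels(values, min_count=30, other_label="OTHER"):
    """Replace every level seen fewer than `min_count` times with `other_label`."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]


# Hypothetical occupation codes; a small threshold keeps the demo short.
codes = ["A", "A", "A", "B", "B", "C", "D"]
print(group_rare_levels(codes, min_count=2))
```

Pooling reduces the number of dummy columns and avoids estimating effects from a handful of records, at the cost of treating all rare levels as alike.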

Data Quality – Non-linearity (cont.)

Consequence – Non-linearity
• Most statistical procedures, including both multiple regression and generalized linear models, cannot fit the data adequately because of non-linearity
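A quick way to see whether a numerical predictor relates to the target non-linearly is to sort by the predictor, cut it into equal-sized groups, and compare the target rate in each group. The sketch below uses made-up data and a hypothetical function name; it is a diagnostic illustration, not the method used in the study.

```python
def event_rate_by_bin(x, flag, n_bins=5):
    """Sort records by x, split them into n_bins near-equal groups,
    and return the mean of the 0/1 flag within each group."""
    if len(x) != len(flag):
        raise ValueError("x and flag must have the same length")
    if len(x) < n_bins:
        raise ValueError("need at least one record per bin")
    pairs = sorted(zip(x, flag))
    n = len(pairs)
    rates = []
    for b in range(n_bins):
        chunk = pairs[b * n // n_bins:(b + 1) * n // n_bins]
        rates.append(sum(f for _, f in chunk) / len(chunk))
    return rates


# Made-up data: the flag is common at both extremes of x and rare in the middle.
x = list(range(10))
flag = [1, 1, 0, 0, 0, 0, 0, 0, 1, 1]
print(event_rate_by_bin(x, flag, n_bins=5))
```

A U-shaped pattern like this one cannot be captured by a single linear term, which is the situation the slide describes.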

Challenges
1. Data Quality
2. Modeling Tool Selection
3. Integration with Existing System

Modeling Tool Selection
• Statistical modeling tools (Linear Regression, Ridge Regression, Lasso Regression, Elastic Net, Generalized Linear Model, Least Angle Regression) cannot handle non-linearity well
• Tree-based modeling tools (Decision Trees, Random Forest, Gradient Boosting) cannot handle linearity effectively
• Neural networks cannot handle high-cardinality categorical variables very well, especially when some categories have very low frequency

Challenges
1. Data Quality
2. Modeling Tool Selection
3. Integration with Existing System

Integration with Existing System
• Integrate modeling results with our current system seamlessly, since I want to
  ▪ Integrate the model with our IT system seamlessly
  ▪ Allow sales personnel to use model results to improve their sales performance
  ▪ Allow the marketing department to use model results to identify new marketing opportunities and make wise marketing decisions
  ▪ Allow the internet sales department to approve loan applications in a timely manner
• I want to build models in a timely manner and spend more time using the models to help my company

Model Performance

Model Performance
• Model 1: Uses all attributes provided by Company A
• Model 2: Uses all attributes except Online Consumption Data
• Model 3: Uses all attributes except Online Consumption and Other Consumption Data
• Note: All date attributes were converted to a “numerical” length before modeling
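Converting a date attribute to a numerical length can be sketched as the number of days between the date and a fixed reference date. Both the unit (days) and the reference date below are assumptions; the slide does not specify either.

```python
from datetime import date


def date_to_length(d: date, reference: date) -> int:
    """Number of days from `d` up to `reference` (negative if `d` is later)."""
    return (reference - d).days


# Hypothetical account-opening date measured against a hypothetical scoring date.
opened = date(2021, 2, 5)
scoring_date = date(2022, 2, 5)
print(date_to_length(opened, scoring_date))
```

Using one reference date for all records keeps the resulting lengths comparable across customers.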

Model Performance
• Model performance based on c-statistics (AUC)
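The c-statistic (AUC) is the probability that a randomly chosen default customer receives a higher score than a randomly chosen normal customer, with ties counted as one half. A minimal pairwise sketch on made-up scores; it is quadratic in the number of records, so it is meant for illustration rather than for data of the size used in the study.

```python
def c_statistic(scores, labels):
    """Concordance statistic (AUC) for binary labels, where 1 marks the event.

    Counts the share of (event, non-event) pairs in which the event has
    the higher score; tied pairs count as 0.5.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one event and one non-event")
    concordant = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                concordant += 1.0
            elif p == n:
                concordant += 0.5
    return concordant / (len(pos) * len(neg))


# Made-up scores: three of the four (event, non-event) pairs are ordered correctly.
print(c_statistic([0.9, 0.8, 0.3, 0.2], [1, 0, 1, 0]))
```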

Model Performance
• Model performance based on Catch Rate

Model Performance
• Model performance based on Lift
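The slides do not define catch rate or lift. Under the usual definitions, catch rate is the share of all events found among the top-scored fraction of records, and lift is the event rate in that fraction divided by the overall event rate. A sketch under those assumed definitions, with made-up data:

```python
def catch_rate_and_lift(scores, labels, top_fraction=0.1):
    """Catch rate and lift among the highest-scored `top_fraction` of records.

    labels are 0/1 with 1 marking the event (for example, a default).
    """
    n = len(scores)
    total_events = sum(labels)
    if n == 0 or total_events == 0:
        raise ValueError("need at least one record and one event")
    k = max(1, int(n * top_fraction))
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    caught = sum(y for _, y in ranked[:k])
    catch_rate = caught / total_events
    lift = (caught / k) / (total_events / n)
    return catch_rate, lift


# Ten made-up records with two events; one event falls in the top 20% by score.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10]
labels = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]
print(catch_rate_and_lift(scores, labels, top_fraction=0.2))
```

With these definitions, lift equals catch rate divided by the fraction of records reviewed, so the two charts carry the same information on different scales.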

Model Performance
• All models perform excellently
• Online Consumption Data plays a very limited role in improving model performance
• Other Consumption Data plays an important role in improving model performance

Questions and Next Step