Data Analysis Case Study – Auto Claim Assignment
Ming Sun, American Family Insurance
- Slides: 14

About Myself
• Timeline: 1999-2005, 2005-2014, 2014-present
• Roles: Application Development, Solution Architecture, Data Science Engineering
• Areas of work: J2EE Web App, Java Batch Processing, Big Data Analytics, Mobile App, Application Integrations, Data Warehouse Integrations, Repeatable Data Science Pipelines, Exploratory Data Analysis, Data Lake Design, Technology Incubation

Analytical Solution Life-cycle
• Problem Definition (start here): Current State, Bottom Line CBA, Top Line Benefits, Data Sources
• Data Preparation: Data Domains, Data Quality, Data Design, Data Blend, Data Pipelines
• Model Development: Model Techniques, Model Performance, Model Pipelines
• Solution Deployment: Containerization, CI/CD, Monitor Pipelines, Model Registry

Problem Definition
Scope – Determine whether a damaged vehicle should be totaled or repaired at the early stage of an auto claim
• Current State: Point Based Model; Accuracy < 80%
• Bottom Line CBA: Annual savings amount; 10% lift ≈ $500k to $2M
• Top Line Benefits: Impact to customer satisfaction

Problem Definition – Data Sources
• Claim System – Old (DB2): 3rd normal form DB
• Claim System – New (Oracle): Partial Data
• 3rd Party Data (daily files): No Data
• Claims Data Warehouse (DB2)

Data Preparation – Data Domains
• Initial Claim (7-10 tables)
• Customer Satisfaction (2 files)
• Handling Assignment (6-8 tables)
• Code Description (10+ tables)
• 3rd Party Loss Estimates (5 files)
• Total Loss Workflow (2-4 tables)
• Salvage Info (2 tables)

Data Preparation – Grain/Quality/Blend
• The grain of the blended dataset: Vehicle
• Current snapshot of all closed auto collision claims
• Identify keys to blend claims, 3rd party estimates, and customer satisfaction
• Profile the blended dataset: record counts, missing values, column value distribution, correlation, etc.

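The blend-then-profile steps on this slide can be sketched in a few lines of plain Python. This is a minimal illustration, not the pipeline used in the case study: the key name `vehicle_id`, the column names, and the sample rows are all hypothetical, and a real blend would run against the claim tables and 3rd party files rather than in-memory lists.

```python
from collections import Counter


def blend(claims, estimates, satisfaction, key="vehicle_id"):
    """Left-join estimates and satisfaction rows onto claims at vehicle grain."""
    est_by_key = {row[key]: row for row in estimates}
    sat_by_key = {row[key]: row for row in satisfaction}
    blended = []
    for claim in claims:
        row = dict(claim)
        for lookup in (est_by_key, sat_by_key):
            match = lookup.get(claim[key], {})
            row.update({k: v for k, v in match.items() if k != key})
        blended.append(row)
    return blended


def profile(rows):
    """Report record count plus missing, distinct, and top values per column."""
    columns = sorted({col for row in rows for col in row})
    report = {"record_count": len(rows), "columns": {}}
    for col in columns:
        values = [row.get(col) for row in rows]
        present = [v for v in values if v is not None]
        report["columns"][col] = {
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
            "top_values": Counter(present).most_common(3),
        }
    return report


# Hypothetical sample rows, for illustration only.
claims = [
    {"vehicle_id": "V1", "outcome": "repairable"},
    {"vehicle_id": "V2", "outcome": "total_loss"},
    {"vehicle_id": "V3", "outcome": "repairable"},
]
estimates = [
    {"vehicle_id": "V1", "estimate": 3200},
    {"vehicle_id": "V2", "estimate": 15000},
]
satisfaction = [{"vehicle_id": "V1", "score": 5}]

report = profile(blend(claims, estimates, satisfaction))
print(report["record_count"], report["columns"]["estimate"]["missing"])
```

A left join keeps every closed claim in the blended set, so unmatched estimates or satisfaction surveys show up as missing values in the profile instead of silently dropping vehicles.
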
Problem Definition Analysis
Current Process: Vehicle Questionnaire
• Number of questions: 17
• 12 questions not answered > 80%
• Assignment Accuracy ≈ 80%
• Assigned Repairable, actual Total Loss ≈ 2x%
• Assigned Total Loss, actual Repairable ≈ x%
Mis-assigned Claim Costs
• Assigned Repairable, actual Total Loss ≈ $3y per claim
• Assigned Total Loss, actual Repairable ≈ $y per claim

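The rates and per-claim costs on this slide combine into an expected mis-assignment cost. The sketch below keeps x and y as parameters because the slide does not disclose them; the function name and the sample values passed at the bottom are made up for illustration.

```python
def misassignment_cost(x_pct, y_dollars, n_claims):
    """Expected mis-assignment cost for n_claims, using the slide's ratios.

    Assigned Repairable, actual Total Loss: rate 2x%, cost $3y per claim.
    Assigned Total Loss, actual Repairable: rate x%,  cost $y per claim.
    """
    repairable_but_total = n_claims * (2 * x_pct) * (3 * y_dollars) / 100
    total_but_repairable = n_claims * x_pct * y_dollars / 100
    return {
        "assigned_repairable_actual_total_loss": repairable_but_total,
        "assigned_total_loss_actual_repairable": total_but_repairable,
        "total": repairable_but_total + total_but_repairable,
    }


# Placeholder values only: x = 1 (%), y = $100, 1,000 claims.
costs = misassignment_cost(x_pct=1, y_dollars=100, n_claims=1000)
print(costs)
```

Whatever x and y turn out to be, the first error type accounts for 6 of every 7 dollars of mis-assignment cost (2x · 3y versus x · y), which suggests where an improved model would likely recover the most money.
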
Customer Satisfaction Impact Analysis
• 5 satisfaction score buckets, with 5 being the most satisfied
• False Positives have the worst impact, followed by False Negatives
• Customers are happy with True Negatives

Model Development

Models                 Misclassification Rate    ROC
Random Forest          0.136                     0.90
Logistic Regression    0.145                     0.89

Comparison Category       Which Model is Better
Technical Performance     Random Forest
Implementation Cost       Logistic Regression (200 vs 1000 hours)
Annual Saving Forecast    Tie

Winner – Logistic Regression

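For readers unfamiliar with the two metrics in the table, here is a minimal pure-Python sketch of how they can be computed. It assumes the "ROC" column reports the area under the ROC curve, which the slide does not state explicitly; the labels and scores below are invented and unrelated to the case-study data.

```python
def misclassification_rate(y_true, y_pred):
    """Fraction of predictions that disagree with the actual label."""
    errors = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return errors / len(y_true)


def roc_auc(y_true, scores):
    """Area under the ROC curve, computed as the probability that a random
    positive case scores higher than a random negative one (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Invented example: 1 = total loss, 0 = repairable.
y_true = [1, 0, 1, 0, 1, 0, 0, 1]
scores = [0.9, 0.2, 0.7, 0.4, 0.35, 0.1, 0.6, 0.8]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

rate = misclassification_rate(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(rate, auc)
```

The misclassification rate depends on the chosen threshold, while the area under the curve does not, which is one reason to report both when comparing models.
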
Model Development – Cont’d
Score scale, from low to high:
• Repairable Cutoff Point (low end)
• Manual Review (between the two cutoff points)
• Total Loss Cutoff Point (high end)

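The two-cutoff scheme on this slide amounts to a three-way routing rule. The sketch below assumes that scores below the repairable cutoff are assigned as repairable, scores above the total loss cutoff are assigned as total loss, and everything in between goes to manual review. The cutoff values 0.3 and 0.7 and the handling of scores exactly on a cutoff are assumptions, not figures from the case study.

```python
def route_claim(score, repairable_cutoff, total_loss_cutoff):
    """Route a scored claim into one of three handling paths."""
    if repairable_cutoff >= total_loss_cutoff:
        raise ValueError("repairable cutoff must be below total loss cutoff")
    if score < repairable_cutoff:
        return "repairable"
    if score > total_loss_cutoff:
        return "total_loss"
    # Scores on or between the cutoffs are left to a human adjuster.
    return "manual_review"


# Hypothetical cutoffs, for illustration only.
for score in (0.1, 0.5, 0.9):
    print(score, route_claim(score, repairable_cutoff=0.3, total_loss_cutoff=0.7))
```

Moving the cutoffs apart sends more claims to manual review and fewer to automatic assignment, so the cutoff choice trades reviewer workload against mis-assignment cost.
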
Solution Deployment
• Components: Logistic Regression, Points Assignment, UI, Claim System – New
Simplified Vehicle Questionnaire
• Questions: 12 → 8
• Answers: Y/N → List of Choices

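One common way to turn a logistic regression into a points assignment is to scale each coefficient and round it to an integer, so the questionnaire UI only has to add up points. The slide does not say how the conversion was done here, so the sketch below is an assumption; the question names, answer choices, coefficients, and scale factor are all hypothetical.

```python
def coefficients_to_points(coefficients, scale=10):
    """Convert {(question, choice): coefficient} into {question: {choice: points}}."""
    points = {}
    for (question, choice), coef in coefficients.items():
        points.setdefault(question, {})[choice] = round(coef * scale)
    return points


def questionnaire_score(points, answers):
    """Sum the points for the choice selected on each question."""
    return sum(points[question][choice] for question, choice in answers.items())


# Hypothetical coefficients; not taken from the case-study model.
example_coefficients = {
    ("airbags_deployed", "yes"): 1.84,
    ("airbags_deployed", "no"): 0.0,
    ("damage_area", "front"): 0.42,
    ("damage_area", "rear"): 0.31,
    ("damage_area", "multiple"): 1.27,
    ("drivable", "yes"): -0.93,
    ("drivable", "no"): 0.0,
}

points = coefficients_to_points(example_coefficients)
answers = {"airbags_deployed": "yes", "damage_area": "multiple", "drivable": "no"}
total_points = questionnaire_score(points, answers)
print(points, total_points)
```

Rounding loses a little precision relative to the fitted model, but an integer points table is easy to embed in a claim system and easy for adjusters to audit.
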
Takeaways
• Data analysis is critical throughout
• Keep the data scope reasonable
• Deep knowledge of business process and data
• Ease of implementation over model techniques
• Be conservative when estimating savings
• Pilot the solution first for 3-6 months to test
• It is a team effort (analysts, engineers, scientists)

Parting Thought – Data Preparation
• Most time-consuming work
• Tedious and not glamorous
• Foundational work – Data Lake
• Vulnerable to being the scapegoat