Knowledge and Discovery in Databases Analyzing Loan Data
HEENA KHAN, MADLEN IVANOVA, MAHUA DUTTA, VAIJYANT TOMAR
OVERVIEW
• BUSINESS UNDERSTANDING
• OBJECTIVES
• DATA UNDERSTANDING
• SOFTWARE USED
• DATA PREPARATION
• MODELING
• MODEL EVALUATION
• USE OF MODELS – CASE ANALYSIS
WHY DOES THE FINANCIAL INSTITUTION NEED TO KNOW WHAT FACTORS ARE DRIVING THE LOAN STATUS?
MAKE THE RIGHT CHOICE: PAID, COLLECTION, PAID AFTER COLLECTION
Follow the industry standard: CRISP-DM
• Business/Research Understanding
• Data Understanding
• Data Preparation
• Modeling
• Model Evaluation
• Use of Models
OBJECTIVES
• Identify distinguishing groups among the borrowers.
• Find the important factors that help predict whether a borrower will pay off her/his loan.
• Find the factors that help predict whether a borrower will repay the loan on time.
• Find which borrowers are more likely to pay off the loan (association analysis).
DATA UNDERSTANDING
• Data source
• Evaluate the data quality
• EDA
Data information:
– Number of attributes: 11
– Instances: 500
– “Date” type attributes: 3
– Numeric type attributes: 3
– Categorical type attributes: 5
Target attributes:
– Loan_status
– Paid_status (feature engineered)
LOAN DATA
HOW DO AGE AND GENDER COMPARE TO THE PAST DUE DATE?
HOW DOES THE NUMBER OF LOANS COMPARE TO THE PRINCIPAL AND LOAN STATUS?
HOW DOES THE NUMBER OF LOANS RELATE TO THE BORROWER EDUCATION AND LOAN STATUS?
SOFTWARE USED 1) R, R STUDIO 2) SAS VISUAL ANALYTICS & SAS MINER 3) TABLEAU 4) IBM WATSON ANALYTICS
DATA PREPARATION
Data cleaning
– Check for null values
– Check for outliers
– Remove irrelevant features
Variable transformation
– Transform the categorical variables (dummy coding)
– Standardize the numerical variables if needed
Binning
– Age
– Transform past_due_days to a binary attribute: Paid_status
Feature engineering
– Retrieve the duration out of the date type attributes
Target attributes:
• Loan_status: categorical
• Paid_status: flag
• Collection vs. Paid: flag
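The preparation steps listed above could be sketched roughly as follows. This is a minimal, standard-library-only illustration, not the project's actual R/SAS workflow: the records are toy data, every field name except past_due_days is an assumption, the age cut points are arbitrary, and the direction of the Paid_status flag (1 = no overdue days) is assumed.

```python
from datetime import date

# Toy records; every field name except past_due_days is an assumption.
loans = [
    {"age": 27, "education": "college", "effective_date": date(2016, 9, 8),
     "due_date": date(2016, 10, 7), "past_due_days": None},
    {"age": 45, "education": "high_school", "effective_date": date(2016, 9, 9),
     "due_date": date(2016, 9, 23), "past_due_days": 12},
]

def age_bin(age):
    """Bin age into coarse groups (cut points are illustrative only)."""
    if age < 30:
        return "under_30"
    if age < 40:
        return "30_to_39"
    return "40_plus"

def prepare(record, education_levels):
    """Apply dummy coding, binning, flagging and duration extraction to one record."""
    # Dummy coding: one 0/1 indicator per category level
    # (drop one level as the reference if the model requires it).
    row = {"education_" + level: int(record["education"] == level)
           for level in education_levels}
    # Binning: replace the raw age with an age group.
    row["age_bin"] = age_bin(record["age"])
    # Binary flag: a missing or zero past_due_days is treated as paid on time (assumed).
    overdue = record["past_due_days"] or 0
    row["paid_status"] = int(overdue == 0)
    # Feature engineering: loan duration in days from the two date attributes.
    row["duration_days"] = (record["due_date"] - record["effective_date"]).days
    return row

levels = sorted({r["education"] for r in loans})
prepared = [prepare(r, levels) for r in loans]
```

Each entry of `prepared` is a flat dictionary of model-ready features, so the same function can be mapped over the full table once the real column names are substituted.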
MODELING • Clustering - K-means
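As an illustration of the K-means idea (not the SAS configuration used in the project), the sketch below alternates between assigning each borrower to the nearest centroid and recomputing the centroids. The two features are hypothetical, already-scaled values, and the centroids are initialised from the first k points only to keep the example deterministic; real tools usually use random or k-means++ initialisation.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain K-means: returns (centroids, cluster index per point)."""
    # Deterministic initialisation for this sketch: the first k points.
    centroids = [list(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignments

# Hypothetical (scaled principal, scaled age) pairs for six borrowers.
borrowers = [(0.9, 0.2), (0.1, 0.8), (1.0, 0.3), (0.2, 0.9), (0.8, 0.1), (0.0, 1.0)]
centroids, labels = kmeans(borrowers, k=2)
```

Standardizing the numeric variables first matters here, because Euclidean distance would otherwise be dominated by whichever attribute has the largest range.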
SAS EXAMPLE
SAS EXAMPLE (CONT.)
FEATURE SELECTION - BORUTA
LOGISTIC REGRESSION
DECISION TREE
ASSOCIATION RULES
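The three measures usually reported for an association rule A → C are support, confidence and lift. The sketch below computes them for a single rule; the item labels and the four borrower "transactions" are made up for illustration and are not taken from the loan data.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    n = len(transactions)
    a, c = set(antecedent), set(consequent)
    count_a = sum(1 for t in transactions if a <= t)           # rows containing A
    count_c = sum(1 for t in transactions if c <= t)           # rows containing C
    count_both = sum(1 for t in transactions if (a | c) <= t)  # rows containing A and C
    support = count_both / n
    confidence = count_both / count_a if count_a else 0.0
    # Lift compares the rule's confidence with the baseline frequency of C.
    lift = confidence / (count_c / n) if count_c else 0.0
    return support, confidence, lift

# Hypothetical borrower profiles encoded as item sets.
profiles = [
    {"education=college", "gender=female", "status=paid"},
    {"education=college", "gender=male", "status=paid"},
    {"education=high_school", "gender=male", "status=collection"},
    {"education=college", "gender=male", "status=collection"},
]
support, confidence, lift = rule_metrics(profiles, {"education=college"}, {"status=paid"})
```

A lift above 1 suggests the antecedent makes the consequent more likely than its baseline rate; with so few rows, as in this toy set, such a value would not be reliable.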
MODEL EVALUATION
First objective: Clustering
Second objective: Do not have enough data
Third objective:
• Compare the models' accuracy
• Compare the Lift curves
• Compare the ROC plots
• Select the best one for each objective
Fourth objective: Look at the confidence, support and lift
LOGISTIC REGRESSION AUC of 0.7298
DECISION TREE AUC of 0.5404
RANDOM FOREST AUC of 0.7491
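For reference, an AUC value like those reported above can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The sketch below computes it that way and ranks two models; the labels, scores and model names are hypothetical and do not reproduce the project's results.

```python
def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical held-out labels (1 = paid on time) and predicted scores.
y_true = [1, 0, 1, 0, 1, 0]
model_scores = {
    "model_a": [0.9, 0.2, 0.8, 0.4, 0.3, 0.6],
    "model_b": [0.5, 0.5, 0.5, 0.5, 0.5, 0.5],  # uninformative baseline
}
ranking = sorted(
    ((name, auc(y_true, s)) for name, s in model_scores.items()),
    key=lambda item: item[1],
    reverse=True,
)
```

An AUC near 0.5 means the model separates the classes no better than chance, which is a useful yardstick when comparing candidates.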
MODEL RANKING
1. RANDOM FOREST
2. LOGISTIC REGRESSION
3. DECISION TREE
Based on the model ACCURACY, we choose the Random Forest as the best performing one.
USE OF MODELS
• Identify distinguishing groups among the borrowers: clustering
• Find the important factors that help predict whether a borrower will pay off her/his loan: gender, education
• Find the factors that help predict whether a borrower will repay the loan on time: decision tree to predict
• Find which borrowers are more likely to pay off the loan: association analysis
LIMITATIONS
• Data size
• Number of attributes
• Time constraint
REFERENCES
• https://www.r-bloggers.com/
• http://www.statmethods.net/stats/index.html
• https://www.kaggle.com/