A mathematical model for fraud prediction and control

A mathematical model for fraud prediction and control planning of electrical company clients
June 2010

Authors:
- Camilla Bruni (Italy), Università degli Studi di Firenze
- Herberth Espinoza (Peru), Universidad Complutense de Madrid
- Antoine Levitt (France), University of Oxford
- María Loreto Luque (Spain), Universidad Complutense de Madrid
- Carlos Parra (Spain), Universidad Complutense de Madrid
- Aranzazu Pérez (Spain), Universidad Complutense de Madrid
- Elisa Pérez (Spain), Universidad Complutense de Madrid

Coordinators:
- Dr. Benjamin Ivorra (Universidad Complutense de Madrid)
- Dr. Juan Tejada (Universidad Complutense de Madrid)
- D. Jorge Juan Suerias (Neo Metrics)
- D. Fernando Fernández (Neo Metrics)
- Pilar Gómez (Neo Metrics)

Outline
• Problem Description
• Data Analysis
• Mathematical Modeling
• Numerical Validation
• Conclusions and Perspectives

I – PROBLEM DESCRIPTION

Context
• An electrical company keeps a crew of inspectors in Chile to check whether customers are manipulating their electrical meters.
• Each check has a cost, and the company wishes to identify the customers with a higher risk of fraud in order to maximize benefits in a possible control campaign.

Objectives of the work
• Currently, the company's policy of checking customers at random detects 6.6% of fraud.
• Using the data set provided by the company, we created a model to design an improved control campaign. To do so, we:
  – Explored and treated the data using SAS.
  – Fitted a logistic regression model using SAS and calculated the lift-chart/ROC model performance indicators.
  – Used the model to find the optimal control campaign, using Matlab, and validated the results.

II – DATA ANALYSIS

DATA ANALYSIS
What does Neometrics supply us? We received from Neometrics the database train.csv.
Database characteristics:
– 79,459 records
– 49 variables
– 30 MB

DATA DESCRIPTION
We have 49 explanatory variables, which can be divided into several groups according to the customer characteristic they describe (economic, geographical, technical…).
We select and simplify the most representative variables. To do so:
– We categorize non-linear continuous variables according to the fraud proportion in the population.
– For groups that contain both categorical and quantitative variables, we use discrimination techniques (binary tree).
– For groups containing only quantitative variables, we apply principal component analysis (PCA) to examine their correlations.

Variable simplification
In order to identify the non-linear continuous variables and their representative values, we calculate:
• The population proportion in each group
• The fraud proportion in each group
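The two proportions above can be sketched as follows. This is a minimal illustration in Python, not the SAS code used for the work; the function name, the toy bins and the exact definitions (share of all clients that fall in a group, and share of a group's clients that are fraudulent) are assumptions, since the original formulas are not available.

```python
from collections import defaultdict

def group_proportions(groups, fraud_flags):
    """Return {group: (population_share, fraud_rate)}.

    Assumed definitions (not taken from the slides):
      population_share = clients in the group / all clients
      fraud_rate       = fraudulent clients in the group / clients in the group
    """
    counts = defaultdict(int)
    frauds = defaultdict(int)
    for group, flag in zip(groups, fraud_flags):
        counts[group] += 1
        frauds[group] += int(flag)
    total = len(groups)
    return {g: (counts[g] / total, frauds[g] / counts[g]) for g in counts}

# Hypothetical toy data: a continuous variable already cut into bins,
# and a 0/1 fraud indicator for each client.
bins = ["low", "low", "low", "high", "high", "mid", "mid", "mid"]
fraud = [0, 0, 1, 1, 1, 0, 0, 0]
props = group_proportions(bins, fraud)
for name, (share, rate) in sorted(props.items()):
    print(f"{name}: population share {share:.3f}, fraud rate {rate:.3f}")
```

Bins whose fraud rate differs clearly from the overall rate are natural candidates for the categories of the simplified variable.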

VARIABLES SELECTION
Principal component analysis (PCA) is applied to the Debt and Payment group of variables. The aim of PCA is to reduce the number of observed variables for each individual while keeping as much of the variability as possible. PCA is based on the spectral analysis of the correlation matrix. This process was performed using SAS.
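As an illustration of the spectral analysis of the correlation matrix, here is a minimal pure-Python sketch. It is not the SAS procedure used for the work: it extracts only the first principal component, by power iteration, and the three toy columns and all function names are hypothetical.

```python
import math

def correlation_matrix(columns):
    """Pearson correlation matrix of a list of equally long columns."""
    n = len(columns[0])
    standardized = []
    for col in columns:
        mean = sum(col) / n
        std = math.sqrt(sum((x - mean) ** 2 for x in col) / n)
        standardized.append([(x - mean) / std for x in col])
    p = len(columns)
    return [[sum(a * b for a, b in zip(standardized[i], standardized[j])) / n
             for j in range(p)] for i in range(p)]

def leading_component(matrix, iterations=200):
    """Largest eigenvalue and its unit eigenvector, by power iteration."""
    p = len(matrix)
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    mv = [sum(matrix[i][j] * v[j] for j in range(p)) for i in range(p)]
    eigenvalue = sum(a * b for a, b in zip(v, mv))  # Rayleigh quotient
    return eigenvalue, v

# Hypothetical toy columns standing in for three debt/payment variables.
columns = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.0, 4.0, 6.0, 8.0, 10.0],
    [5.0, 3.0, 4.0, 1.0, 2.0],
]
corr = correlation_matrix(columns)
eigenvalue, direction = leading_component(corr)
# The eigenvalues of a correlation matrix sum to the number of variables,
# so this ratio is the share of variance carried by the first component.
share = eigenvalue / len(columns)
print(f"first component explains {share:.1%} of the variance")
```

A full PCA would extract the remaining components as well and keep as many as are needed to reach the desired share of explained variance.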

SELECTION OF VARIABLES
We start with the principal component analysis for groups of variables: calculated variables of Debt and Payment. The most relevant variables are: deuda_ult_mean, max_deuda, dif_ult_mean_deuda, mean_dif_deuda. We repeat the process. After finishing the procedure, we select three variables, accounting for 90% of the information.

Study Of Significance
• We carry out a study of the most significant variables that we might include in our model, for example: Geographic Variables, Connection Identifiers, Customer Characteristic Groups.
• proc logistic; selection = stepwise
• Pearson Correlation Coefficient

Study Of Multicollinearity
• VIF = variance inflation factor
• The VIF represents the increase in the variance due to the presence of multicollinearity.
• The VIF takes values from a minimum of 1, when there is no multicollinearity.
→ Other procedures: Discriminant Analysis (DISCRIM)
• max_deuda
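For intuition, the VIF of a predictor is 1 / (1 − R²), where R² comes from regressing that predictor on the others. With only two predictors, R² reduces to the squared Pearson correlation between them, which gives the short sketch below. The data and function names are hypothetical, and the two-predictor restriction is a simplification chosen for this example.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def vif_two_predictors(x, y):
    """VIF of either predictor when the model has exactly two predictors."""
    r = pearson(x, y)
    return 1.0 / (1.0 - r * r)

# Hypothetical toy predictors.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [5.0, 3.0, 4.0, 1.0, 2.0]    # correlation with x1 is -0.8
x3 = [1.0, -1.0, 0.0, -1.0, 1.0]  # uncorrelated with x1

vif_correlated = vif_two_predictors(x1, x2)
vif_uncorrelated = vif_two_predictors(x1, x3)
print(vif_correlated, vif_uncorrelated)
```

The uncorrelated pair gives the minimum value of 1, and the value grows without bound as the correlation approaches ±1.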

Classification trees
• To obtain a good predictive model
• Decision support models that can be applied to the identification of fraudsters

III – MATHEMATICAL MODELING

Regression
• Predict a random variable from a set of explanatory variables.
• In our case, predict fraud probability using client information.
• Well-studied problem with numerous applications (spam filters, image classification, insurance policies...).
• Simplest idea: linear regression.

Logistic regression
• The quantity to predict is a probability, so it must lie between 0 and 1. Linear regression does not respect this constraint.
• Map the result of the linear regression to a probability with the logistic curve.
• We calibrate the model parameters on the training set (find the optimal a and b) and test on the validation set.
• Use an optimisation procedure to fit the model.
• Other possibilities: add terms to the model, such as interactions, nonlinearities, etc.
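A minimal sketch of this idea, assuming a single explanatory variable x and the standard logistic curve p = 1 / (1 + exp(−(a·x + b))). The parameters a and b follow the slide's notation; the toy data, the function names and the use of plain gradient descent (in place of the SAS fitting procedure used for the work) are assumptions made for illustration.

```python
import math

def sigmoid(z):
    """Logistic curve: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, learning_rate=0.1, steps=5000):
    """Fit p = sigmoid(a * x + b) by gradient descent on the mean log-loss."""
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(xs, ys):
            error = sigmoid(a * x + b) - y  # derivative of log-loss w.r.t. z
            grad_a += error * x
            grad_b += error
        a -= learning_rate * grad_a / n
        b -= learning_rate * grad_b / n
    return a, b

# Hypothetical training set: one feature and a 0/1 fraud label.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [0, 0, 1, 0, 1, 0, 1, 1]

a, b = fit_logistic(xs, ys)
probs = [sigmoid(a * x + b) for x in xs]
print(f"a = {a:.3f}, b = {b:.3f}")
```

Every predicted value lies strictly between 0 and 1, which is the constraint that a plain linear fit cannot guarantee.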

Model evaluation
In order to validate our model:
• We compare estimated fraud probabilities to measured outcomes on the validation set.
• Use different metrics: ROC curve, lift chart, etc.
• Use the results to establish the optimal control policy.

ROC curve
• X-axis: false positive rate; Y-axis: true positive rate.
• For a given allowed false positive rate (1 − specificity), determine the success rate (sensitivity).
• Perfect model: Y = 1; random model: Y = X; bad model: Y = 0.
• Goal: get above Y = X.
• The area under the curve is a good indication of model quality: the c-value.
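The quantities on this slide can be sketched in a few lines of Python. The scores, labels and function names below are hypothetical. The area under the curve is computed through its pairwise interpretation: the fraction of (fraudster, non-fraudster) pairs in which the fraudster receives the higher score, with ties counted as one half.

```python
def roc_point(scores, labels, threshold):
    """(false positive rate, true positive rate) when flagging score >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    positives = sum(labels)
    negatives = len(labels) - positives
    return fp / negatives, tp / positives

def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical predicted fraud probabilities and observed outcomes.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]

fpr, tpr = roc_point(scores, labels, threshold=0.65)
area = auc(scores, labels)
print(f"FPR = {fpr:.3f}, TPR = {tpr:.3f}, AUC = {area:.3f}")
```

A model that ranks clients at random would give an area close to 0.5, which corresponds to the diagonal Y = X.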

LIFT CHART
• We sort the clients by decreasing fraud probability.
• The X-axis represents the percentage of the population, taken in that order.
• The Y-axis represents a rate, Rate(α), calculated for each percentage α.
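The formula for the rate is not available here, so the sketch below uses the common definition of lift as an assumption: the fraud rate among the top α fraction of ranked clients divided by the fraud rate in the whole population. It may differ from the rate used in the original chart. The data and function name are hypothetical.

```python
def lift_at(scores, labels, alpha):
    """Lift for the top `alpha` fraction of clients ranked by score.

    Assumed definition: fraud rate in the selected clients divided by
    the fraud rate in the whole population.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    k = max(1, int(alpha * len(ranked)))  # number of clients selected
    top_rate = sum(y for _, y in ranked[:k]) / k
    base_rate = sum(labels) / len(labels)
    return top_rate / base_rate

# Hypothetical predicted fraud probabilities and observed outcomes.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]

half = lift_at(scores, labels, 0.5)
everyone = lift_at(scores, labels, 1.0)
print(half, everyone)
```

Under this definition, selecting the whole population always gives a lift of 1, and a useful model gives values above 1 for small α.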

IV – NUMERICAL VALIDATION

Results
• Use only a subset of the available parameters: zone, type of line, past fraud history, past payment history.
• Categorise continuous variables.
• Final model: 8 categorical variables, with or without pairwise interactions.
• Model trained on 40k clients, validated on 40k others.

LIFT-Chart graph and ROC values

Cost-benefit analysis – Best model (panels: Train, Validation)

Cost-benefit analysis – Extreme models (panels: Best model, Random model)
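One simple way to turn predicted fraud probabilities into a control campaign is sketched below. It is not the Matlab optimisation used for the work. It assumes a fixed cost per inspection and a fixed gain per detected fraud, both hypothetical figures, and picks the number of top-ranked clients to inspect that maximizes the expected profit.

```python
def best_campaign(probabilities, gain_per_fraud, cost_per_check):
    """Return (clients to inspect, expected profit) for the best campaign.

    Clients are inspected in order of decreasing fraud probability.
    Inspecting a client with probability p adds p * gain_per_fraud minus
    cost_per_check to the expected profit (an assumed profit model).
    """
    ranked = sorted(probabilities, reverse=True)
    best_k, best_profit, profit = 0, 0.0, 0.0
    for k, p in enumerate(ranked, start=1):
        profit += p * gain_per_fraud - cost_per_check
        if profit > best_profit:
            best_k, best_profit = k, profit
    return best_k, best_profit

# Hypothetical probabilities from the model, plus made-up monetary figures.
probabilities = [0.05, 0.6, 0.3, 0.1, 0.8]
k, expected_profit = best_campaign(probabilities, gain_per_fraud=100.0, cost_per_check=20.0)
print(f"inspect the top {k} clients, expected profit {expected_profit:.1f}")
```

Under this profit model, a client is worth inspecting whenever the predicted probability exceeds the ratio of cost to gain, which is why a better ranking translates directly into a more profitable campaign than random checks.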

V – CONCLUSIONS AND PERSPECTIVES

Summary
• Data analysis to isolate interesting variables
• Logistic regression to predict fraud probabilities
• Evaluation of the model (ROC, lift)
• Use in a cost-benefit analysis
• Concrete results of use to the client

Future Work
• Automatic data analysis (binary trees, principal component analysis, etc.)
• Carefully choose the variables to include
• Other types of prediction (neural networks, decision trees, etc.)
• More complex optimization processes

THANK YOU