A mathematical model for fraud prediction and control
- Slides: 33
A mathematical model for fraud prediction and control planning of electrical company clients
June 2010
Authors:
- Camilla Bruni (Italy), Università degli Studi di Firenze
- Herberth Espinoza (Peru), Universidad Complutense de Madrid
- Antoine Levitt (France), University of Oxford
- María Loreto Luque (Spain), Universidad Complutense de Madrid
- Carlos Parra (Spain), Universidad Complutense de Madrid
- Aranzazu Pérez (Spain), Universidad Complutense de Madrid
- Elisa Pérez (Spain), Universidad Complutense de Madrid
Coordinators:
- Dr. Benjamin Ivorra (Universidad Complutense de Madrid)
- Dr. Juan Tejada (Universidad Complutense de Madrid)
- D. Jorge Juan Suerias (Neo Metrics)
- D. Fernando Fernández (Neo Metrics)
- Pilar Gómez (Neo Metrics)
neometrics
Outline
• Problem Description
• Data Analysis
• Mathematical Modeling
• Numerical Validation
• Conclusions and Perspectives
I – PROBLEM DESCRIPTION
Context
• An electrical company keeps a crew of inspectors in Chile to check whether customers are tampering with their electricity meters.
• Each check has a cost, so the company wants to identify the customers with the highest risk of fraud in order to maximize the benefit of a possible control campaign.
Objectives of the work
• Currently, the company's policy of checking customers at random detects 6.6% of fraud.
• Using the data set provided by the company, we created a model to design an improved control campaign. To do so, we:
– explored and treated the data using SAS;
– fitted a logistic regression model using SAS and calculated the lift-chart and ROC performance indicators;
– used the model to find the optimal control campaign with Matlab, and validated the results.
II – DATA ANALYSIS
DATA ANALYSIS
What does Neometrics supply us? We received the database train.csv from Neometrics.
Database characteristics:
– 79,459 records
– 49 variables
– 30 MB
DATA DESCRIPTION
We have 49 explanatory variables, which can be divided into several groups according to the customer characteristic they describe (economic, geographical, technical…). We select and simplify the most representative variables. To do so:
– We categorize non-linear continuous variables according to the fraud proportion in the population.
– We analyze the groups that contain both categorical and quantitative variables with discrimination techniques (binary trees).
– For groups containing only quantitative variables, we apply principal component analysis (PCA) to examine their correlations.
Variable simplification
In order to identify the non-linear continuous variables and their representative values, we calculate, for each group:
• the population proportion
• the fraud proportion
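The formulas on this slide did not survive extraction. As a minimal sketch, assuming the population proportion of a group is its share of all clients and the fraud proportion is the share of fraudulent clients inside the group, the two quantities can be computed as follows. The function name, the debt bands and the toy data are hypothetical; the original work used SAS.

```python
from collections import defaultdict

def group_proportions(groups, fraud_flags):
    """Return {group: (population_share, fraud_rate)}.

    population_share = clients in group / all clients
    fraud_rate       = fraudulent clients in group / clients in group
    (Assumed definitions; the slide's own formulas were lost.)
    """
    counts = defaultdict(int)
    frauds = defaultdict(int)
    for g, f in zip(groups, fraud_flags):
        counts[g] += 1
        frauds[g] += int(f)
    total = len(groups)
    return {g: (counts[g] / total, frauds[g] / counts[g]) for g in counts}

# Hypothetical toy data: a debt band per client and a 0/1 fraud flag.
bands = ["low", "low", "low", "high", "high", "low", "high", "low"]
fraud = [0, 0, 1, 1, 1, 0, 0, 0]
props = group_proportions(bands, fraud)
for band, (share, rate) in sorted(props.items()):
    print(f"{band}: share={share:.3f} fraud_rate={rate:.3f}")
```

Bands whose fraud rate differs sharply from their neighbours are natural cut points when turning a continuous variable into categories.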
VARIABLES SELECTION
Principal component analysis (PCA) is applied to the Debt and Payment group of variables. The aim of PCA is to reduce the number of variables observed for each individual while keeping as much of the variability as possible. PCA is based on the spectral analysis of the correlation matrix. This process was performed using SAS.
SELECTION OF VARIABLES
We start with the principal component analysis for groups of variables, here the calculated variables of Debt and Payment. The most relevant variables are: deuda_ult_mean, max_deuda, dif_ult_mean_deuda, mean_dif_deuda.
After repeating the process, we select three variables, accounting for 90% of the information.
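The spectral analysis of the correlation matrix can be illustrated with a short pure-Python sketch. It builds the correlation matrix of a toy data set and extracts its eigenvalues by power iteration with deflation. This is a stand-in technique chosen for brevity, not the SAS procedure used in the study, and the data and function names are hypothetical.

```python
import math

def correlation_matrix(rows):
    """Pearson correlation matrix of a list of equal-length numeric rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n)
            for j in range(d)]
    z = [[(r[j] - means[j]) / stds[j] for j in range(d)] for r in rows]
    return [[sum(zr[i] * zr[j] for zr in z) / n for j in range(d)]
            for i in range(d)]

def leading_eigenpairs(matrix, k, iters=500):
    """Top-k eigenpairs of a symmetric positive semi-definite matrix,
    found by power iteration with deflation."""
    d = len(matrix)
    a = [row[:] for row in matrix]
    pairs = []
    for _ in range(k):
        v = [1.0 / (j + 1) for j in range(d)]  # generic starting vector
        lam = 0.0
        for _ in range(iters):
            w = [sum(a[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
            lam = norm
        pairs.append((lam, v))
        # Deflate: subtract the component that was just found.
        a = [[a[i][j] - lam * v[i] * v[j] for j in range(d)] for i in range(d)]
    return pairs

# Hypothetical toy data: three made-up debt-style columns for six clients.
rows = [
    [1.0, 2.1, 5.0],
    [2.0, 3.9, 3.0],
    [3.0, 6.2, 4.0],
    [4.0, 8.1, 1.0],
    [5.0, 9.8, 2.5],
    [6.0, 12.2, 0.5],
]
corr = correlation_matrix(rows)
pairs = leading_eigenpairs(corr, k=3)
eigenvalues = [lam for lam, _ in pairs]
# The trace of a correlation matrix equals the number of variables, so
# eigenvalue / number of variables is the share of variance explained.
shares = [lam / len(corr) for lam in eigenvalues]
print([round(s, 3) for s in shares])
```

Components are kept in decreasing order of eigenvalue until the cumulative share reaches the desired threshold, such as the 90% quoted above.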
Study Of Significance
• We carry out a study of the most significant variables that we might include in our model, for example: geographic variables, connection identifiers, customer characteristic groups.
• proc logistic; selection = stepwise
• Pearson correlation coefficient
Study Of Multicollinearity
• VIF = variance inflation factor
• The VIF represents the increase in variance due to the presence of multicollinearity.
• The VIF takes a minimum value of 1 when there is no multicollinearity.
→ Other procedures: discriminant analysis (DISCRIM)
• max_deuda
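For standardized variables, the VIF of variable j equals the j-th diagonal entry of the inverse correlation matrix, which is the same as 1 / (1 − R²) from regressing variable j on the others. The sketch below uses that identity with a small Gauss-Jordan inversion. The correlation values are hypothetical and the function names are illustrative only.

```python
def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination
    with partial pivoting."""
    n = len(matrix)
    aug = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular: perfect multicollinearity")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [x / p for x in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def vif(corr):
    """Variance inflation factors: diagonal of the inverse correlation matrix."""
    inv = invert(corr)
    return [inv[j][j] for j in range(len(corr))]

# Hypothetical correlation matrix: the first two variables are highly
# correlated, the third is nearly independent of both.
corr = [
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]
factors = vif(corr)
print([round(f, 2) for f in factors])
```

A variable with a large VIF carries little information that the others do not already provide, which makes it a candidate for removal.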
Classification trees
• To obtain a good predictive model
• Decision support models that can be applied in the identification of fraudsters
III – MATHEMATICAL MODELING
Regression
• Predict a random variable from a set of explanatory variables
• In our case, predict fraud probability using client information
• Well-studied problem with numerous applications (spam filtering, image classification, insurance policies...)
• Simplest idea: linear regression
Logistic regression
• The predicted quantity is a probability: it must respect the constraint of lying between 0 and 1. Linear regression does not.
• Map the result of the linear regression to a probability with the logistic curve
• We calibrate the model parameters on the training set (find the optimal a and b) and test on the validation set.
• Use an optimisation procedure to fit the model
• Other possibilities: add terms to the model (interactions, nonlinearities, etc.)
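The fitting step can be sketched for a single explanatory variable. The form p(x) = 1 / (1 + exp(−(a·x + b))) is an assumption that reuses the parameter names a and b from the slide, whose formulas were lost, and plain gradient ascent stands in for the optimisation procedure, which the slides do not specify. The data are made up.

```python
import math

def sigmoid(t):
    """Numerically stable logistic curve: maps any real t into (0, 1)."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def fit_logistic(xs, ys, lr=0.1, steps=5000):
    """Fit p(x) = sigmoid(a*x + b) by gradient ascent on the
    mean log-likelihood of the 0/1 outcomes ys."""
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        residuals = [y - sigmoid(a * x + b) for x, y in zip(xs, ys)]
        grad_a = sum(r * x for r, x in zip(residuals, xs)) / n
        grad_b = sum(residuals) / n
        a += lr * grad_a
        b += lr * grad_b
    return a, b

# Hypothetical training data: one risk indicator per client, 0/1 fraud flag.
xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [0, 0, 1, 0, 0, 1, 1, 1]
a, b = fit_logistic(xs, ys)
preds = [sigmoid(a * x + b) for x in xs]
print(f"a={a:.3f} b={b:.3f}")
```

Unlike a straight line, every fitted value stays strictly between 0 and 1, so it can be read directly as a fraud probability.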
Model evaluation
In order to validate our model:
• We compare estimated fraud probabilities to measured outcomes on the validation set.
• We use different metrics: ROC curve, lift chart, etc.
• We use the results to establish the optimal control policy.
ROC curve
• X-axis: false positive rate, Y-axis: true positive rate
• For a given allowed false positive rate (1 − specificity), determine the success rate (sensitivity)
• Perfect model: Y = 1; random model: Y = X; bad model: Y = 0
• Goal: get above Y = X
• The area under the curve is a good indication of the quality of the model: the c-value
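As an illustration of how the curve and its area are obtained, the sketch below sweeps a threshold down the sorted scores and integrates the resulting points with the trapezoidal rule. The scores and labels are invented, and the function names are not taken from the original work.

```python
def roc_points(scores, labels):
    """ROC points (false positive rate, true positive rate),
    one per distinct score threshold, starting at (0, 0)."""
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        j = i
        # Clients with tied scores cross the threshold together.
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            tp += ranked[j][1]
            fp += 1 - ranked[j][1]
            j += 1
        points.append((fp / neg, tp / pos))
        i = j
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical validation scores (fraud probabilities) and true outcomes.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 0, 1, 0]
area = auc(roc_points(scores, labels))
print(f"AUC = {area:.2f}")
```

An area of 0.5 corresponds to the random model Y = X, and an area of 1 to the perfect model.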
LIFT CHART
• We sort the clients by decreasing fraud probability.
• The X-axis represents the percentage α of the population, taken in that order.
• The Y-axis represents a rate, Rate(α), calculated over that first α percent of the population.
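The formula for Rate(α) was lost in extraction. One common definition, used here purely as an assumption, is the fraud rate among the top α share of clients divided by the overall fraud rate, so that a random ordering gives a value near 1. The function name and data are hypothetical.

```python
def lift_at(scores, labels, alpha):
    """Lift at fraction alpha (assumed definition): fraud rate in the top
    alpha share of clients, ranked by decreasing score, divided by the
    overall fraud rate."""
    ranked = [y for _, y in sorted(zip(scores, labels), key=lambda t: -t[0])]
    k = max(1, int(round(alpha * len(ranked))))
    top_rate = sum(ranked[:k]) / k
    base_rate = sum(ranked) / len(ranked)
    return top_rate / base_rate

# Hypothetical validation scores (fraud probabilities) and true outcomes.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 0, 1, 0]
for alpha in (0.25, 0.5, 1.0):
    print(f"alpha={alpha:.2f} lift={lift_at(scores, labels, alpha):.2f}")
```

Under this definition the curve always ends at 1 when α covers the whole population, and a good model starts well above 1.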
IV – NUMERICAL VALIDATION
Results
• Use only a subset of the available parameters: zone, type of line, past fraud history, past payment history
• Categorise continuous variables
• Final model: 8 categorical variables, with or without pairwise interactions
• Model trained on 40k clients, validated on 40k others.
LIFT-Chart graph and ROC values
Cost-benefit analysis – Best model (figure panels: Train, Validation)
Cost-benefit analysis – Extreme models (figure panels: Best model, Random model)
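The slides do not give the cost model behind these figures. A minimal sketch, assuming a fixed cost per inspection and a fixed amount recovered per detected fraud (both hypothetical), ranks clients by predicted fraud probability and picks the campaign size with the highest expected profit. Python is used here for illustration; the original optimisation was done in Matlab.

```python
def best_campaign(probabilities, gain_per_fraud, cost_per_check):
    """Return (n_checks, expected_profit) maximising expected profit when
    clients are inspected in order of decreasing fraud probability.

    Expected profit of inspecting one client = p * gain_per_fraud - cost_per_check
    (assumed cost model, not taken from the slides).
    """
    ranked = sorted(probabilities, reverse=True)
    best_n, best_profit, running = 0, 0.0, 0.0
    for n, p in enumerate(ranked, start=1):
        running += p * gain_per_fraud - cost_per_check
        if running > best_profit:
            best_n, best_profit = n, running
    return best_n, best_profit

# Hypothetical fraud probabilities from a fitted model, and made-up amounts.
probs = [0.05, 0.60, 0.10, 0.30, 0.02, 0.45]
n_checks, profit = best_campaign(probs, gain_per_fraud=100.0, cost_per_check=20.0)
print(f"inspect the top {n_checks} clients, expected profit {profit:.1f}")
```

Under this model an inspection is worthwhile only while the client's fraud probability exceeds the ratio of cost to recovered amount, which is why a better ranking translates directly into a more profitable campaign.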
V – CONCLUSIONS AND PERSPECTIVES
Summary
• Data analysis to isolate interesting variables
• Logistic regression to predict fraud probabilities
• Evaluation of the model (ROC, lift)
• Use in a cost-benefit analysis
• Concrete results of use to the client
Future Work
• Automatic data analysis (binary trees, principal component analysis, etc.)
• Carefully choose the variables to include
• Other types of prediction (neural networks, decision trees, etc.)
• More complex optimization processes
THANK YOU