Data Analysis for Credit Card Fraud Detection
Alejandro Correa Bahnsen
Luxembourg University

Introduction: Europe fraud evolution
[Figure: fraud in Internet transactions (millions of euros) by year, 2007 to 2012; 2011 and 2012 are marked "E".]

Introduction: US fraud evolution
[Figure: online revenue lost due to fraud (billions of dollars) by year, 2001 to 2012.]

Simplified transaction flow
[Diagram: a transaction passing through the Network, ending in the question "Fraud?"]

Agenda
• Introduction
• Database
• Evaluation of algorithms
• Logistic Regression
• Financial measure
• Cost Sensitive Logistic Regression

Database
• Large European card processing company
• 2012 card present transactions
• 750,000 transactions
• 3,500 frauds
• 0.467% fraud rate
• 148,562 EUR lost due to fraud on the test dataset
[Figure: monthly timeline, Jan to Dec, with Train at the Jan end and Test at the Dec end.]

Database
• Raw attributes

| TRXID | Client ID | Date | Amount | Location | Type | Merchant Group | Fraud |
| 1 | 1 | 2/1/12 6:00 | 580 | Lux | Internet | Airlines | No |
| 2 | 1 | 2/1/12 6:15 | 120 | Lux | Present | Car Renting | No |
| 3 | 2 | 2/1/12 8:20 | 12 | Bel | Present | Hotel | Yes |
| 4 | 1 | 3/1/12 4:15 | 60 | Lux | ATM | | No |
| 5 | 2 | 3/1/12 9:18 | 8 | Fra | Present | Retail | No |
| 6 | 1 | 3/1/12 9:55 | 1210 | Lux | Internet | Airlines | Yes |

• Other attributes: age, country of residence, postal code, type of card

Database
• Derived attributes

| ID | Num CC | Date | Amt | Location | Type | Merchant Group | Fraud | No. of Trx – same client – last 6 hours | Sum – same client – last 7 days |
| 1 | 1 | 2/1/12 6:00 | 580 | Lux | Internet | Airlines | No | 0 | 0 |
| 2 | 1 | 2/1/12 6:15 | 120 | Lux | Present | Car Renting | No | 1 | 580 |
| 3 | 2 | 2/1/12 8:20 | 12 | Bel | Present | Hotel | Yes | 0 | 0 |
| 4 | 1 | 3/1/12 4:15 | 60 | Lux | ATM | | No | 0 | 700 |
| 5 | 2 | 3/1/12 9:18 | 8 | Fra | Present | Retail | No | 0 | 12 |
| 6 | 1 | 3/1/12 9:55 | 1210 | Lux | Internet | Airlines | Yes | 1 | 760 |

• Derived attributes are built by combining the following criteria:
  By: Client, Credit Card
  Group: None, Transaction Type, Merchant, Merchant Category, Merchant Group 1, Merchant Group 2, Merchant Country
  Last: hour, day, week, month, 3 months
  Function: Count, Sum(Amount), Avg(Amount)
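
The two derived columns in the example can be reproduced with a few lines of plain Python. This is only a sketch under stated assumptions: dates are read as day/month/year, a transaction is never counted in its own window, and the name `derive_features` is hypothetical rather than anything from the presentation.

```python
from datetime import datetime, timedelta

# Raw example rows: (trx_id, client_id, timestamp, amount).
# Assumption: timestamps are day/month/year, as the 7-day sums suggest.
RAW = [
    (1, 1, "2/1/12 6:00", 580),
    (2, 1, "2/1/12 6:15", 120),
    (3, 2, "2/1/12 8:20", 12),
    (4, 1, "3/1/12 4:15", 60),
    (5, 2, "3/1/12 9:18", 8),
    (6, 1, "3/1/12 9:55", 1210),
]


def derive_features(rows, count_window=timedelta(hours=6),
                    sum_window=timedelta(days=7)):
    """Return {trx_id: (n_trx_in_count_window, sum_amount_in_sum_window)}.

    Both aggregates are per client and only look at transactions that
    happened strictly before the current one.
    """
    parsed = [(t, c, datetime.strptime(ts, "%d/%m/%y %H:%M"), a)
              for t, c, ts, a in rows]
    features = {}
    for trx_id, client, when, _amount in parsed:
        # Earlier transactions of the same client: (timestamp, amount).
        earlier = [(w, a) for _t, c, w, a in parsed
                   if c == client and w < when]
        n_recent = sum(1 for w, _a in earlier if when - w <= count_window)
        s_recent = sum(a for w, a in earlier if when - w <= sum_window)
        features[trx_id] = (n_recent, s_recent)
    return features


FEATURES = derive_features(RAW)
```

Other combinations from the criteria list (a different grouping key, window or aggregate function) would follow the same pattern with a different filter and reduction.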

Evaluation
• Confusion matrix
    TP  FP
    FN  TN
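
As an illustration, the four cells of the confusion matrix and the usual measures built from them (recall, precision, F1-Score, misclassification rate) can be computed as below. The function names and the toy labels are hypothetical; label 1 is assumed to mean fraud.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN, TN for binary labels (1 = fraud, 0 = legitimate)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def evaluation_measures(tp, fp, fn, tn):
    """Standard measures derived from the confusion matrix."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    misclassification = (fp + fn) / total if total else 0.0
    return {"precision": precision, "recall": recall,
            "f1": f1, "misclassification": misclassification}


# Toy data, invented for the example: 3 frauds among 10 transactions.
Y_TRUE = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
Y_PRED = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
COUNTS = confusion_counts(Y_TRUE, Y_PRED)
MEASURES = evaluation_measures(*COUNTS)
```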

Logistic Regression
• Model
• Cost Function
• Cost Matrix
    0  1
    1  0
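
Assuming the slide refers to the textbook form of logistic regression, the model and its cost function can be sketched as follows. The sigmoid hypothesis and the average log-loss are the standard definitions, not formulas taken from the deck; the names `predict_proba` and `log_loss` are chosen here for illustration. This loss treats both kinds of error symmetrically, which is what the 0/1 cost matrix expresses.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(theta, x):
    """Estimated probability of fraud: sigmoid of the linear score theta.x."""
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))


def log_loss(theta, X, y, eps=1e-12):
    """Average negative log-likelihood over the examples in (X, y)."""
    total = 0.0
    for xi, yi in zip(X, y):
        # Clip the probability so log() never sees exactly 0 or 1.
        h = min(max(predict_proba(theta, xi), eps), 1.0 - eps)
        total -= yi * math.log(h) + (1 - yi) * math.log(1.0 - h)
    return total / len(y)
```

With all coefficients at zero every transaction gets probability 0.5, so the loss equals ln 2 regardless of the data; training lowers it from there.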

Logistic Regression: under-sampling procedure
• Fraud rates of the training sets: 0.467%, 1%, 5%, 10%, 20%, 50%
• Select all the frauds and a random sample of the legitimate transactions.
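
A minimal sketch of this under-sampling step, assuming labels of 1 for fraud and 0 for legitimate. The function name `under_sample`, the fixed seed and the toy label vector are illustrative choices, not part of the original procedure.

```python
import random


def under_sample(labels, target_fraud_rate, seed=0):
    """Return sorted indices of an under-sampled training set.

    Keeps every fraud and draws just enough legitimate transactions
    at random so that frauds make up about `target_fraud_rate`.
    """
    frauds = [i for i, y in enumerate(labels) if y == 1]
    legit = [i for i, y in enumerate(labels) if y == 0]
    n_legit = round(len(frauds) * (1 - target_fraud_rate) / target_fraud_rate)
    n_legit = min(n_legit, len(legit))  # cannot sample more than exist
    rng = random.Random(seed)
    return sorted(frauds + rng.sample(legit, n_legit))


# Toy data: 35 frauds in 7,500 transactions (about a 0.467% fraud rate).
LABELS = [1] * 35 + [0] * 7465
BALANCED = under_sample(LABELS, 0.50)
TEN_PERCENT = under_sample(LABELS, 0.10)
```

At a 50% target the sample holds as many legitimate transactions as frauds, so with roughly 3,500 frauds it would contain at most about 7,000 of the 750,000 transactions.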

Logistic Regression Results
[Figure: recall, precision, misclassification and F1-Score (axis 0% to 70%) for No Model and for models trained on All, 1%, 5%, 10%, 20% and 50% sets.]

Financial evaluation
• Motivation
  • False positives carry a different cost than false negatives
  • Frauds range from a few euros to thousands of euros (dollars, pounds, etc.)
There is a need for a real comparison measure.

Financial evaluation
• Cost matrix
    Ca   Ca
    Amt  0
  where:
    Ca = administrative costs
    Amt = amount of transaction i
• Evaluation measure
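
One way to turn the cost matrix into an evaluation measure is to sum the cell costs over all transactions. The sketch below assumes the matrix is laid out like the confusion matrix (TP and FP cost Ca, FN costs the transaction amount, TN costs nothing); the function name `total_cost`, the toy data and the value of Ca are hypothetical.

```python
def total_cost(y_true, y_pred, amounts, admin_cost):
    """Total financial cost of a set of predictions.

    Assumed cost matrix: TP -> admin_cost, FP -> admin_cost,
    FN -> amount of the missed fraud, TN -> 0.
    """
    cost = 0.0
    for y, c, amt in zip(y_true, y_pred, amounts):
        if c == 1:
            cost += admin_cost  # flagged transaction: TP or FP
        elif y == 1:
            cost += amt         # missed fraud: FN
        # unflagged legitimate transaction (TN) adds nothing
    return cost


# Toy data: two frauds (580 and 1210) and two legitimate transactions.
Y_TRUE = [1, 1, 0, 0]
AMOUNTS = [580, 1210, 120, 60]
CA = 5  # illustrative administrative cost, not a figure from the deck

COST_MODEL = total_cost(Y_TRUE, [1, 0, 1, 0], AMOUNTS, CA)
COST_NO_MODEL = total_cost(Y_TRUE, [0, 0, 0, 0], AMOUNTS, CA)
```

Flagging nothing costs the sum of all fraud amounts, which is consistent with the "No Model" cost equalling the EUR lost to fraud on the test dataset.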

Logistic Regression Results
[Figure: cost (EUR, axis 0 to 160,000) together with recall, precision and F1-Score (axis 0% to 70%) for No Model, All, 1%, 5%, 10%, 20% and 50%. Cost labels shown in the chart: €148,562; €148,196; €142,510; €112,103; €79,838; €65,870; €46,530. Annotations mark the model selected by F1-Score and the model selected by cost.]

Logistic Regression
• The best model selected using the traditional F1-Score does not give the best results in terms of cost
• The model selected by cost is trained using less than 1% of the database, meaning a lot of information is excluded
• The algorithm is trained to minimize (approximately) the misclassification, but is then evaluated based on cost
• Why not train the algorithm to minimize the cost instead?

Cost Sensitive Logistic Regression
• Cost Matrix
    Ca   Ca
    Amt  0
• Cost Function
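
A possible cost-sensitive objective, given here only as a sketch: each cell of the cost matrix is weighted by the predicted probability of landing in it, which yields an average expected cost that can be minimized over the coefficients. This construction is an assumption for illustration and is not claimed to be the exact formula on the slide; `cost_sensitive_loss` and its arguments are hypothetical names.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def cost_sensitive_loss(theta, X, y, amounts, admin_cost):
    """Average expected cost under the predicted fraud probabilities.

    For a fraud (y = 1): h * Ca + (1 - h) * Amt  (TP and FN cells).
    For a legitimate transaction (y = 0): h * Ca (FP cell; TN costs 0).
    """
    total = 0.0
    for xi, yi, amt in zip(X, y, amounts):
        h = sigmoid(sum(t * v for t, v in zip(theta, xi)))
        if yi == 1:
            total += h * admin_cost + (1.0 - h) * amt
        else:
            total += h * admin_cost
    return total / len(y)
```

Unlike the log-loss, this objective penalizes a missed fraud in proportion to its amount, so large frauds dominate the fit.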

Cost Sensitive Logistic Regression Results
[Figure: cost (EUR, axis 0 to 160,000) together with recall, precision and F1-Score (axis 0% to 100%) for No Model, All, 1%, 5%, 10%, 20% and 50%. Cost labels shown in the chart: €148,562; €85,724; €73,772; €66,245; €67,264; €37,785; €31,174.]

Cost Sensitive Logistic Regression Results
[Figure: cost (EUR, axis 0 to 160,000) together with recall, precision and F1-Score (axis 0% to 80%) for three cases.]
• No Model: €148,562
• Logistic Regression: €46,530
• Cost Sensitive Logistic Regression: €31,174
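
The three cost figures can be restated as savings relative to using no model. The short calculation below uses only the numbers from the chart; the helper name `savings` is hypothetical.

```python
NO_MODEL_COST = 148_562  # EUR, cost when no transaction is flagged
MODEL_COSTS = {
    "Logistic Regression": 46_530,
    "Cost Sensitive Logistic Regression": 31_174,
}


def savings(cost, baseline=NO_MODEL_COST):
    """Fraction of the baseline cost that the model avoids."""
    return (baseline - cost) / baseline


SAVINGS = {name: savings(cost) for name, cost in MODEL_COSTS.items()}
```

This works out to roughly 69% saved with logistic regression and roughly 79% with the cost-sensitive variant.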

Conclusion
• Selecting models based on traditional statistics does not give the best results in terms of cost
• Models should be evaluated taking into account the real financial costs of the application
• Algorithms should be developed to incorporate those financial costs

Thank you!

Contact information
Alejandro Correa Bahnsen
University of Luxembourg
al.bahnsen@gmail.com
http://www.linkedin.com/in/albahnsen
http://www.slideshare.net/albahnsen