Data Analysis for Credit Card Fraud Detection Alejandro Correa Bahnsen Luxembourg University
Introduction • Europe fraud evolution: Internet transactions (millions of euros) [Chart: y-axis € 500 to € 800; years 2007 to 2012, with 2011 and 2012 marked E]
Introduction • US fraud evolution: online revenue lost due to fraud (billions of dollars) [Chart: y-axis up to $5.0; years 2001 to 2012]
Simplified transaction flow [Diagram: transaction flow through the network, with a fraud check ("Fraud?")]
Agenda • Introduction • Database • Evaluation of algorithms: Logistic Regression, Financial measure, Cost Sensitive Logistic Regression
Database • Large European card processing company • 2012 card present transactions • 750,000 transactions • 3,500 frauds • 0.467% fraud rate • 148,562 EUR lost due to fraud on the test dataset [Figure: timeline of the months Jan to Dec, divided into Train and Test sets]
Database • Raw attributes
TRXID | Client ID | Date        | Amount | Location | Type     | Merchant Group | Fraud
1     | 1         | 2/1/12 6:00 | 580    | Lux      | Internet | Airlines       | No
2     | 1         | 2/1/12 6:15 | 120    | Lux      | Present  | Car Renting    | No
3     | 2         | 2/1/12 8:20 | 12     | Bel      | Present  | Hotel          | Yes
4     | 1         | 3/1/12 4:15 | 60     | Lux      | ATM      |                | No
5     | 2         | 3/1/12 9:18 | 8      | Fra      | Present  | Retail         | No
6     | 1         | 3/1/12 9:55 | 1210   | Lux      | Internet | Airlines       | Yes
• Other attributes: age, country of residence, postal code, type of card
Database • Derived attributes
ID | Num CC | Date        | Amt  | Location | Type     | Merchant Group | Fraud | No. of Trx, same client, last 6 hours | Sum, same client, last 7 days
1  | 1      | 2/1/12 6:00 | 580  | Lux      | Internet | Airlines       | No    | 0                                     | 0
2  | 1      | 2/1/12 6:15 | 120  | Lux      | Present  | Car Renting    | No    | 1                                     | 580
3  | 2      | 2/1/12 8:20 | 12   | Bel      | Present  | Hotel          | Yes   | 0                                     | 0
4  | 1      | 3/1/12 4:15 | 60   | Lux      | ATM      |                | No    | 0                                     | 700
5  | 2      | 3/1/12 9:18 | 8    | Fra      | Present  | Retail         | No    | 0                                     | 12
6  | 1      | 3/1/12 9:55 | 1210 | Lux      | Internet | Airlines       | Yes   | 1                                     | 760
• Derived attributes are built by combining the following criteria:
By: Client, Credit Card
Group: None, Transaction Type, Merchant, Merchant Category, Merchant Group 1, Merchant Group 2, Merchant Country
Last: hour, day, week, month, 3 months
Function: Count, Sum(Amount), Avg(Amount)
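The two derived attributes in the example (number of transactions by the same client in the last 6 hours, and sum of amounts by the same client in the last 7 days) can be sketched in plain Python. This is only an illustration, not the author's implementation: the function name `derive_features` and the tuple layout are hypothetical, and the dates are read as day/month/year (2 and 3 January 2012), which is an assumption that happens to reproduce the values in the table.

```python
from datetime import datetime, timedelta

# Raw example transactions: (trx_id, client_id, timestamp, amount).
# Dates are assumed to be day/month/year, i.e. 2 and 3 January 2012.
transactions = [
    (1, 1, datetime(2012, 1, 2, 6, 0), 580),
    (2, 1, datetime(2012, 1, 2, 6, 15), 120),
    (3, 2, datetime(2012, 1, 2, 8, 20), 12),
    (4, 1, datetime(2012, 1, 3, 4, 15), 60),
    (5, 2, datetime(2012, 1, 3, 9, 18), 8),
    (6, 1, datetime(2012, 1, 3, 9, 55), 1210),
]


def derive_features(trxs, count_window=timedelta(hours=6),
                    sum_window=timedelta(days=7)):
    """Return {trx_id: (count in count_window, amount sum in sum_window)}.

    Only strictly earlier transactions of the same client are considered,
    so the current transaction never contributes to its own features.
    """
    features = {}
    for trx_id, client, ts, _amount in trxs:
        earlier = [(t, a) for (_, c, t, a) in trxs if c == client and t < ts]
        n_recent = sum(1 for t, _ in earlier if ts - t <= count_window)
        total = sum(a for t, a in earlier if ts - t <= sum_window)
        features[trx_id] = (n_recent, total)
    return features


features = derive_features(transactions)
```

Other combinations of the criteria (a different grouping, time window or aggregation function) would follow the same pattern with a different filter and aggregate.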
Evaluation • Confusion matrix
                    | Actual fraud | Actual legitimate
Predicted fraud     | TP           | FP
Predicted legitimate| FN           | TN
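The statistics reported later in the deck (recall, precision, F1-score, misclassification) all follow from the four confusion-matrix counts. A minimal sketch using the standard definitions; the function name and the counts in the usage line are made up for illustration and are not results from the study.

```python
def classification_metrics(tp, fp, fn, tn):
    """Standard metrics derived from confusion-matrix counts."""
    total = tp + fp + fn + tn
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    misclassification = (fp + fn) / total if total else 0.0
    return {
        "recall": recall,
        "precision": precision,
        "f1": f1,
        "misclassification": misclassification,
    }


# Illustrative counts only (not taken from the dataset).
metrics = classification_metrics(tp=30, fp=10, fn=20, tn=940)
```

With a fraud rate below 1%, the misclassification rate stays low even for a model that flags nothing, which is one reason recall, precision and F1-score are reported alongside it.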
Logistic Regression • Model • Cost Function • Cost Matrix (0 for a correct prediction, 1 for either type of error):
0 | 1
1 | 0
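The model and cost function formulas of this slide are not present in the text. As a reference point, the textbook form of logistic regression can be sketched as follows; the names `theta`, `predict_proba` and `log_loss_cost` are assumptions, and this is the standard formulation rather than a copy of the slide.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(theta, x):
    """Estimated probability that x is a fraud: sigmoid(theta . x)."""
    return sigmoid(sum(t * v for t, v in zip(theta, x)))


def log_loss_cost(theta, X, y):
    """Average negative log-likelihood (standard logistic cost).

    Every error is penalised the same way regardless of the transaction
    amount, matching a 0/1 cost matrix.
    """
    eps = 1e-12  # keeps log() away from 0
    total = 0.0
    for x_i, y_i in zip(X, y):
        h = min(max(predict_proba(theta, x_i), eps), 1.0 - eps)
        total += -y_i * math.log(h) - (1 - y_i) * math.log(1.0 - h)
    return total / len(y)


# With all-zero parameters every prediction is 0.5, so the cost is ln(2).
baseline = log_loss_cost([0.0, 0.0], [[1.0, 2.0], [1.0, 3.0]], [1, 0])
```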
Logistic Regression • Under-sampling procedure: select all the frauds and a random sample of the legitimate transactions • Fraud rates of the resulting training sets: 0.467% (all transactions), 1%, 5%, 10%, 20%, 50%
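The under-sampling step (keep every fraud, draw legitimate transactions at random until a target fraud rate is reached) might look like the sketch below. The function name, the `seed` parameter and the toy labels are hypothetical; the deck does not say how the sample was actually drawn.

```python
import random


def undersample(labels, target_fraud_rate, seed=0):
    """Return sorted indices of a sample with roughly the target fraud rate.

    All frauds (label 1) are kept; legitimate transactions (label 0)
    are sampled without replacement.
    """
    if not 0.0 < target_fraud_rate <= 1.0:
        raise ValueError("target_fraud_rate must be in (0, 1]")
    fraud_idx = [i for i, y in enumerate(labels) if y == 1]
    legit_idx = [i for i, y in enumerate(labels) if y == 0]
    # frauds / (frauds + legit) = rate  =>  legit = frauds * (1 - rate) / rate
    n_legit = round(len(fraud_idx) * (1.0 - target_fraud_rate) / target_fraud_rate)
    n_legit = min(n_legit, len(legit_idx))  # cannot draw more than exist
    rng = random.Random(seed)
    sampled_legit = rng.sample(legit_idx, n_legit)
    return sorted(fraud_idx + sampled_legit)


# Toy data: 10 frauds among 1,000 transactions (1% fraud rate).
toy_labels = [1] * 10 + [0] * 990
balanced = undersample(toy_labels, 0.5)
ten_percent = undersample(toy_labels, 0.1)
```

Note how small the 50% sample is: it contains only as many legitimate transactions as there are frauds, so most of the data is discarded.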
Logistic Regression • Results [Chart: recall, precision, misclassification and F1-score (y-axis 0% to 70%) for No Model, All, 1%, 5%, 10%, 20% and 50%]
Financial evaluation • Motivation • False positives carry a different cost than false negatives • Frauds range from a few to thousands of euros (dollars, pounds, etc.) • There is a need for a real comparison measure
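Financial evaluation • Cost matrix
                    | Actual fraud | Actual legitimate
Predicted fraud     | Ca           | Ca
Predicted legitimate| Amt_i        | 0
where Ca = administrative costs and Amt_i = amount of transaction i • Evaluation measure: total cost, the sum over all transactions of the corresponding cost-matrix entry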
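Reading the cost matrix as Ca for every transaction predicted as fraud, Amt_i for every missed fraud and 0 otherwise, the financial measure can be sketched as a simple sum. The function name and the value of `ca` below are hypothetical (the deck does not state the administrative cost); the amounts reuse the first four example transactions.

```python
def total_cost(y_true, y_pred, amounts, ca):
    """Total financial cost of a set of predictions.

    y_true / y_pred: 1 = fraud, 0 = legitimate.
    amounts: transaction amounts (Amt_i).
    ca: administrative cost per transaction flagged as fraud.
    """
    cost = 0.0
    for y, c, amt in zip(y_true, y_pred, amounts):
        if c == 1:
            cost += ca   # true positive or false positive
        elif y == 1:
            cost += amt  # false negative: the fraud amount is lost
        # true negative: no cost
    return cost


y_true = [1, 0, 1, 0]
amounts = [580, 120, 12, 60]

# "No model": nothing is flagged, so every fraud amount is lost.
no_model = total_cost(y_true, [0, 0, 0, 0], amounts, ca=5)
# A model that catches the first fraud and raises one false alarm.
with_model = total_cost(y_true, [1, 1, 0, 0], amounts, ca=5)
```

Under this reading, the "No Model" cost equals the total fraud amount, which is consistent with the 148,562 EUR reported for the test dataset.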
Logistic Regression • Results [Chart: cost (right axis, € 0 to € 160 000) with recall, precision and F1-score (left axis, 0% to 70%) for No Model, All, 1%, 5%, 10%, 20% and 50%. Cost labels on the chart: € 148 562 (No Model), € 148 196, € 142 510, € 112 103, € 79 838, € 65 870, € 46 530. Annotations: "Selecting the algorithm by F1-Score" and "Selecting by Cost"]
Logistic Regression • The best model selected using the traditional F1-Score does not give the best results in terms of cost • The model selected by cost is trained using less than 1% of the database, meaning a lot of information is excluded • The algorithm is trained to (approximately) minimize the misclassification but is then evaluated based on cost • Why not train the algorithm to minimize the cost instead?
Cost Sensitive Logistic Regression • Cost Matrix
                    | Actual fraud | Actual legitimate
Predicted fraud     | Ca           | Ca
Predicted legitimate| Amt_i        | 0
• Cost Function
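The cost function formula of this slide is not present in the text. One plausible form, assumed here rather than taken from the slide, weights each cost-matrix entry by the predicted probability of landing in that cell: a fraud costs h * Ca + (1 - h) * Amt_i and a legitimate transaction costs h * Ca, where h is the predicted fraud probability. All names and numeric values below are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def cost_sensitive_cost(theta, X, y, amounts, ca):
    """Average expected financial cost under the Ca / Amt_i cost matrix.

    This is an assumed formulation: each example contributes the
    cost-matrix entries weighted by the predicted probability h.
    """
    total = 0.0
    for x_i, y_i, amt_i in zip(X, y, amounts):
        h = sigmoid(sum(t * v for t, v in zip(theta, x_i)))
        if y_i == 1:
            total += h * ca + (1.0 - h) * amt_i  # caught vs. missed fraud
        else:
            total += h * ca                      # false alarm vs. no cost
    return total / len(y)


# With all-zero parameters h = 0.5 for every transaction:
# fraud: 0.5 * 2 + 0.5 * 100 = 51, legitimate: 0.5 * 2 = 1, average = 26.
example = cost_sensitive_cost(
    theta=[0.0, 0.0],
    X=[[1.0, 2.0], [1.0, 3.0]],
    y=[1, 0],
    amounts=[100.0, 50.0],
    ca=2.0,
)
```

Unlike the standard logistic cost, missing a large fraud is penalised more than missing a small one, so minimizing this function pushes the model toward the frauds that matter most financially.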
Cost Sensitive Logistic Regression • Results [Chart: cost (right axis, € 0 to € 160 000) with recall, precision and F1-score (left axis, 0% to 100%) for No Model, All, 1%, 5%, 10%, 20% and 50%. Cost labels on the chart: € 148 562 (No Model), € 85 724, € 73 772, € 66 245, € 67 264, € 37 785, € 31 174]
Cost Sensitive Logistic Regression • Results [Chart: cost, recall, precision and F1-score compared across models. Cost: No Model € 148 562, Logistic Regression € 46 530, Cost Sensitive Logistic Regression € 31 174]
Conclusion • Selecting models based on traditional statistics does not give the best results in terms of cost • Models should be evaluated taking into account real financial costs of the application • Algorithms should be developed to incorporate those financial costs
Thank you!
Contact information • Alejandro Correa Bahnsen, University of Luxembourg • al.bahnsen@gmail.com • http://www.linkedin.com/in/albahnsen • http://www.slideshare.net/albahnsen