Identifying Suspicious URLs: An Application of Large-Scale Online Learning
Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker
Computer Science & Engineering, UC San Diego
Presentation for ICML 2009, June 15, 2009

Detecting Malicious Web Sites
URL = Uniform Resource Locator
Example URLs:
    http://www.cs.mcgill.ca/~icml2009/abstracts.html
    http://www.bfuduuioo1fp.mobi/ws/ebayisapi.dll
    http://fblight.com
    http://mail.ru
• Safe URL?
• Web exploit?
• Spam-advertised site?
• Phishing site?
Predict what is safe without committing to risky actions

Problem in a Nutshell
• URL features to identify malicious Web sites
• Different classes of URLs: benign, spam, phishing, exploits, scams...
• For now, distinguish benign vs. malicious
• Examples: facebook.com, fblight.com

Today's Talk
• Problem
• Approach
    • Learning to detect malicious URLs
    • Challenges: scale and non-stationarity
• Evaluations
    • Need for large, fresh training sets
    • Online learning
• Conclusion

State of the Practice
• Current approaches
    • Blacklists
    • Learning on hand-tuned features
• Limitations
    • Cannot learn from newest examples quickly
    • Cannot quickly adapt to newest features
• Arms race: fast feedback cycle is critical
• More automated approach?

Live URL Classification System
[Diagram labels: Label, Example, Hypothesis]

Live Training Feed
• Malicious URLs (spamming and phishing): 6,000 to 7,500 per day from a Web mail provider
• Benign URLs: from the Yahoo Web directory
• Total of 20,000 URLs per day
• Live collection since Jan. 5, 2009
• Months of data: two million examples after 100 days

Feature vector construction
Example URL: http://www.bfuduuioo1fp.mobi/ws/ebayisapi.dll
• WHOIS registration: 3/25/2009
• Hosted from 208.78.240.0/22
• IP hosted in San Mateo
• Connection speed: T1
• Has DNS PTR record? Yes
• Registrant “Chad”
• ...
[Figure: feature vector at Day 100, marked GROWING. Labels in order: Real-valued, 60+ features; Host-based, 1.8 million; Lexical, 1.1 million]
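The slide shows a URL being turned into a large, growing binary feature vector. A minimal Python sketch of the lexical part follows. The tokenization rule (dots for the hostname, a small delimiter set for the path) and the names `lexical_features` and `FeatureIndex` are assumptions for illustration, not details taken from the talk.

```python
import re
from urllib.parse import urlsplit


def lexical_features(url):
    """Map a URL to a sparse bag-of-words dict: feature name -> 1."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    features = {}
    # Hostname tokens are split on dots.
    for token in host.split("."):
        if token:
            features["host:" + token] = 1
    # Path and query tokens are split on an assumed delimiter set.
    for token in re.split(r"[/?.=\-_&]", parts.path + "?" + parts.query):
        if token:
            features["path:" + token] = 1
    return features


class FeatureIndex:
    """Assigns a new column index the first time a feature name is seen,
    so the vector dimension grows as new URLs arrive."""

    def __init__(self):
        self.index = {}

    def encode(self, features):
        return {self.index.setdefault(name, len(self.index)): value
                for name, value in features.items()}
```

Because unseen tokens simply receive the next free index, the feature space can keep growing without rebuilding earlier vectors.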

Live URL Classification System
[Diagram label: Online learning]

Practical Challenges of ML in Systems
• Industrial concerns
    • Scale: millions of examples, features
    • Non-stationarity: examples change over time (arms race w/ criminals)
• Pivotal decision: batch or online?

Batch vs. Online Learning
• Batch/offline learning
    • SVM, logistic regression, decision trees, etc.
    • Multiple passes over data
    • No incremental updates
    • Potentially high memory and processing overhead
• Online learning
    • Perceptron-style algorithms
    • Single pass over data
    • Incremental updates
    • Low memory and processing overhead
• Online learning addresses scale and non-stationarity
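The "single pass, incremental updates" pattern can be sketched as a predict-then-update loop in Python. The function `progressive_error` and the toy `MajorityLearner` are hypothetical names used only to show the shape of the loop; any learner exposing `predict` and `update` could be plugged in.

```python
def progressive_error(stream, learner):
    """Single pass: predict each label before seeing it, then update.

    Returns the fraction of examples that were mispredicted.
    """
    mistakes = 0
    seen = 0
    for x, y in stream:
        if learner.predict(x) != y:
            mistakes += 1
        learner.update(x, y)  # incremental update, one example at a time
        seen += 1
    return mistakes / seen if seen else 0.0


class MajorityLearner:
    """Toy stand-in learner: predicts the label seen most often so far."""

    def __init__(self):
        self.balance = 0  # (+1 labels seen) minus (-1 labels seen)

    def predict(self, x):
        return 1 if self.balance >= 0 else -1

    def update(self, x, y):
        self.balance += y
```

Each example is touched once and then discarded, so memory use depends on the model size rather than on the number of examples seen.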

Evaluations
Online learning for URL reputation
• Need for large, fresh training sets
• Comparing online algorithms

Need lots of fresh training data?
[Plot legend: SVM trained once; SVM retrained daily]
• Fresh data helps

Need lots of fresh training data?
[Plot legend: SVM trained once on 2 weeks; SVM w/ 2-week sliding window]
• Fresh data helps
• More data helps

Which online algorithm?
• Perceptron
• Stochastic Gradient Descent for Logistic Regression
• Confidence-Weighted Learning

Perceptron [Rosenblatt, 1958]
• Convergence result: number of mistakes ≤ (radius / margin)^2
• Update on each mistake: [equation not preserved]
[Figure: positive (+) and negative (−) examples]
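The update equation for this slide is not preserved in the text. A minimal Python sketch of the standard mistake-driven perceptron rule (w ← w + y·x on a mistake) is shown here, assuming labels in {-1, +1} and sparse feature vectors stored as dicts. The class name `SparsePerceptron` and its method names are illustrative, not taken from the talk.

```python
class SparsePerceptron:
    """Perceptron over sparse feature dicts (feature index -> value)."""

    def __init__(self):
        self.w = {}  # weights default to 0 for unseen features

    def score(self, x):
        return sum(self.w.get(i, 0.0) * v for i, v in x.items())

    def predict(self, x):
        return 1 if self.score(x) >= 0 else -1

    def update(self, x, y):
        """Apply w <- w + y * x only when the current prediction is wrong.

        Returns True if a mistake occurred and the weights changed.
        """
        if self.predict(x) == y:
            return False
        for i, v in x.items():
            self.w[i] = self.w.get(i, 0.0) + y * v
        return True
```

Every mistake moves the weights by the same fixed amount regardless of how wrong the score was.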

Logistic Regression with SGD [Bottou, 1998]
• Log likelihood: [equation not preserved]
• For every example: [update equation not preserved]
• Proportional update
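The equations on this slide did not survive in the text. As a sketch, the standard form of the log likelihood and its stochastic gradient step is given below. The notation is an assumption: labels $y_t \in \{-1, +1\}$, logistic function $\sigma$, and a constant learning rate $\gamma$.

```latex
\begin{align}
L(\mathbf{w}) &= \sum_{t} \log \sigma\bigl(y_t\,(\mathbf{w}\cdot\mathbf{x}_t)\bigr),
  \qquad \sigma(z) = \frac{1}{1 + e^{-z}} \\
\mathbf{w}_{t+1} &= \mathbf{w}_t + \gamma\,\Delta_t\,\mathbf{x}_t,
  \qquad \text{where } \Delta_t = \frac{y_t + 1}{2} - \sigma(\mathbf{w}_t\cdot\mathbf{x}_t)
\end{align}
```

Under this form the step size scales with $\Delta_t$, the gap between the true label (mapped to 0 or 1) and the predicted probability, which is one reading of the slide's "Proportional" note: confident mistakes move the weights more than near-misses.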

Confidence-Weighted Learning [Dredze et al., 2008] [Crammer et al., 2009]
• Maintain Gaussian distribution over weight vector: [equation not preserved]
• Constrained problem: [equation not preserved]
• Closed-form update: [equation not preserved]
• Treat features differently
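The equations on this slide did not survive in the text. A hedged sketch of the general shape of confidence-weighted learning follows. The symbols ($\boldsymbol{\mu}$, $\Sigma$, the confidence level $\eta$, the step size $\alpha_t$) are assumed notation, and the closed-form expression for $\alpha_t$ and the covariance update are left out because they differ between the two cited papers.

```latex
\begin{align}
\mathbf{w} &\sim \mathcal{N}(\boldsymbol{\mu}, \Sigma) \\
(\boldsymbol{\mu}_{t+1}, \Sigma_{t+1}) &= \arg\min_{\boldsymbol{\mu},\,\Sigma}
  D_{\mathrm{KL}}\bigl(\mathcal{N}(\boldsymbol{\mu}, \Sigma)\,\|\,\mathcal{N}(\boldsymbol{\mu}_t, \Sigma_t)\bigr)
  \quad \text{s.t.} \quad
  \Pr_{\mathbf{w}\sim\mathcal{N}(\boldsymbol{\mu},\Sigma)}\bigl[y_t\,(\mathbf{w}\cdot\mathbf{x}_t) \ge 0\bigr] \ge \eta \\
\boldsymbol{\mu}_{t+1} &= \boldsymbol{\mu}_t + \alpha_t\,y_t\,\Sigma_t\,\mathbf{x}_t
\end{align}
```

With a diagonal $\Sigma_t$, each coordinate of the mean moves in proportion to that feature's own variance, so rarely seen features (high variance) receive larger updates than well-established ones. This is one way to read "Treat features differently".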

Which online algorithms?
[Plot legend: Perceptron]

Which online algorithms?
[Plot legend: Perceptron; LR w/ SGD]
• Proportional update helps

Which online algorithms?
[Plot legend: Perceptron; LR w/ SGD; Confidence-Weighted]
• Proportional update helps
• Per-feature confidence really helps

Batch...
[Plot label: Batch]
• Fresh data helps
• More data helps

Batch vs. Online
[Plot labels: Batch; Confidence-Weighted]
• Fresh data helps
• More data helps
• Online matches batch

Conclusion
• Detecting malicious URLs
    • Relevant real-world problem
    • Successful application of online learning
• Confidence-Weighted vs. Batch
    • As accurate
    • More adaptive
    • Fewer resources
• Future work
    • Scaling up for deployment