Identifying Suspicious URLs: An Application of Large-Scale Online Learning
Identifying Suspicious URLs: An Application of Large-Scale Online Learning. Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker. Computer Science & Engineering, UC San Diego. Presentation for ICML 2009, June 15, 2009.
Detecting Malicious Web Sites. URL = Uniform Resource Locator. Example URLs: http://www.cs.mcgill.ca/~icml2009/abstracts.html, http://www.bfuduuioo1fp.mobi/ws/ebayisapi.dll, http://fblight.com, http://mail.ru. For each one: Safe URL? Web exploit? Spam-advertised site? Phishing site? Goal: predict what is safe without committing to risky actions.
Problem in a Nutshell. Use URL features to identify malicious Web sites. Different classes of URLs: benign, spam, phishing, exploits, scams... For now, distinguish benign vs. malicious (e.g., facebook.com vs. fblight.com).
Today's Talk. Problem. Approach: learning to detect malicious URLs; challenges: scale and non-stationarity. Evaluations: need for large, fresh training sets; online learning. Conclusion.
State of the Practice. Current approaches: blacklists; learning on hand-tuned features. Limitations: cannot learn from newest examples quickly; cannot quickly adapt to newest features. Arms race: fast feedback cycle is critical. More automated approach?
Live URL Classification System. [Diagram; labels: Label, Example, Hypothesis]
Live Training Feed. Malicious URLs (spamming and phishing): 6,000 to 7,500 per day from a Web mail provider. Benign URLs: from the Yahoo Web directory. Total of 20,000 URLs per day. Live collection since Jan. 5, 2009. Months of data: two million examples after 100 days.
Feature vector construction. Example: http://www.bfuduuioo1fp.mobi/ws/ebayisapi.dll. WHOIS registration: 3/25/2009. Hosted from 208.78.240.0/22. IP hosted in San Mateo. Connection speed: T1. Has DNS PTR record? Yes. Registrant "Chad". ... Resulting feature vector: real-valued features (60+), binary host-based features (1.8 million), binary lexical features (1.1 million). Counts are as of day 100 and GROWING.
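A minimal sketch of how lexical features for such a vector might be built. The slide does not spell out the tokenization, so the token scheme, the feature names (`host_token=`, `path_token=`, `url_length`, `host_dots`), and the `FeatureIndex` helper below are illustrative assumptions, not the authors' pipeline. Host-based features (WHOIS, IP prefix, geolocation) would need external lookups and are omitted.

```python
import re
from urllib.parse import urlparse


def lexical_features(url):
    """Return a sparse feature dict for one URL (feature names are hypothetical)."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    feats = {}
    # Bag-of-words over hostname tokens (split on dots).
    for tok in host.split("."):
        if tok:
            feats["host_token=" + tok] = 1.0
    # Bag-of-words over path tokens (split on common delimiters).
    for tok in re.split(r"[/?.=&_-]+", parsed.path):
        if tok:
            feats["path_token=" + tok] = 1.0
    # Two simple real-valued features.
    feats["url_length"] = float(len(url))
    feats["host_dots"] = float(host.count("."))
    return feats


class FeatureIndex:
    """Maps feature names to integer indices, growing as new names appear."""

    def __init__(self):
        self.index = {}

    def vectorize(self, feats):
        # setdefault assigns the next free index to a name seen for the first time.
        return {self.index.setdefault(name, len(self.index)): value
                for name, value in feats.items()}


index = FeatureIndex()
x1 = index.vectorize(lexical_features("http://www.bfuduuioo1fp.mobi/ws/ebayisapi.dll"))
size_after_first = len(index.index)
x2 = index.vectorize(lexical_features("http://fblight.com"))
size_after_second = len(index.index)
```

Because every unseen token gets a fresh index, the feature space keeps growing as new URLs arrive, which is the behavior the slide flags as GROWING.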
Live URL Classification System. Online learning.
Practical Challenges of ML in Systems. Industrial concerns. Scale: millions of examples and features. Non-stationarity: examples change over time (arms race with criminals). Pivotal decision: batch or online?
Batch vs. Online Learning. Batch/offline learning: SVM, logistic regression, decision trees, etc.; multiple passes over data; no incremental updates; potentially high memory and processing overhead. Online learning: Perceptron-style algorithms; single pass over data; incremental updates; low memory and processing overhead. Online learning addresses scale and non-stationarity.
Evaluations. Online learning for URL reputation. Need for large, fresh training sets. Comparing online algorithms.
Need lots of fresh training data? [Plot: SVM trained once vs. SVM retrained daily] Fresh data helps.
Need lots of fresh training data? [Plot: SVM trained once on 2 weeks vs. SVM with 2-week sliding window] Fresh data helps. More data helps.
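The sliding-window setup can be sketched as a small buffer that keeps only the most recent days of labeled URLs. A batch model would then be retrained each day on whatever the buffer holds. The `SlidingWindow` class, its method names, and the day-numbering scheme are assumptions for illustration; the slides do not describe how the window was implemented.

```python
from collections import deque


class SlidingWindow:
    """Keeps the labeled examples from the most recent `window_days` days."""

    def __init__(self, window_days=14):
        self.window_days = window_days
        self.days = deque()  # entries are (day_number, list_of_examples)

    def add_day(self, day, examples):
        """Append one day of examples and evict days that fell out of the window."""
        self.days.append((day, list(examples)))
        while self.days and self.days[0][0] <= day - self.window_days:
            self.days.popleft()

    def training_set(self):
        """Flatten the retained days into one list for batch retraining."""
        return [example for _, batch in self.days for example in batch]


window = SlidingWindow(window_days=2)
window.add_day(1, ["a", "b"])
window.add_day(2, ["c"])
window.add_day(3, ["d", "e"])
current = window.training_set()  # day 1 has been evicted
```

Memory and retraining cost both scale with the window size, which is one reason the talk goes on to consider online algorithms.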
Which online algorithm? Perceptron. Stochastic Gradient Descent for Logistic Regression. Confidence-Weighted Learning.
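Perceptron [Rosenblatt, 1958]. Convergence result: number of mistakes ≤ $(\text{radius}/\text{margin})^2$. [Diagram: positive and negative examples separated by a margin] Update on each mistake: $\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t + y_t \mathbf{x}_t$.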
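A minimal sketch of the mistake-driven Perceptron update on sparse feature dicts, assuming labels in {-1, +1} (here +1 could stand for malicious). The class and helper names are illustrative, not taken from the talk.

```python
def dot(w, x):
    """Sparse dot product between a weight dict and a feature dict."""
    return sum(w.get(i, 0.0) * v for i, v in x.items())


class Perceptron:
    def __init__(self):
        self.w = {}        # sparse weight vector, grows with new features
        self.mistakes = 0

    def predict(self, x):
        return 1 if dot(self.w, x) > 0 else -1

    def update(self, x, y):
        """Predict first, then add y * x to the weights only on a mistake."""
        y_hat = self.predict(x)
        if y_hat != y:
            self.mistakes += 1
            for i, v in x.items():
                self.w[i] = self.w.get(i, 0.0) + y * v
        return y_hat


# Two toy examples that share feature 2; two passes over the stream.
stream = [({0: 1.0, 2: 1.0}, 1), ({1: 1.0, 2: 1.0}, -1)] * 2
model = Perceptron()
for x, y in stream:
    model.update(x, y)
```

Every mistake changes the weights by the same fixed amount regardless of how wrong the prediction was, which is the contrast with the proportional update used by logistic regression with SGD.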
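Logistic Regression with SGD [Bottou, 1998]. Log likelihood: $L(\mathbf{w}) = \sum_t \log \sigma\big(y_t (\mathbf{w} \cdot \mathbf{x}_t)\big)$. For every example: $\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t + \gamma \Delta_t \mathbf{x}_t$, where $\Delta_t = \frac{y_t + 1}{2} - \sigma(\mathbf{w}_t \cdot \mathbf{x}_t)$. The update is proportional to $\Delta_t$.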
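A sketch of one stochastic gradient step for logistic regression, assuming labels in {-1, +1} and a constant learning rate. The step size is the gap between the 0/1 target, (y + 1) / 2, and the predicted probability, so confident correct predictions barely move the weights. The class name and the default `learning_rate` are illustrative choices.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


class SGDLogistic:
    def __init__(self, learning_rate=0.1):
        self.w = {}
        self.lr = learning_rate

    def prob_positive(self, x):
        return sigmoid(sum(self.w.get(i, 0.0) * v for i, v in x.items()))

    def update(self, x, y):
        """One gradient step on the log likelihood for label y in {-1, +1}."""
        p = self.prob_positive(x)
        delta = (y + 1) / 2.0 - p  # proportional term: target minus prediction
        for i, v in x.items():
            self.w[i] = self.w.get(i, 0.0) + self.lr * delta * v
        return p


model = SGDLogistic()
first_prob = model.update({0: 1.0, 2: 1.0}, 1)  # untrained model predicts 0.5
for _ in range(200):
    model.update({0: 1.0, 2: 1.0}, 1)
    model.update({1: 1.0, 2: 1.0}, -1)
```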
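Confidence-Weighted Learning [Dredze et al., 2008] [Crammer et al., 2009]. Maintain a Gaussian distribution over the weight vector: $\mathbf{w} \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma)$. Constrained problem: $(\boldsymbol{\mu}_{t+1}, \Sigma_{t+1}) = \arg\min_{\boldsymbol{\mu}, \Sigma} D_{\mathrm{KL}}\big(\mathcal{N}(\boldsymbol{\mu}, \Sigma) \,\|\, \mathcal{N}(\boldsymbol{\mu}_t, \Sigma_t)\big)$ subject to $\Pr_{\mathbf{w}}[y_t (\mathbf{w} \cdot \mathbf{x}_t) \ge 0] \ge \eta$. The problem has a closed-form update. Treat features differently.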
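A hedged sketch of a confidence-weighted update with a diagonal covariance. It follows the variance-constraint form (margin ≥ φ × variance, with φ the inverse normal CDF of η), which is one of several published variants; the slides do not say which variant or approximation was used, so the class name, the default `eta`, the initial variance, and the diagonal approximation are all assumptions here.

```python
import math
from statistics import NormalDist


class DiagonalCW:
    def __init__(self, eta=0.9, initial_variance=1.0):
        self.phi = NormalDist().inv_cdf(eta)  # requires eta > 0.5 so phi > 0
        self.init_var = initial_variance
        self.mu = {}     # per-feature mean weight
        self.sigma = {}  # per-feature variance (lower = more confident)

    def margin(self, x):
        return sum(self.mu.get(i, 0.0) * v for i, v in x.items())

    def predict(self, x):
        return 1 if self.margin(x) > 0 else -1

    def update(self, x, y):
        """Update means and variances; returns the step size alpha."""
        phi = self.phi
        m = y * self.margin(x)
        v = sum(self.sigma.get(i, self.init_var) * val * val
                for i, val in x.items())
        if v == 0.0:
            return 0.0
        b = 1.0 + 2.0 * phi * m
        # Positive root of 2*phi*v^2*a^2 + v*b*a + (m - phi*v) = 0, clipped at 0.
        disc = b * b - 8.0 * phi * (m - phi * v)
        alpha = max(0.0, (-b + math.sqrt(disc)) / (4.0 * phi * v))
        if alpha > 0.0:
            for i, val in x.items():
                s = self.sigma.get(i, self.init_var)
                # Low-confidence (high-variance) features move the most.
                self.mu[i] = self.mu.get(i, 0.0) + alpha * y * s * val
                # Seeing a feature shrinks its variance.
                self.sigma[i] = 1.0 / (1.0 / s + 2.0 * alpha * phi * val * val)
        return alpha


model = DiagonalCW()
alpha_first = model.update({0: 1.0, 2: 1.0}, 1)
model.update({1: 1.0, 2: 1.0}, -1)
```

The per-feature variance is what lets the learner treat features differently: rarely seen features keep a high variance and receive large updates, while frequently seen features settle down.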
Which online algorithms? [Plot: Perceptron]
Which online algorithms? [Plot: Perceptron vs. LR with SGD] Proportional update helps.
Which online algorithms? [Plot: Perceptron vs. LR with SGD vs. Confidence-Weighted] Proportional update helps. Per-feature confidence really helps.
Batch... [Plot: batch results] Fresh data helps. More data helps.
Batch vs. Online. [Plot: batch vs. Confidence-Weighted] Fresh data helps. More data helps. Online matches batch.
Conclusion. Detecting malicious URLs: a relevant real-world problem and a successful application of online learning. Confidence-Weighted vs. batch: as accurate, more adaptive, fewer resources. Future work: scaling up for deployment.