Spam Detection Jingrui He 10/08/2007

Spam Types
• Email Spam
  - Unsolicited commercial email
• Blog Spam
  - Unwanted comments in blogs
• Splogs
  - Fake blogs to boost PageRank

From Learning Point of View
• Spam Detection
  - Classification problem (ham vs. spam)
• Feature Extraction
  - A Learning Approach to Spam Detection based on Social Networks. H. Y. Lam and D. Y. Yeung
• Fast Classifier
  - Relaxed Online SVMs for Spam Filtering. D. Sculley, G. M. Wachman

A Learning Approach to Spam Detection based on Social Networks
H. Y. Lam and D. Y. Yeung, CEAS 2007

Problem Statement
• n Email Accounts
  - Sender Set: [...]; Receiver Set: [...]
  - Labeled Sender Set: [...] s.t. [...]
• Goal
  - Assign the remaining accounts with [...] in [...]

System Flow Chart

Social Network from Logs
• Directed Graph
• Directed Edge
  - Email sent from [...] to [...]
• Edge Weight
  - [...] = [...], where [...] is the number of emails sent from [...] to [...]
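
The graph construction on this slide can be sketched as follows. This is a minimal illustration, not the authors' code: the log format (a list of `(sender, recipient)` pairs) and the function name `build_email_graph` are assumptions, and because the slide's edge-weight formula did not survive extraction, the sketch stores the raw email count on each edge so that any weight can be derived from it.

```python
from collections import defaultdict

def build_email_graph(log):
    """Return {sender: {recipient: email_count}} from (sender, recipient) pairs."""
    graph = defaultdict(lambda: defaultdict(int))
    for sender, recipient in log:
        # One directed edge per ordered (sender, recipient) pair;
        # the stored value counts the emails sent along that edge.
        graph[sender][recipient] += 1
    # Convert to plain dicts so later lookups do not create keys by accident.
    return {s: dict(r) for s, r in graph.items()}

# Hypothetical log entries: (sender, recipient)
log = [("a", "b"), ("a", "b"), ("b", "a"), ("c", "a")]
graph = build_email_graph(log)
print(graph)
```

Edges are directed, so `graph["a"]["b"]` and `graph["b"]["a"]` are tracked separately.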

System Flow Chart

Features from Email Social Networks
• In-count / Out-count
  - The sum of in-coming / out-going edge weights
• In-degree / Out-degree
  - The number of email accounts that a node receives emails from / sends emails to
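
A small sketch of these four features, assuming the graph is stored as a nested dict `{sender: {recipient: weight}}` (a hypothetical representation chosen for illustration; the function name is also illustrative):

```python
def count_and_degree(graph, node):
    """Return (in_count, out_count, in_degree, out_degree) for one node.

    graph: {sender: {recipient: edge_weight}}
    """
    out_edges = graph.get(node, {})
    # Collect every edge that points at `node`.
    in_edges = {s: r[node] for s, r in graph.items() if node in r}
    return (
        sum(in_edges.values()),   # in-count: sum of incoming edge weights
        sum(out_edges.values()),  # out-count: sum of outgoing edge weights
        len(in_edges),            # in-degree: distinct accounts that sent to node
        len(out_edges),           # out-degree: distinct accounts node sent to
    )

graph = {"a": {"b": 2, "c": 1}, "b": {"a": 1}, "d": {"a": 4}}
features = count_and_degree(graph, "a")
print(features)
```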

Features from Email Social Networks
• Communication Reciprocity (CR)
  - The percentage of interactive neighbors that a node has
  - Defined in terms of [...]:
    * The set of accounts that sent emails to [...]
    * The set of accounts that received emails from [...]
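
The exact CR formula did not survive extraction. One plausible reading, assumed here, is the fraction of an account's recipients that also sent email back to it. The normalization by the recipient set and the nested-dict graph format are both assumptions of this sketch.

```python
def communication_reciprocity(graph, node):
    """Fraction of the accounts `node` emailed that also emailed `node`.

    graph: {sender: {recipient: weight}}. Returns 0.0 if node sent nothing.
    """
    out_set = set(graph.get(node, {}))                   # accounts that received emails from node
    in_set = {s for s, r in graph.items() if node in r}  # accounts that sent emails to node
    if not out_set:
        return 0.0
    return len(in_set & out_set) / len(out_set)

# Hypothetical data: a spammer gets one reply out of four recipients.
graph = {
    "spammer": {"u1": 1, "u2": 1, "u3": 1, "u4": 1},
    "u1": {"spammer": 1},
    "alice": {"bob": 3},
    "bob": {"alice": 2},
}
cr_spammer = communication_reciprocity(graph, "spammer")
cr_alice = communication_reciprocity(graph, "alice")
```

Under this reading, an account that broadcasts to many recipients and rarely gets a reply has a low CR.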

Features from Email Social Networks
• Communication Interaction Average (CIA)
  - The level of interaction between a sender and each of the corresponding recipients
  - [...]
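
The CIA formula is missing from the extracted slide, so the following is only a guess at its shape: for each recipient, take the ratio of emails received back to emails sent, then average over recipients. The function name, the ratio, and the `counts` format (`{sender: {recipient: number_of_emails}}`) are all assumptions.

```python
def communication_interaction_average(counts, node):
    """Average, over node's recipients, of (emails received back) / (emails sent).

    counts: {sender: {recipient: number_of_emails}}, all counts positive.
    This ratio is an assumed reading of the slide, not a confirmed formula.
    """
    sent = counts.get(node, {})
    if not sent:
        return 0.0
    ratios = [
        counts.get(recipient, {}).get(node, 0) / n_sent
        for recipient, n_sent in sent.items()
    ]
    return sum(ratios) / len(ratios)

# Hypothetical counts.
counts = {"alice": {"bob": 4, "carol": 2}, "bob": {"alice": 2}, "carol": {"alice": 2}}
cia_alice = communication_interaction_average(counts, "alice")
```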

Features from Email Social Networks
• Clustering Coefficient (CC)
  - Friends-of-friends relationship between email accounts
  - Defined in terms of [...]:
    * Number of connections between neighbors of [...]
    * Number of neighbors of [...]
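
As an illustration, the sketch below computes the standard local clustering coefficient on an undirected view of the email graph: links that exist between a node's neighbors, divided by the number of possible neighbor pairs. The slide's own formula is lost, so both the undirected view and the normalization are assumptions.

```python
from itertools import combinations

def clustering_coefficient(graph, node):
    """Local clustering coefficient on the undirected view of the graph.

    graph: {sender: {recipient: weight}}. Two accounts count as neighbors
    if either one emailed the other (an assumption of this sketch).
    """
    adj = {}
    for sender, recipients in graph.items():
        for recipient in recipients:
            adj.setdefault(sender, set()).add(recipient)
            adj.setdefault(recipient, set()).add(sender)
    neighbors = adj.get(node, set()) - {node}
    k = len(neighbors)
    if k < 2:
        return 0.0
    # Count neighbor pairs that are themselves connected.
    links = sum(1 for u, v in combinations(sorted(neighbors), 2) if v in adj.get(u, set()))
    return links / (k * (k - 1) / 2)

# Hypothetical data: "a" has some connected contacts, "spam" has none.
graph = {
    "a": {"b": 1, "c": 1, "d": 1},
    "b": {"c": 1},
    "spam": {"x": 1, "y": 1, "z": 1},
}
cc_a = clustering_coefficient(graph, "a")
cc_spam = clustering_coefficient(graph, "spam")
```

A sender whose recipients never email one another scores 0, which is the friends-of-friends signal the slide refers to.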

System Flow Chart

Preprocessing
• Sender Feature Vector
  - [...]
  - [...]
• Weighted Features
  - Problematic?

System Flow Chart

Assigning Spam Score
• Similarity Weighted k-NN method
  - Gaussian similarity: [...]
  - Similarity weighted mean k-NN scores: [...], where [...] is the set of k nearest neighbors
  - Score scaling: [...]
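
A minimal sketch of a similarity-weighted k-NN score, assuming a Gaussian similarity of the form exp(-d^2 / (2 * sigma^2)) on Euclidean distance and labels coded as 1.0 for spam and 0.0 for legitimate senders. The bandwidth `sigma`, the label coding, and the function name are assumptions; the slide's score-scaling step is omitted because its formula is not recoverable.

```python
import math

def knn_spam_score(x, labeled, k=3, sigma=1.0):
    """Similarity-weighted mean of the scores of the k nearest labeled senders.

    labeled: non-empty list of (feature_vector, score) pairs.
    """
    if not labeled:
        raise ValueError("labeled set must not be empty")

    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    # The k labeled senders closest to x in feature space.
    nearest = sorted(labeled, key=lambda item: sq_dist(x, item[0]))[:k]
    # Gaussian similarity: close neighbors get weights near 1, far ones near 0.
    sims = [math.exp(-sq_dist(x, f) / (2 * sigma ** 2)) for f, _ in nearest]
    total = sum(sims)
    if total == 0.0:
        # All similarities underflowed; fall back to an unweighted mean.
        return sum(s for _, s in nearest) / len(nearest)
    return sum(w * s for w, (_, s) in zip(sims, nearest)) / total

# Hypothetical labeled senders: two legitimate, two spam.
labeled = [((0.0, 0.0), 0.0), ((0.1, 0.0), 0.0), ((5.0, 5.0), 1.0), ((5.1, 5.0), 1.0)]
score_near_ham = knn_spam_score((0.0, 0.05), labeled, k=2)
score_near_spam = knn_spam_score((5.0, 5.05), labeled, k=2)
```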

Experiments
• Enron Dataset: 9150 Senders
• To Get [...]
  - Legitimate Enron senders: email transactions within the Enron email domain
  - 5000 generated spam accounts
  - 120 senders from each class
• Results Averaged over 100 Runs

Number of Nearest Neighbors

Feature Weights (CC)

Feature Weights (CIA)

Feature Weights (CR)

Feature Weights
• In/Out-Count & In/Out-Degree
  - The smaller the better
• Final Weights
  - In/Out-count & In/Out-degree: 1
  - CR: 1
  - CIA: 10
  - CC: 15
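
To make the final weights concrete, the sketch below applies them to a sender's feature vector by simple multiplication. How the weights enter the method (for example, scaling features before distances are computed) is an assumption, as are the feature key names, the feature ordering, and the sample values.

```python
# Final weights as listed on the slide; key names are illustrative.
FINAL_WEIGHTS = {
    "in_count": 1, "out_count": 1, "in_degree": 1, "out_degree": 1,
    "cr": 1, "cia": 10, "cc": 15,
}
FEATURE_ORDER = ("in_count", "out_count", "in_degree", "out_degree", "cr", "cia", "cc")

def weight_features(features, weights=FINAL_WEIGHTS):
    """Scale each feature by its weight and return a list in FEATURE_ORDER."""
    return [features[name] * weights[name] for name in FEATURE_ORDER]

# Hypothetical feature values for one sender.
sender = {"in_count": 3, "out_count": 5, "in_degree": 2, "out_degree": 4,
          "cr": 0.5, "cia": 0.2, "cc": 0.1}
weighted = weight_features(sender)
```

With these weights, differences in CIA and CC dominate any distance computed on the weighted vector.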

Conclusion
• Legitimacy Score
  - No content needed
• Can Be Combined with Content-Based Filters
• More Sophisticated Classifiers
  - SVM, boosting, etc.
• Classifiers Using Combined Feature

Relaxed Online SVMs for Spam Filtering
D. Sculley and G. M. Wachman, SIGIR 2007

Anti-Spam Controversy
• Support Vector Machines (SVMs)
• Academic Researchers
  - Statistically robust
  - State-of-the-art performance
• Practitioners
  - Quadratic in the number of training examples
  - Impractical!
• Solution: Relaxed Online SVMs

Background: SVMs
• Data Set: [...]
• Class Label: 1 for spam; -1 for ham
• Classifier: [...]
• To Find [...] (with a tradeoff parameter and slack variables)
  - Minimize: [...] (maximizing the margin; minimizing the loss function)
  - Constraints: [...]
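
The formulas on this slide were lost in extraction. The surviving labels (tradeoff parameter, slack variable, maximizing the margin, minimizing the loss function) match the standard soft-margin SVM, which is assumed below. The symbol names are the conventional ones and may differ from the slide: classifier f(x) = w · x + b, tradeoff parameter C, slack variables ξ_i, labels y_i in {+1, -1}.

```latex
\min_{w,\,b,\,\xi}\quad
  \underbrace{\tfrac{1}{2}\lVert w\rVert^{2}}_{\text{maximizes the margin}}
  \;+\;
  C\,\underbrace{\sum_{i=1}^{n}\xi_i}_{\text{minimizes the loss}}
\qquad\text{subject to}\qquad
  y_i\,(w\cdot x_i + b)\;\ge\;1-\xi_i,\quad \xi_i\ge 0,\quad i=1,\dots,n
```

A large C puts more emphasis on the loss term relative to the margin term.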

Online SVMs

Tuning the Tradeoff Parameter C
• SpamAssassin data set: 6034 examples
• Large C preferred

Email Spam and SVMs
• TREC 05 P-1: 92189 Messages
• TREC 06 P: 37822 Messages

Blog Comment Spam and SVMs
• Leave One Out Cross Validation
• 50 Blog Posts; 1024 Comments

Splogs and SVMs
• Leave One Out Cross Validation
• 1380 Examples

Computational Cost
• Online SVMs: Quadratic Training Time

Relaxed Online SVMs (ROSVM)
• Objective Function of SVMs: [...]
• Large C Preferred
  - Minimizing training error more important than maximizing the margin
• ROSVM
  - Full margin maximization not necessary
  - Relax this requirement

Three Ways to Relax SVMs (1)
• Only Optimize Over the Recent p Examples
  - Dual form of SVMs: [...]
  - Constraints: [...]
  - [...]: the last value found for [...] when [...]

Three Ways to Relax SVMs (2)
• Only Update on Actual Errors
  - Original online SVMs
    * Update when [...]
  - ROSVM
    * Update when [...]
    * m=0: mistake-driven online SVMs
    * No significant degradation in performance
    * Significantly reduced cost
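
The update conditions themselves were lost in extraction. A natural reading, assumed here, is that the original online SVM updates whenever y · f(x) < 1 and ROSVM updates only when y · f(x) < m, so that m=0 triggers updates on misclassifications alone. The function name and the sample stream are illustrative.

```python
def should_update(y, score, m=1.0):
    """Return True when the example's margin y * f(x) falls below threshold m.

    y is +1 (spam) or -1 (ham); score is the classifier output f(x).
    The thresholds are an assumed reading of the slide.
    """
    return y * score < m

# Hypothetical stream of (label, classifier output) pairs.
stream = [(1, 2.0), (1, 0.4), (-1, -0.7), (-1, 0.3), (1, -0.1)]
updates_original = sum(should_update(y, s, m=1.0) for y, s in stream)
updates_mistake_driven = sum(should_update(y, s, m=0.0) for y, s in stream)
```

On this stream the lower threshold triggers fewer updates, since only the misclassified examples qualify; that reduction is the source of the cost saving.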

Three Ways to Relax SVMs (3)
• Reduce the Number of Iterations in Iterative SVMs
  - SMO: repeated passes over the training set to minimize the objective function
  - Parameter T: the maximum number of iterations
  - T=1: little impact on performance
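
The three relaxations can be combined in one training loop, sketched below. This is not the authors' implementation: the inner solver is plain hinge-loss subgradient descent standing in for SMO, there is no bias term, and the learning rate `lr` and regularization `lam` are hypothetical parameters. Only `p`, `m`, and `T` correspond to quantities named on the slides.

```python
from collections import deque

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def relaxed_online_train(stream, p=100, m=0.8, T=1, lr=0.1, lam=0.01):
    """Process (x, y) pairs in order; y is +1 (spam) or -1 (ham).

    Returns (weights, n_updates). The solver is a stand-in for SMO.
    """
    window = deque(maxlen=p)   # relaxation 1: keep only the most recent p examples
    w = None
    n_updates = 0
    for x, y in stream:
        if w is None:
            w = [0.0] * len(x)
        window.append((x, y))
        if y * dot(w, x) >= m:  # relaxation 2: skip examples with margin >= m
            continue
        n_updates += 1
        for _ in range(T):      # relaxation 3: at most T passes over the window
            for xi, yi in window:
                violated = yi * dot(w, xi) < 1
                # L2 shrinkage, then a hinge-loss step if the margin is violated.
                w = [wj * (1 - lr * lam) for wj in w]
                if violated:
                    w = [wj + lr * yi * xij for wj, xij in zip(w, xi)]
    return w, n_updates

# Hypothetical two-feature stream: feature 0 marks spam, feature 1 marks ham.
stream = [((1.0, 0.0), 1), ((0.0, 1.0), -1)] * 20
weights, n_updates = relaxed_online_train(stream)
```

Once every incoming example clears the margin threshold, the loop stops retraining, so the number of updates stays well below the number of examples seen.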

Testing Reduced Size

Testing Reduced Iterations

Testing Reduced Updates

Online SVMs and ROSVM
• ROSVM: Email Spam; Blog Comment Spam; Splog Data Set