Spam Email Detection Ethan Grefe December 13, 2013
Motivation
• Spam email constantly clutters inboxes
• It is commonly removed using rule-based filters
• Spam messages often share very similar characteristics
• This allows them to be detected using machine learning
  • Naïve Bayes classifiers
  • Support vector machines (SVMs)
SVM Solution
• Used training data from the CSDMC 2010 SPAM corpus
• 4327 labeled emails
  • 2949 non-spam messages (HAM)
  • 1378 spam messages (SPAM)
• Extracted features from the subject and body of each email
• Used the resulting feature vectors to train an SVM classifier in MATLAB
Email Features
• Features were determined by research and observation
• Best results were obtained with the following features
  • Percentage of letters that are capitalized
  • Types of punctuation used
  • Average length of a word
  • Amount of HTML in the email
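The four features above can be sketched in a few lines of Python. This is only an illustration: the slides do not give the exact definitions or the MATLAB code that was used, so the function name `extract_features`, the punctuation set in `PUNCTUATION`, and the tag pattern `HTML_TAG` are all assumptions.

```python
import re
import string

# Hypothetical choices: the slides do not say which punctuation marks were counted
# or how the amount of HTML was measured.
PUNCTUATION = "!?$"
HTML_TAG = re.compile(r"<[^>]+>")


def extract_features(subject, body):
    """Turn one email into a small numeric feature vector (illustrative only)."""
    text = subject + "\n" + body

    # Fraction of alphabetic characters that are upper case.
    letters = [c for c in text if c.isalpha()]
    capital_ratio = sum(c.isupper() for c in letters) / len(letters) if letters else 0.0

    # One count per punctuation mark in PUNCTUATION.
    punctuation_counts = [float(text.count(p)) for p in PUNCTUATION]

    # Average length of whitespace-separated words, with surrounding punctuation stripped.
    words = [w.strip(string.punctuation) for w in text.split()]
    words = [w for w in words if w]
    avg_word_len = sum(len(w) for w in words) / len(words) if words else 0.0

    # Share of all characters that sit inside HTML-like tags.
    tag_chars = sum(len(tag) for tag in HTML_TAG.findall(text))
    html_ratio = tag_chars / len(text)

    return [capital_ratio, *punctuation_counts, avg_word_len, html_ratio]
```

Each email then becomes one fixed-length row, so the full corpus can be stacked into a matrix and passed to whichever SVM implementation is used.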
Classifier Results
• Trained on a random 35% of emails
• Tested the SVM classifier on the remaining 65%
• Trained the SVM using three different kernel functions

Kernel Function   Spam Classification Rate   Ham Classification Rate   Total Classification Rate
RBF               80.06%                     92.33%                    86.20%
Linear            78.69%                     80.66%                    79.67%
Quadratic         82.75%                     84.85%                    83.80%
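The evaluation procedure can be sketched as a random split followed by per-class rates. The slides do not define the total classification rate. The reported totals match the unweighted mean of the spam and ham rates to within rounding, for example (80.06 + 92.33) / 2 = 86.195, so the sketch below assumes that definition. The function names `split_train_test` and `classification_rates` are hypothetical.

```python
import random


def split_train_test(items, train_fraction=0.35, seed=0):
    """Shuffle a copy of items and split it into (train, test)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def classification_rates(true_labels, predicted_labels):
    """Return (spam_rate, ham_rate, mean_rate) as percentages.

    Assumption: the total rate is the unweighted mean of the two per-class rates.
    """
    rates = {}
    for label in ("spam", "ham"):
        # Predictions made for emails whose true class is `label`.
        predictions = [p for t, p in zip(true_labels, predicted_labels) if t == label]
        correct = sum(p == label for p in predictions)
        rates[label] = 100.0 * correct / len(predictions) if predictions else 0.0
    mean_rate = (rates["spam"] + rates["ham"]) / 2
    return rates["spam"], rates["ham"], mean_rate
```

Reporting the two classes separately matters here because the corpus is unbalanced (2949 ham against 1378 spam), so a single overall accuracy could hide poor spam detection.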
Possible Improvements
• Use Naïve Bayes to classify emails using word frequency
• Obtain a wider variety of input features
• Test other types of learning algorithms
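As a rough idea of what the first improvement could look like, here is a minimal multinomial Naïve Bayes sketch over word frequencies with add-one (Laplace) smoothing. It is not part of the presented work: the function names `train_naive_bayes` and `classify`, the whitespace tokenization, and the choice to ignore unseen words are all assumptions made for illustration.

```python
import math
from collections import Counter


def train_naive_bayes(emails):
    """Build a model from an iterable of (text, label) pairs."""
    word_counts = {}        # label -> Counter of word frequencies
    doc_counts = Counter()  # label -> number of training emails
    for text, label in emails:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    vocabulary = set()
    for counts in word_counts.values():
        vocabulary.update(counts)
    return {"word_counts": word_counts, "doc_counts": doc_counts, "vocabulary": vocabulary}


def classify(model, text):
    """Return the label with the highest log-posterior for text."""
    total_docs = sum(model["doc_counts"].values())
    vocab_size = len(model["vocabulary"])
    best_label, best_score = None, -math.inf
    for label, counts in model["word_counts"].items():
        # Log prior: share of training emails with this label.
        score = math.log(model["doc_counts"][label] / total_docs)
        total_words = sum(counts.values())
        for word in text.lower().split():
            if word not in model["vocabulary"]:
                continue  # ignore words never seen in training
            # Add-one smoothing keeps unseen (word, label) pairs from zeroing the score.
            score += math.log((counts[word] + 1) / (total_words + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

A toy usage example:

```python
model = train_naive_bayes([
    ("win free money now", "spam"),
    ("meeting at noon", "ham"),
])
label = classify(model, "free money")
```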