Spam? Not any more!!
Detecting spam emails using neural networks
ECE/CS/ME 539 Project presentation
Submitted by Sivanadyan, Thiagarajan

Importance of the topic
• Spam is unsolicited and unwanted email
• Wastage of bandwidth, storage space and, most of all, the recipient’s time

Goals of the Anti-spam Network
• Reliably block spam mails
• Should not block any non-spam mails, but can allow a few spam mails to slip through
• Adapt to the specific types of messages
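The two goals above correspond to two different error rates, which are worth tracking separately rather than as one classification rate. A minimal sketch, assuming labels of 1 for spam and 0 for non-spam as in the data set; the function name `error_rates` is hypothetical and not part of the project:

```python
def error_rates(y_true, y_pred):
    """Return (fraction of non-spam blocked, fraction of spam let through)."""
    ham = [p for t, p in zip(y_true, y_pred) if t == 0]
    spam = [p for t, p in zip(y_true, y_pred) if t == 1]
    # Non-spam predicted as spam: the error the filter should never make
    blocked_ham = sum(p == 1 for p in ham) / len(ham) if ham else 0.0
    # Spam predicted as non-spam: tolerated in small amounts
    missed_spam = sum(p == 0 for p in spam) / len(spam) if spam else 0.0
    return blocked_ham, missed_spam


# Toy labels and predictions, invented for illustration only
blocked_ham, missed_spam = error_rates([0, 0, 0, 1, 1], [0, 0, 0, 1, 0])
```

In this toy case no non-spam is blocked and one of the two spam mails slips through, which matches the stated priorities.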

Input Features – Data Set
• Original data set: 57 input attributes
• Output attribute: 1 (for spam), 0 (for non-spam)
• Inputs derived from email content
• Attributes indicate the frequency of specific words and characters
• Examples: ‘credit’, ‘free’ (in spam); ‘meeting’, ‘project’ (in non-spam)
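To make the attribute definition concrete, here is a rough sketch of turning one message into word and character frequencies. The tokenization, the percentage scaling, and the character list are assumptions; the slides name only the four example words and do not give the exact definitions.

```python
import re

# Example words taken from the slide; the real data set uses many more.
WORDS = ["credit", "free", "meeting", "project"]
CHARS = ["!", "$"]  # assumed characters, not listed in the slides


def frequency_features(text, words=WORDS, chars=CHARS):
    """Return word and character frequencies as percentages of the message."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    n_tokens = max(len(tokens), 1)  # avoid division by zero on empty text
    n_chars = max(len(text), 1)
    word_part = [100.0 * tokens.count(w) / n_tokens for w in words]
    char_part = [100.0 * text.count(c) / n_chars for c in chars]
    return word_part + char_part


features = frequency_features("Free credit! Get free credit now")
```

Here ‘credit’ and ‘free’ each account for two of the six words, while ‘meeting’ and ‘project’ score zero.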

Preprocess the data
• Choose only the inputs which differ for spam and non-spam mails
• Two reduced data sets are obtained (21 inputs and 9 inputs)
• The data is made zero mean, unit variance (4025 input vectors)
• Split the data into two independent training and testing data sets
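The three preprocessing steps above can be sketched as follows. The selection criterion (gap between class means, scaled by the feature's standard deviation) is an assumption, since the slides do not say how "differ" was measured; the 50/50 split and all function names are likewise illustrative.

```python
import random
from statistics import mean, pstdev


def select_features(X, y, k):
    """Keep the k features whose class means differ most (assumed criterion).

    Assumes both classes are present in y.
    """
    scores = []
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        spam = [row[j] for row, label in zip(X, y) if label == 1]
        ham = [row[j] for row, label in zip(X, y) if label == 0]
        sd = pstdev(col) or 1.0  # constant columns get a neutral scale
        scores.append((abs(mean(spam) - mean(ham)) / sd, j))
    return sorted(j for _, j in sorted(scores, reverse=True)[:k])


def standardize(X):
    """Scale every column to zero mean and unit variance."""
    cols = list(zip(*X))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]
    return [[(v - m) / s for v, m, s in zip(row, mus, sds)] for row in X]


def split(X, y, test_fraction=0.5, seed=0):
    """Shuffle once, then cut into disjoint training and testing sets."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    cut = int(len(idx) * (1 - test_fraction))
    train, test = idx[:cut], idx[cut:]
    return ([X[i] for i in train], [y[i] for i in train],
            [X[i] for i in test], [y[i] for i in test])


# Toy data: only feature 0 separates the two classes
X = [[5.0, 1.0, 0.2], [6.0, 2.0, 0.1], [0.0, 1.0, 0.1], [1.0, 2.0, 0.2]]
y = [1, 1, 0, 0]
kept = select_features(X, y, k=1)
Z = standardize(X)
X_train, y_train, X_test, y_test = split(Z, y)
```

Standardizing before the split follows the order on the slide; computing the mean and variance on the training part only would be the stricter choice.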

MLP Implementation
• Learning by back propagation algorithm
• Using complete data set
  • Poor performance (Classification rate: 63.2%)
  • Classified most of the mails as non-spam
• Using reduced data set (Inputs – 21)
  • Good performance (Classification rate: 93.8%)
  • All the non-spam is detected
  • Optimal MLP Configuration: 20-10-10-10-7
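For readers unfamiliar with back propagation, a minimal multi-layer perceptron trained by stochastic gradient descent might look like the sketch below. It is not the project's implementation: sigmoid units, squared-error loss, the learning rate, and the small 2-3-1 layout are all assumptions chosen to keep the example short, and the toy data is invented.

```python
import math
import random


def sigmoid(z):
    z = max(-60.0, min(60.0, z))  # clamp to keep exp() in range
    return 1.0 / (1.0 + math.exp(-z))


class MLP:
    def __init__(self, sizes, seed=0):
        rng = random.Random(seed)
        # weights[l][j] is the bias followed by the input weights of unit j
        self.weights = [
            [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
             for _ in range(n_out)]
            for n_in, n_out in zip(sizes[:-1], sizes[1:])
        ]

    def forward(self, x):
        """Return the activations of every layer, input included."""
        activations = [list(x)]
        for layer in self.weights:
            prev = activations[-1]
            activations.append([
                sigmoid(w[0] + sum(wi * a for wi, a in zip(w[1:], prev)))
                for w in layer
            ])
        return activations

    def train_step(self, x, target, lr):
        acts = self.forward(x)
        # Output-layer error term for squared error with sigmoid units
        deltas = [(o - t) * o * (1 - o) for o, t in zip(acts[-1], target)]
        for l in range(len(self.weights) - 1, -1, -1):
            layer, prev = self.weights[l], acts[l]
            if l > 0:
                # Propagate the error backwards using the pre-update weights
                prev_deltas = [
                    sum(d * w[i + 1] for d, w in zip(deltas, layer)) * a * (1 - a)
                    for i, a in enumerate(prev)
                ]
            for d, w in zip(deltas, layer):
                w[0] -= lr * d
                for i, a in enumerate(prev):
                    w[i + 1] -= lr * d * a
            if l > 0:
                deltas = prev_deltas

    def predict(self, x):
        return 1 if self.forward(x)[-1][0] >= 0.5 else 0


def classification_rate(net, X, y):
    return sum(net.predict(x) == t for x, t in zip(X, y)) / len(X)


# Toy, well-separated data standing in for standardized email features
X = [[-2.0, -1.0], [-1.0, -2.0], [-1.5, -0.5],
     [1.0, 2.0], [2.0, 1.0], [0.5, 1.5]]
y = [0, 0, 0, 1, 1, 1]
net = MLP([2, 3, 1])
for _ in range(500):
    for x, t in zip(X, y):
        net.train_step(x, [t], lr=0.5)
rate = classification_rate(net, X, y)
```

The `sizes` list plays the role of the MLP configuration on the slide, so deeper layouts can be tried by passing a longer list.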

Cross Validation
• Using reduced data set (Inputs – 9)
  • Good performance (Classification rate: 92.1%)
  • Nearly all the non-spam is detected
  • Optimal MLP Configuration: 20-10-10-8
• Using Cross-Validation
  • Negligible improvement in performance
  • Since all the data is derived from the same source, cross validation offers no advantage
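As a reminder of what cross-validation does mechanically, the sketch below splits the data into k folds and averages the held-out classification rate. The slides do not state the number of folds or the procedure used, so k, the helper names, and the nearest-centroid classifier (a simple stand-in for the MLP) are assumptions.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validate(X, y, k, fit, predict):
    """Average the classification rate over k held-out folds."""
    rates = []
    for fold in k_fold_indices(len(X), k):
        held_out = set(fold)
        train = [i for i in range(len(X)) if i not in held_out]
        model = fit([X[i] for i in train], [y[i] for i in train])
        correct = sum(predict(model, X[i]) == y[i] for i in fold)
        rates.append(correct / len(fold))
    return sum(rates) / k


# Stand-in classifier (not the MLP): assign to the nearest class centroid
def fit_centroids(X, y):
    model = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        model[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return model


def predict_centroid(model, x):
    return min(model,
               key=lambda c: sum((a - b) ** 2 for a, b in zip(x, model[c])))


# Toy, well-separated data invented for illustration
X = [[-2.0, -1.0], [-1.0, -2.0], [-1.5, -0.5], [-1.0, -1.0],
     [1.0, 2.0], [2.0, 1.0], [0.5, 1.5], [1.0, 1.0]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
mean_rate = cross_validate(X, y, k=4, fit=fit_centroids,
                           predict=predict_centroid)
```

Any model can be plugged in through the `fit` and `predict` arguments, including an MLP trained by back propagation.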

Inference of the results
• Larger number of inputs does not necessarily improve the performance
• It is important to remove redundant and irrelevant features
• There is no optimum MLP configuration for all inputs – need to adapt depending on the email content
• A combination of other types of spam filters along with neural networks can be used

Conclusion
• Neural networks are a viable option in spam filtering
• A number of heuristic methods are being increasingly applied in this field
• Need to exploit the differences between spam and ‘good’ emails
• Further opportunities
  • Data sets from different sources need to be used for training
  • Fuzzy logic and combinational algorithms can be used in this application