DATA MINING & MACHINE LEARNING FINAL PROJECT
Group 2
R95922027 李庭閣
R95922034 孔垂玖
R95922081 許守傑
R95942129 鄭力維

Outline
• Experiment setting
• Feature extraction
• Model training
• Hybrid-Model
• Conclusion
• Reference

Experiment setting
• Selected online corpus: Enron
• Removed HTML tags
• Factored out the important headers
• Six folders, enron1 to enron6, containing 13,496 spam mails and 15,045 ham mails in total

Feature Extraction
1. Transmitted Time of the Mail
2. Number of Receivers
3. Existence of Attachment
4. Existence of Images in Mail
5. Existence of Cited URLs in Mail
6. Symbols in Mail Title
7. Mail-body

Transmitted Time of the Mail & Number of Receivers
• Transmitted time: spam shows a non-uniform distribution
• Number of receivers: spam has only a single receiver

Probability of being Spam for Transmitted Time & Receiver Size
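The probability of being spam for a given transmitted time or receiver size can be estimated, for instance, as a relative frequency per bucket. The slides do not say how the authors computed it, so the following is only a minimal sketch under that assumption; the function name and the toy data are hypothetical.

```python
from collections import defaultdict

def spam_probability_by_bucket(samples):
    """Estimate P(spam | bucket) from (bucket, is_spam) pairs.

    A bucket can be an hour of the day or a receiver count.
    """
    totals = defaultdict(int)
    spams = defaultdict(int)
    for bucket, is_spam in samples:
        totals[bucket] += 1
        if is_spam:
            spams[bucket] += 1
    return {bucket: spams[bucket] / totals[bucket] for bucket in totals}

# Made-up toy data: (hour of day, is_spam)
toy_hours = [(3, True), (3, True), (3, False),
             (14, False), (14, False), (14, True)]
hour_probs = spam_probability_by_bucket(toy_hours)
```

The same function applies to receiver size by passing (number of receivers, is_spam) pairs.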

Attachment, Images, and URL

             Spam       Ham
Attachment   0.0307%    7.3712%
Image        0.6816%    0%
URL          30.779%    7.0521%
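The three existence features could be extracted from a raw mail roughly as follows. This is a sketch, not the authors' extractor: it assumes the mail is available as RFC 822 text, treats any part with a filename as an attachment, and detects images and URLs with simple heuristics (`<img` tags, `http(s)://` prefixes) that are assumptions of this example.

```python
import re
from email import message_from_string

# Assumed URL pattern; the slides do not define what counts as a cited URL.
URL_RE = re.compile(r"https?://\S+", re.IGNORECASE)

def existence_features(raw_mail):
    """Return booleans for attachment, image, and URL existence."""
    msg = message_from_string(raw_mail)
    has_attachment = False
    has_image = False
    text_parts = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no content of their own
        if part.get_filename():
            has_attachment = True
        maintype = part.get_content_maintype()
        if maintype == "image":
            has_image = True
        elif maintype == "text":
            payload = part.get_payload()
            if isinstance(payload, str):
                text_parts.append(payload)
    body = "\n".join(text_parts)
    # Inline HTML images also count as images in this sketch.
    has_image = has_image or "<img" in body.lower()
    return {
        "attachment": has_attachment,
        "image": has_image,
        "url": bool(URL_RE.search(body)),
    }

features = existence_features("Subject: hi\n\nVisit http://example.com now")
```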

Symbols in Mail Titles
• Title absence: spam senders add titles now.
• Arabic numerals: almost equal probability (date, ID)
• Non-alphanumeric characters & punctuation marks:

Marks                 Probability of being Spam Mail   Feature Showing Rate   Note
~ ^ | * % [] ! ? =    0.911                            28% in spam            Appear more often in spam
/ ; &                 0.182                            16% in ham             Appear more often in ham
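The title features can be turned into simple boolean indicators. The sketch below assumes the two mark groups from the slide and uses hypothetical feature names; how the authors encoded these features is not stated.

```python
# Mark groups taken from the slide; the feature names are illustrative.
SPAM_LEANING_MARKS = set("~^|*%[]!?=")
HAM_LEANING_MARKS = set("/;&")

def title_symbol_features(title):
    """Return boolean indicators derived from a mail title."""
    chars = set(title)
    return {
        "title_absent": title.strip() == "",
        "has_digit": any(c.isdigit() for c in title),
        "has_spam_mark": bool(chars & SPAM_LEANING_MARKS),
        "has_ham_mark": bool(chars & HAM_LEANING_MARKS),
    }

spam_like = title_symbol_features("FREE!!! 100% off")
ham_like = title_symbol_features("Re: Q3 report; draft")
```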

Mail-body
• Build the internal structure of words
• Use TreeTagger, an NLP tool, for word stemming
• From the stemmed words that appear in each mail, build a sparse-format vector that represents the "semantics" of the mail
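A sparse-format vector for a mail can be sketched as a mapping from word index to occurrence count. The example assumes the words have already been stemmed (the TreeTagger step is not reproduced here), and the helper names are hypothetical.

```python
from collections import Counter

def build_vocabulary(stemmed_mails):
    """Assign each distinct stemmed word an integer index, in order of first appearance."""
    vocab = {}
    for words in stemmed_mails:
        for word in words:
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab

def to_sparse_vector(words, vocab):
    """Map a list of stemmed words to {word_index: count}; unknown words are skipped."""
    return dict(Counter(vocab[w] for w in words if w in vocab))

# Toy input: each mail is a list of already-stemmed words.
mails = [["buy", "cheap", "buy"], ["meet", "monday"]]
vocab = build_vocabulary(mails)
vector = to_sparse_vector(mails[0], vocab)
```

Only non-zero entries are stored, which keeps the representation small when the vocabulary is large and each mail uses few words.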

Naïve Bayes
Given a bag of words (x1, x2, x3, …, xn), Naïve Bayes treats the words as conditionally independent given the class, which makes it powerful for document classification.
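A bag-of-words Naïve Bayes classifier can be sketched in a few lines. The variant below is a multinomial model with add-one (Laplace) smoothing, which is an assumption of this example; the slides do not say which variant or smoothing the authors used.

```python
import math
from collections import Counter

def train_naive_bayes(mails, labels):
    """Count mails per class and word occurrences per class."""
    class_counts = Counter(labels)
    word_counts = {label: Counter() for label in class_counts}
    for words, label in zip(mails, labels):
        word_counts[label].update(words)
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocab

def predict_naive_bayes(model, words):
    """Return the class with the highest log posterior (add-one smoothing)."""
    class_counts, word_counts, vocab = model
    total_mails = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_mails in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_mails / total_mails)  # log prior
        for w in words:
            if w in vocab:  # words never seen in training are ignored
                score += math.log(
                    (word_counts[label][w] + 1) / (total_words + len(vocab))
                )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Made-up toy training set.
train_mails = [["cheap", "pills"], ["cheap", "offer"],
               ["meeting", "monday"], ["project", "meeting"]]
train_labels = ["spam", "spam", "ham", "ham"]
model = train_naive_bayes(train_mails, train_labels)
```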

Vector Space Model
Create a word-document (mail) matrix with SRILM. For every pair of mails (columns), a similarity value can be calculated.
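The slide does not name the similarity measure. Cosine similarity is a common choice for the vector space model and is assumed in the sketch below, which operates on sparse {word_index: count} columns rather than on SRILM output.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two sparse vectors stored as {index: value}."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    dot = sum(value * v.get(index, 0.0) for index, value in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty mail is treated as dissimilar to everything
    return dot / (norm_u * norm_v)

same = cosine_similarity({0: 1, 1: 1}, {0: 2, 1: 2})
disjoint = cosine_similarity({0: 1}, {1: 1})
```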

KNN (Vector Space Model)
With K = 1, the KNN classification model shows the best accuracy.
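A KNN classifier over sparse mail vectors might look like the following sketch. Cosine similarity as the neighbor criterion is an assumption, as are the function names and toy vectors; with `k=1` the mail simply takes the label of its most similar training mail.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between sparse vectors stored as {index: value}."""
    dot = sum(value * v.get(index, 0.0) for index, value in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def knn_classify(query, train_vectors, train_labels, k=1):
    """Label a query by majority vote among its k most similar training mails."""
    ranked = sorted(
        range(len(train_vectors)),
        key=lambda i: cosine(query, train_vectors[i]),
        reverse=True,
    )
    votes = Counter(train_labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

# Made-up toy training vectors.
train_vectors = [{0: 2, 1: 1}, {2: 1, 3: 1}]
train_labels = ["spam", "ham"]
```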

Maximum Entropy
• Maximize the entropy and minimize the Kullback-Leibler distance between the model and the real distribution.
• The elements of the word-document matrix are converted to binary values {0, 1}.
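For reference, the standard conditional maximum entropy model has the exponential form below. The notation is an assumption of this note, not taken from the slides: $x$ is a mail, $y \in \{\text{spam}, \text{ham}\}$, $f_i(x, y)$ are binary feature functions (for example, 1 if word $i$ appears in $x$ and the label is $y$), and $\lambda_i$ are the learned weights.

```latex
p_\lambda(y \mid x) = \frac{1}{Z_\lambda(x)} \exp\Big(\sum_i \lambda_i f_i(x, y)\Big),
\qquad
Z_\lambda(x) = \sum_{y'} \exp\Big(\sum_i \lambda_i f_i(x, y')\Big)
```

Fitting $\lambda$ by maximum likelihood is equivalent to minimizing the Kullback-Leibler divergence from the empirical distribution to the model, which ties the two goals on the slide together.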

SVM
• Binary: use a binary value {0, 1} to represent whether a word appears or not.
• Normalized: count the occurrences of each word and divide each count by that word's maximum occurrence count.
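The two SVM input encodings can be sketched on sparse count vectors as follows. The example assumes that "maximum occurrence count" means the largest count of a word in any single mail of the corpus; the slide leaves this open, and the function names are hypothetical.

```python
def binary_features(counts):
    """Binary encoding: 1 if the word appears in the mail (absent words stay implicit 0)."""
    return {index: 1 for index, count in counts.items() if count > 0}

def max_counts_per_word(all_counts):
    """Largest count of each word in any single mail (assumed meaning of 'maximum')."""
    maxima = {}
    for counts in all_counts:
        for index, count in counts.items():
            maxima[index] = max(maxima.get(index, 0), count)
    return maxima

def normalized_features(counts, maxima):
    """Normalized encoding: each count divided by that word's maximum count."""
    return {
        index: count / maxima[index]
        for index, count in counts.items()
        if maxima.get(index, 0) > 0
    }

# Made-up toy corpus of sparse count vectors.
corpus = [{0: 2, 1: 1}, {0: 4}]
maxima = max_counts_per_word(corpus)
binary = binary_features(corpus[0])
normalized = normalized_features(corpus[0], maxima)
```

Both encodings keep every feature value in [0, 1], so no single frequent word dominates the kernel computation.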

Single-Layer Perceptron Hybrid Model
The accuracy of the NN-based Hybrid Model is always the highest.
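One way to build such a hybrid is to feed the 0/1 decisions of the base classifiers into a single-layer perceptron that learns how much to trust each of them. The sketch below assumes that setup; the inputs, learning rate, epoch count, and toy data are all illustrative and not taken from the slides.

```python
def train_perceptron(inputs, targets, epochs=20, lr=1.0):
    """Train a single-layer perceptron with the classic error-driven update."""
    n = len(inputs[0])
    weights = [0.0] * n
    bias = 0.0
    for _ in range(epochs):
        for x, target in zip(inputs, targets):
            activation = sum(w * xi for w, xi in zip(weights, x)) + bias
            output = 1 if activation > 0 else 0
            error = target - output
            for i in range(n):
                weights[i] += lr * error * x[i]
            bias += lr * error
    return weights, bias

def perceptron_predict(weights, bias, x):
    """Return 1 (spam) if the weighted sum of base decisions is positive, else 0 (ham)."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0

# Toy data: each row holds the 0/1 decisions of three base classifiers.
# In this made-up set the first classifier is always right.
base_outputs = [(1, 0, 0), (0, 1, 1), (1, 1, 0), (0, 0, 1), (1, 1, 1), (0, 0, 0)]
targets = [1, 0, 1, 0, 1, 0]
weights, bias = train_perceptron(base_outputs, targets)
```

On this toy set the perceptron learns to follow the first classifier and to ignore the other two, which a plain majority vote cannot do.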

Committee-based Hybrid-model
• The voting model averages the classification results, which slightly improves the filter.
• However, voting can sometimes reduce accuracy when the majority misjudges.
1. KNN + Naïve Bayes + Maximum Entropy
2. Naïve Bayes + Maximum Entropy + SVM
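Committee voting can be sketched as averaging the members' 0/1 decisions and thresholding the result. The 0.5 threshold and the function name are assumptions of this example.

```python
def committee_vote(predictions, threshold=0.5):
    """Average the members' 0/1 decisions and label as spam (1) above the threshold."""
    average = sum(predictions) / len(predictions)
    return 1 if average > threshold else 0

# Two of three members say spam, so the committee says spam.
agree = committee_vote([1, 1, 0])
# If the true label were spam, two mistaken members would outvote the correct one.
outvoted = committee_vote([0, 1, 0])
```

The second call illustrates the failure mode noted on the slide: a correct minority is overruled when the majority misjudges.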

Conclusion
• 7 features are shown to discriminate mail types.
  • Transmitted Time & Receiver Size
  • Attachment, Image, and URL
  • Non-alphanumeric Character & Punctuation Marks
• 5 popular machine learning methods prove suitable for spam filtering.
  • Naïve Bayes, KNN, SVM
• 2 model combination approaches are tested.
  • Committee-based & Single Neural Network

Reference
[1] M. Sahami, S. Dumais, D. Heckerman, and E. Horvitz, "A Bayesian Approach to Filtering Junk E-Mail," in Proc. AAAI 1998, Jul. 1998.
[2] A plan for spam: http://www.paulgraham.com/spam.html
[3] Enron Corpus: http://www.aueb.gr/users/ion/
[4] Treetagger: http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/DecisionTreeTagger.html
[5] Maximum Entropy: http://homepages.inf.ed.ac.uk/s0450736/maxent_toolkit.html
[6] SRILM: http://www.speech.sri.com/projects/srilm/
[7] SVM: http://svmlight.joachims.org/