CPE/CSC 481: Knowledge-Based Systems Dr. Franz J. Kurfess Computer Science Department Cal Poly © 2003 Franz J. Kurfess Spam Filtering 1
Course Overview
- Introduction
- Knowledge Representation
  - Semantic Nets, Frames, Logic
- Reasoning and Inference
  - Predicate Logic, Inference Methods, Resolution
- Reasoning with Uncertainty
  - Probability, Bayesian Decision Making
- Expert System Design
  - ES Life Cycle
- CLIPS Overview
  - Concepts, Notation, Usage
- Pattern Matching
  - Variables, Functions, Expressions, Constraints
- Expert System Implementation
  - Salience, Rete Algorithm
- Expert System Examples
- Conclusions and Outlook
Overview Spam Filtering
- Motivation
- Objectives
- Chapter Introduction
  - Spam
  - Terminology
- Dealing with Spam
  - Laws and Regulations
  - Filtering via Keywords
  - Filtering via Rules
  - Learning
- Spam and Bayes
  - Binary Classification of Documents
  - N-ary Classification
- Implementation
  - SpamBayes Project
  - Related Projects
- Important Concepts and Terms
- Summary
Logistics
- Introductions
- Course Materials
  - textbooks (see below)
  - lecture notes
    - PowerPoint slides will be available on my Web page
  - handouts
  - Web page
    - http://www.csc.calpoly.edu/~fkurfess
- Term Project
- Lab and Homework Assignments
- Exams
- Grading
Bridge-In
Pre-Test
Motivation
- dealing with spam “manually” is very time-consuming, tedious, and prone to errors
- various methods have been tried to “filter” spam, with varying success
- early results with Bayesian approaches look very promising
Objectives
- be familiar with the terminology
  - spam
  - Bayesian approaches
- to understand
  - elementary methods for handling spam automatically
  - more advanced methods
  - scenarios and applications for those methods
  - important characteristics
    - differences between methods, advantages, disadvantages, performance, typical scenarios
- to evaluate the suitability of approaches for specific tasks
  - binary classification
  - n-ary classification
- to be able to apply Bayesian filtering to spam
Spam
- broadly:
  - any email that is not wanted by the recipient
  - similar to paper “junk” mail
  - easily recognized by recipients
- unsolicited bulk email
  - not requested by the recipients
  - automatically sent out to a large number of recipients
- “optional” characteristics
  - disguised or forged sender, return addresses and email forwarding information
  - questionable contents
    - illegal, unethical, fraudulent, ...
  - hidden activities
    - acknowledgement of receipt, spyware (“Web bugs”), virus
Terminology
- spam terms
  - spam: negative (bad stuff)
  - ham: positive (good stuff)
- filtering terms
  - false negative
    - spam incorrectly classified as ham
    - spam “gets through”
  - false positive
    - ham incorrectly classified as spam
    - valid messages are blocked
  - corpus
    - body of documents (email messages)
  - hapax, hapax legomenon
    - unique word in a specific message
  - sample or training set
    - messages used to train the system
  - test set
    - messages used to evaluate the system
(http://spambayes.sourceforge.net/)
Filtering Spam
- Keywords
- Rules
- Learning
Keywords
- identify keywords that frequently occur in spam
- simple and efficient
  - all incoming messages are checked for the occurrence of these keywords
  - if a message contains any or several of them, it is blocked
  - the list of keywords can be modified easily
- not very accurate
  - many false positives
    - legitimate messages that happen to include “forbidden” words
  - many false negatives
    - can be easily circumvented
- used in some early email filtering and Web blocking tools
  - little to moderate success
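The keyword check described on this slide can be sketched in a few lines of Python. This is only an illustration: the keyword list, the function names `keyword_hits` and `is_blocked`, and the `min_hits` parameter are assumptions made for the example and do not come from any of the tools mentioned in the slides.

```python
import re

# Hypothetical keyword list; a real filter would maintain its own.
SPAM_KEYWORDS = {"viagra", "lottery", "winner"}

def keyword_hits(message: str, keywords=SPAM_KEYWORDS) -> set:
    """Return the listed keywords that occur as whole words in the message."""
    words = set(re.findall(r"[a-z0-9']+", message.lower()))
    return words & set(keywords)

def is_blocked(message: str, min_hits: int = 1) -> bool:
    """Block the message if it contains at least min_hits listed keywords."""
    return len(keyword_hits(message)) >= min_hits
```

The sketch also shows both weaknesses named above. `is_blocked("I was the winner of the chess tournament")` blocks a legitimate message (a false positive), while `is_blocked("V1agra at low prices")` lets an obfuscated spelling through (a false negative).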
Rules
- characteristics of spam messages are described through if...then rules
- not too complicated, moderately efficient
- characteristics can be combined
  - not only keywords
  - also formatting, headers
- more accurate
  - fewer false positives
    - allows a better description of spam messages
  - fewer false negatives
    - somewhat more difficult to circumvent
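One possible way to express such if...then rules is as a list of conditions, each of which adds points to a message score when it fires. The sketch below assumes a message is a plain dictionary with `from`, `subject`, and `body` fields; the rule names, point values, and threshold are all invented for illustration and are not taken from any particular filter.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass
class Rule:
    name: str
    condition: Callable[[Dict[str, str]], bool]  # the "if" part
    points: float                                # the "then" part: add to the score

# Hypothetical rules that combine keywords, formatting, and header checks.
RULES: List[Rule] = [
    Rule("subject_all_caps", lambda m: m["subject"].isupper(), 2.0),
    Rule("many_exclamations", lambda m: m["subject"].count("!") >= 3, 1.5),
    Rule("html_red_font", lambda m: "ff0000" in m["body"].lower(), 1.0),
    Rule("missing_sender", lambda m: not m["from"].strip(), 2.5),
]

def rule_score(message: Dict[str, str], rules=RULES) -> Tuple[float, List[str]]:
    """Return the total points and the names of the rules that fired."""
    fired = [r for r in rules if r.condition(message)]
    return sum(r.points for r in fired), [r.name for r in fired]

def is_spam(message: Dict[str, str], threshold: float = 3.0) -> bool:
    """Classify as spam when the combined rule score reaches the threshold."""
    score, _ = rule_score(message)
    return score >= threshold
```

Because several weak indicators must add up before the threshold is reached, a single capitalized subject line no longer blocks a message on its own, which is one reason a rule set can produce fewer false positives than a bare keyword list.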
Learning
- samples of good (ham) and bad (spam) messages are given to the system before it is deployed
- the system analyses various criteria, and tries to determine which criteria are most valuable for the distinction
- used earlier for general email categorization
  - assignment of messages to folders
  - suggestion of actions to be performed (e.g. reply, delete, forward)
  - spam was not a problem at that time
Spam and Bayes
- Binary Classification of Documents
  - two bins: spam, ham
  - sometimes an implicit “undecided” bin is used
- N-ary Classification
  - uses n bins
  - “sure spam”, “probably spam”, “maybe spam”, “unclear”, “maybe ham”, “probably ham”, “sure ham”
- Related Approaches
  - neural networks instead of Bayesian filtering
  - essentially also uses statistical techniques
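The seven bins listed above could be realized by cutting a numeric score into ranges. The sketch below assumes the filter produces a score between 0 (ham) and 1 (spam); the boundary values are hypothetical and would need tuning on real mail.

```python
from bisect import bisect_right

# Bin labels from the slide, ordered from lowest to highest spam score.
BINS = ["sure ham", "probably ham", "maybe ham", "unclear",
        "maybe spam", "probably spam", "sure spam"]

# Hypothetical boundaries between neighbouring bins (one fewer than bins).
BOUNDARIES = [0.05, 0.2, 0.4, 0.6, 0.8, 0.95]

def bin_for(score: float) -> str:
    """Map a score in [0, 1] to one of the n bins."""
    return BINS[bisect_right(BOUNDARIES, score)]
```

With these boundaries, `bin_for(0.5)` returns `"unclear"`; collapsing everything below and above a single cut-off would give the binary case again.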
Binary Classification of Documents
- documents are parsed, and tokens extracted
  - pieces of the message that may serve as classification criteria
  - determined by the developer
- the number of occurrences for each token is calculated
  - done for two corpora: one ham, one spam
  - results in two tables with occurrences of tokens in ham and spam
- a third table is created that reflects, for each token, the probability of a message containing it being ham or spam
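The three tables can be sketched as follows. The probability formula here is a deliberately simplified frequency ratio chosen for illustration, not the exact formula of any cited system; the function names and the clamping bounds of 0.01 and 0.99 are assumptions of this sketch. Each corpus is assumed to be a list of messages that have already been split into tokens.

```python
from collections import Counter

def count_tokens(corpus):
    """Count token occurrences over a corpus given as a list of token lists."""
    counts = Counter()
    for tokens in corpus:
        counts.update(tokens)
    return counts

def token_spam_probabilities(ham_corpus, spam_corpus):
    """Build the third table: token -> estimated probability of spam."""
    ham_counts = count_tokens(ham_corpus)     # table 1: occurrences in ham
    spam_counts = count_tokens(spam_corpus)   # table 2: occurrences in spam
    n_ham = max(len(ham_corpus), 1)
    n_spam = max(len(spam_corpus), 1)
    probs = {}                                # table 3
    for token in set(ham_counts) | set(spam_counts):
        # Occurrences per message, capped at 1 so repeated tokens do not dominate.
        ham_rate = min(1.0, ham_counts[token] / n_ham)
        spam_rate = min(1.0, spam_counts[token] / n_spam)
        p = spam_rate / (ham_rate + spam_rate)
        # Clamp so that no single token is treated as absolute proof.
        probs[token] = min(0.99, max(0.01, p))
    return probs
```

A token seen only in spam ends up at 0.99, a token seen only in ham at 0.01, and a token that is equally frequent in both corpora at 0.5, which carries no evidence either way.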
Calculation of Probabilities
- Tokenizer
- Scoring
- Training
- Testing
(http://spambayes.sourceforge.net/)
Tokenizer
- breaks up a mail message into a series of tokens
  - usually words or word stems
  - sometimes complete phrases
  - may consider non-textual elements
    - message headers, HTML constructs, images, comments
- it can be difficult to identify meaningful tokens
  - message body tokens
  - embedded URLs
  - message headers
  - correlation between different types of clues
(http://spambayes.sourceforge.net/)
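A minimal tokenizer along these lines might look as follows. It uses Python's standard `email` package to separate headers from the body and marks header and URL tokens with a `Header*token` prefix, the notation that appears in the token probability table later in these slides. The regular expressions, the choice of headers, and the function name `tokenize` are assumptions of this sketch; it is not the SpamBayes tokenizer, and it ignores multipart messages.

```python
import re
from email import message_from_string

# Hypothetical token definition: letters, digits, and a few punctuation marks.
TOKEN_RE = re.compile(r"[A-Za-z0-9$'!-]+")
URL_RE = re.compile(r"https?://[^\s\"'<>]+")

def tokenize(raw_message: str):
    """Split a raw mail message into header, URL, and body tokens."""
    msg = message_from_string(raw_message)
    tokens = []
    # Header tokens are prefixed so "free" in the subject differs from "free" in the body.
    for header in ("Subject", "From", "To"):
        value = msg.get(header, "")
        tokens += [f"{header}*{t}" for t in TOKEN_RE.findall(value)]
    body = msg.get_payload()
    if not isinstance(body, str):  # multipart message: not handled in this sketch
        body = ""
    # Tokens inside embedded URLs get their own prefix.
    for url in URL_RE.findall(body):
        tokens += [f"Url*{t}" for t in TOKEN_RE.findall(url)]
    # Remaining body text, with the URLs removed, yields plain tokens.
    tokens += TOKEN_RE.findall(URL_RE.sub(" ", body))
    return tokens
```

Keeping case and exclamation marks means that `FREE`, `free`, and `free!` stay distinct tokens, so each can receive its own probability.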
Scoring
- assigns a number to each message
  - 0: definite ham
  - 1: definite spam
- most difficult and sensitive part of the system
  - incorrect scores
    - false positives
    - false negatives
  - unjustified confidence
    - scores are mostly close to 0 and 1, and rarely in between
- improvements through using two separate probabilities
  - ham probability
  - spam probability
  - allows better treatment of unknown cases as “unsure”
  - substantial reduction of false positives and false negatives
(http://spambayes.sourceforge.net/)
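A common way to turn per-token spam probabilities into a single message score is the naive Bayes combination, score = prod(p) / (prod(p) + prod(1 - p)). The sketch below implements that combination and a simple three-way decision. It does not implement the two-probability improvement mentioned above; the cut-off values and function names are illustrative assumptions.

```python
import math

def combine(probabilities):
    """Combine per-token spam probabilities (each strictly between 0 and 1)
    into one score: prod(p) / (prod(p) + prod(1 - p))."""
    # Sum logarithms instead of multiplying, so long messages do not underflow.
    log_spam = sum(math.log(p) for p in probabilities)
    log_ham = sum(math.log(1.0 - p) for p in probabilities)
    diff = log_ham - log_spam
    if diff > 700:          # exp() would overflow; the score is effectively 0
        return 0.0
    return 1.0 / (1.0 + math.exp(diff))

def classify(score, ham_cutoff=0.2, spam_cutoff=0.9):
    """Map a score to ham, spam, or unsure using hypothetical cut-offs."""
    if score < ham_cutoff:
        return "ham"
    if score > spam_cutoff:
        return "spam"
    return "unsure"
```

The combination illustrates the "unjustified confidence" problem: three tokens that are each only 90% spammy already give `combine([0.9, 0.9, 0.9])` of about 0.9986, so scores pile up near 0 and 1 even when the evidence is thin.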
Training
- presentation of examples for ham and spam
  - generates the probabilities used by the scoring system to assign values to new messages
- corpus size
  - usually the larger, the better
  - too large may lead to overtraining
  - the number of ham and spam examples should be roughly equal
- corpus quality
  - representative samples are very valuable
  - better quality can make up for lack of quantity
  - avoid misleading cues
    - e.g. recent spam vs. old ham; tags added by the mail system
(http://spambayes.sourceforge.net/)
Testing
- messages categorized as ham or spam are used for testing the performance of the system
- frequently the existing collection of categorized messages is divided into a training and a testing set
- intuitive insights often don’t work well
  - HTML tags
  - exclamation marks in the header
  - MESSAGES WRITTEN IN CAPITALS
- cross-validation
  - formal technique that systematically divides the corpus into various combinations of training and test sets
(http://spambayes.sourceforge.net/)
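Cross-validation can be sketched as a k-fold split: the corpus is shuffled once, cut into k folds, and each fold serves as the test set exactly once while the remaining folds form the training set. The function name `k_fold_splits` and the defaults `k=5` and `seed=0` are assumptions of this sketch.

```python
import random

def k_fold_splits(messages, k=5, seed=0):
    """Yield (training_set, test_set) pairs; every message is tested exactly once."""
    shuffled = list(messages)
    random.Random(seed).shuffle(shuffled)         # fixed seed keeps runs repeatable
    folds = [shuffled[i::k] for i in range(k)]    # k roughly equal folds
    for i in range(k):
        test_set = folds[i]
        training_set = [m for j, fold in enumerate(folds) if j != i for m in fold]
        yield training_set, test_set
```

Averaging the error rates over all k pairs gives a more stable estimate than a single training/test division, because no message is used for both training and testing within the same pair.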
Results
- performance results are notoriously difficult to compare
  - message corpus
  - training methods
  - threshold
    - cut-off value for spam
  - “magic numbers”
    - parameters adjusted by the developer or user
Selected Results
- based on Paul Graham’s article “Better Bayesian Filtering” [Graham, 2003b]
- 99.75% filtering rate on 1750 messages over 1 month
- 4 false negatives: spam got through
  - usage of mostly legitimate words
  - neutral text with an innocent-sounding URL
- 3 false positives: ham got blocked
  - newsletters sent through commercial emailers
  - almost spam
  - email that happens to have features typically associated with spam
    - ALL CAPITALS, <FF0000>, in-line images, URLs
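The filtering rate can be checked with a short calculation. The sketch assumes that all 1750 messages counted on the slide were spam, which the slide does not state explicitly; the function name is hypothetical.

```python
def filtering_rate(spam_total: int, false_negatives: int) -> float:
    """Fraction of spam messages that the filter caught."""
    return (spam_total - false_negatives) / spam_total

rate = filtering_rate(spam_total=1750, false_negatives=4)
print(f"{rate:.2%}")  # 99.77%
```

Under that assumption the result is close to the quoted figure of about 99.75%. A corresponding false positive rate cannot be computed from the slide, because the number of ham messages is not given.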
Token Probabilities
- based on Paul Graham’s article “Better Bayesian Filtering” [Graham, 2003b]

Token          Spam probability
Subject*FREE   0.9999
free!!         0.9999
To*free        0.9998
Subject*free   0.9782
free!          0.9199
Free           0.9198
Url*free       0.9091
FREE           0.8747
From*free      0.7636
free           0.6546
N-ary Classification
- more than two categories
- similar techniques as in the binary approach
- can be substantially more complex
Related Approaches
- collaborative filtering
  - many people categorize messages as spam, and submit them to a central system
  - also should have ham samples
  - may “wash out” individual differences
- neural networks
  - similar concepts, but different learning methods
Implementation
- SpamBayes Project [SpamBayes]
  - stand-alone filter
  - plug-in for some popular mail programs
- Related Projects
  - SpamAssassin (http://spamassassin.org/)
    - combines statistical techniques, rules, black-lists, collaborative filtering
  - see Paul Graham’s list of spam filters at http://www.paulgraham.com/filters.html
Future Work
- extension to more sophisticated tokens
  - phrases
  - letters
    - replaced by visually similar symbols, e.g. o/0, l/1
  - separators
    - inserted between characters: spam -> s p a m, s-p-a-m
- combination with other approaches
  - blacklists, whitelists, rule-based systems, ...
- genetic algorithms
  - construction of filters through evolution
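The two obfuscation tricks named above (look-alike symbols and inserted separators) could be undone by a normalization step that runs before tokenizing. The sketch below is one hypothetical approach: the separator set, the restriction of digit replacement to words that mix letters and digits, and all names are assumptions, and it will mis-handle some legitimate text such as spelled-out abbreviations.

```python
import re

# Look-alike pairs from the slide: 0 for o, 1 for l.
LOOKALIKES = str.maketrans({"0": "o", "1": "l"})

# Three or more single characters joined by one separator each, e.g. "s p a m".
SPACED_OUT = re.compile(r"\b(?:[A-Za-z0-9][ \-._*]){2,}[A-Za-z0-9]\b")
WORD = re.compile(r"[A-Za-z0-9]+")

def _join_spaced(match):
    """Remove the separators from a spaced-out word."""
    return re.sub(r"[ \-._*]", "", match.group(0))

def _fix_lookalikes(match):
    """Replace look-alike digits, but only in words that also contain letters."""
    word = match.group(0)
    if any(c.isalpha() for c in word) and any(c.isdigit() for c in word):
        return word.translate(LOOKALIKES)
    return word

def normalize(text: str) -> str:
    """Undo separator insertion first, then look-alike substitution."""
    text = SPACED_OUT.sub(_join_spaced, text)
    return WORD.sub(_fix_lookalikes, text)
```

For example, `normalize("buy s p a m now")` gives `"buy spam now"` and `normalize("fr0m c1ick 100")` gives `"from click 100"`, leaving the pure number untouched.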
References
- [Graham, 2003a] Paul Graham, A Plan for Spam. http://www.paulgraham.com/spam.html, August 2002.
- [Graham, 2003b] Paul Graham, Better Bayesian Filtering. http://www.paulgraham.com/better.html, January 2003.
- [SpamBayes] SpamBayes: Bayesian anti-spam classifier written in Python. http://spambayes.sourceforge.net/, visited Feb. 2003.
- [Giarratano & Riley 1998]
- [Robinson, 2002] Gary Robinson's Rants: Spam Detection. http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html, Dec. 2002. [A revised version is to appear in the March 2003 issue of the Linux Journal, http://www.linuxjournal.com/.]
Important Concepts and Terms
- agenda
- backward chaining
- common-sense knowledge
- conflict resolution
- expert system (ES)
- expert system shell
- explanation
- forward chaining
- inference mechanism
- If-Then rules
- knowledge acquisition
- knowledge base
- knowledge-based system
- knowledge representation
- Markov algorithm
- matching
- Post production system
- problem domain
- production rules
- reasoning
- RETE algorithm
- rule
- working memory
Summary Spam Filtering