Classifying and Filtering Spam Using Search Engines
Oleg Kolesnikov, College of Computing, Georgia Tech
9/2003

>50% of all e-mail today is spam?
(Source: brightmail.com)

Scale
• IDC: of 31 bn messages sent each day, 18%, or 5.6 bn, were s[pc]am messages
• Brightmail decoy network stats: 6.7 bn spam messages sent in March 2003, varying from 100 to ~100,000 identical e-mails sent at a time

Current techniques to deal with SPAM/UCE:
• Blacklisting
• Signature-based Filtering
• Statistical/Bayesian Filtering
• Heuristic Filtering
• Challenge-Response Filtering
• Sender-pays
• Laws

Blacklisting
• MAPS (Mail Abuse Prevention System) RBL catches only 24% of spam, with 34% false positives ("the spam police" article, gaudi/gaspar)
• Self-appointed sheriffs/vigilantes; legitimate businesses are increasingly caught in the crossfire, e.g. iBill was losing $100k/day during each of the four days it was blacklisted
• Only a first cut at the problem: never blacklists more than 50% of the servers sending spam (Graham)

Sample and Signature-based Filtering
• Set up a network of DECOY e-mail addresses. Any message sent to these addresses must be spam => if the same message is sent to a protected address, that message must be SPAM, too (that's what Brightmail does)
• Not very flexible -- spammers take the lead in coming up with tricks
  – e.g. make each spam different

Brightmail (used by MS/Hotmail, Earthlink, Verizon, eBay, etc.)

Basic Statistical Filtering
• W (weakness): must be TRAINED; S (strength): relatively low false positives
• Starts with two message corpuses -- spam and legitimate
• Splits messages into TOKENs
• Assigns each token a probability, based on the probability of its appearance in the spam corpus, e.g. 'naked' may have a 67% probability of appearing in spam vs. 'regards' -- 10%
• When a new message arrives, the stat filter takes the top N tokens whose probabilities are farthest from the neutral 50% in either direction, applies Bayes' theorem, and comes up with a RANKING for the e-mail
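
The steps on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the filter used in the talk: the function names (tokenize, train, spam_score), the add-one smoothing, the clamping to [0.01, 0.99], and the naive-Bayes combination rule are all assumptions chosen to keep the example small.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Assumed tokenizer: lowercase runs of letters, digits, $, ' and -.
    return re.findall(r"[a-z0-9$'-]+", text.lower())

def train(spam_msgs, ham_msgs):
    # Count, per token, how many messages of each corpus contain it.
    spam_counts = Counter(t for m in spam_msgs for t in set(tokenize(m)))
    ham_counts = Counter(t for m in ham_msgs for t in set(tokenize(m)))
    probs = {}
    for tok in spam_counts.keys() | ham_counts.keys():
        # Add-one smoothing (an assumption) avoids probabilities of 0 or 1.
        s = (spam_counts[tok] + 1) / (len(spam_msgs) + 2)
        h = (ham_counts[tok] + 1) / (len(ham_msgs) + 2)
        probs[tok] = min(0.99, max(0.01, s / (s + h)))
    return probs

def spam_score(message, probs, top_n=15, unknown=0.5):
    # Keep the top_n tokens whose probability is farthest from 0.5.
    ps = sorted((probs.get(t, unknown) for t in set(tokenize(message))),
                key=lambda p: abs(p - 0.5), reverse=True)[:top_n]
    # Combine with Bayes' rule, assuming independent tokens (log space).
    log_spam = sum(math.log(p) for p in ps)
    log_ham = sum(math.log(1.0 - p) for p in ps)
    return 1.0 / (1.0 + math.exp(log_ham - log_spam))

probs = train(["buy cheap pills now", "cheap pills free offer"],
              ["meeting notes regards", "regards see you at the meeting"])
print(round(spam_score("cheap pills offer", probs), 3))
```

A score near 1 ranks the message as likely spam, a score near 0 as likely legitimate; where to put the cut-off is left to the user.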

Heuristic Filtering
• What kind of filters can you come up with JUST BY LOOKING at a spam e-mail?
  – Sender name looks bogus?
  – Header fields are missing?
  – Lots of HTML?
• Take all these rules and heuristic observations, assign weights/points, and put them into a database
• You've got yourself an early version of SPAMASSASSIN
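
A weighted rule table of this kind might look like the sketch below. The rule names, weights, and threshold are invented for illustration and are not SpamAssassin's actual rules; only Python's standard email module is used.

```python
import re
from email import message_from_string

# Hypothetical rules: (name, points, test taking the parsed message and body).
RULES = [
    ("missing_date", 1.5, lambda msg, body: msg["Date"] is None),
    ("missing_message_id", 1.0, lambda msg, body: msg["Message-ID"] is None),
    ("numeric_sender", 2.0,
     lambda msg, body: bool(re.search(r"\d{4,}@", msg.get("From", "")))),
    ("html_heavy", 1.5,
     lambda msg, body: len(re.findall(r"<[^>]+>", body)) > 10),
]
THRESHOLD = 3.0  # assumed cut-off; a real deployment would tune this

def heuristic_score(raw_message):
    msg = message_from_string(raw_message)
    # Only single-part bodies are inspected in this sketch.
    body = msg.get_payload() if not msg.is_multipart() else ""
    hits = [(name, points) for name, points, test in RULES if test(msg, body)]
    return sum(points for _, points in hits), [name for name, _ in hits]

total, fired = heuristic_score(
    "From: 8842771@example.com\nSubject: hi\n\n<b>buy</b> now")
print(total, fired, total >= THRESHOLD)
```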

SpamAssassin
• The way you can make it work (let's say with postfix):
  1) perl -MCPAN -e 'install Mail::SpamAssassin'
  2) Learn on a database of spam and legitimate e-mails using sa-learn (part of SpamAssassin)
  3) Add a filter program to filter all incoming mail through spamc, a part of SpamAssassin:
     /usr/bin/spamc | /usr/sbin/sendmail -i "$@"; exit $?
  4) spamc adds headers, something like: X-Spam-Flag: {YES|NO}, X-Spam-Level: ***
  5) The headers are caught by a user's procmail recipe and mail is classified appropriately
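
Step 5 is normally a procmail recipe; the same routing decision can be sketched in Python for illustration. The function name and the folder names are assumptions; only the X-Spam-Flag header comes from the slide.

```python
from email import message_from_string

def route_by_spam_flag(raw_message, spam_folder="spam", inbox="inbox"):
    # Read the header added in step 4 and pick a destination folder.
    msg = message_from_string(raw_message)
    flag = (msg.get("X-Spam-Flag") or "").strip().upper()
    return spam_folder if flag == "YES" else inbox

print(route_by_spam_flag("X-Spam-Flag: YES\nSubject: offer\n\nBuy now"))
```

Messages without the header are treated as not flagged, which is a design choice of this sketch rather than something the slide specifies.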

Heuristic Filtering Two
• W: the public heuristic rules database makes it relatively easy for spammers to come up with ways to bypass the system => the rules database needs to be updated frequently
• May not be as effective today as other methods, such as stat filtering

Challenge-Response Filtering
• Whenever you receive an e-mail from someone NOT on your whitelist, an automatic reply is sent telling the sender what steps to take to be considered for the whitelist (e.g. send you a confirmation, make a donation, solve a puzzle, etc.)
• Very effective at stopping spam BUT has a number of drawbacks: valid mail is delayed; it is kind of harsh -- some may think of it as inconsiderate and never reply; extra work for senders, etc.

Stats for different approaches (MessageLabs)

Problems with Statistical and other keyword-dependent methods
• 1) Heavily dependent on effective parsing and the presence of "true" tokens; spammers fool parsers. Examples:
  – White background: <font color=white>research data and other statistically strong keywords that are present in legitimate e-mails</font>
  – Splitting words: ch<!-- valid -->eck this p<!-- news -->orn
  – Adding extra characters and spaces to confuse parsers (F*R E-E) and so forth (javascript, fake html tags, browser-specific tricks)
• 2) Spam may contain too little text and be TOO close to real e-mails in keywords. This is a more serious problem. I'll give an example later.
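
To see why the first two tricks matter, here is a rough sketch of the normalization a parser would need before tokenizing. The regular expressions are illustrative assumptions, and the white-font check in particular assumes a white page background; real HTML mail needs a proper parser and a rendering model.

```python
import re

COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
# Assumes the background is white, so white text is invisible to the reader.
WHITE_FONT = re.compile(
    r"<font\s+color=[\"']?(?:white|#fff(?:fff)?)[\"']?[^>]*>.*?</font>",
    re.IGNORECASE | re.DOTALL)
TAG = re.compile(r"<[^>]+>")

def visible_text(html):
    text = COMMENT.sub("", html)     # rejoin words split by HTML comments
    text = WHITE_FONT.sub("", text)  # drop text hidden in white-on-white
    text = TAG.sub("", text)         # strip any remaining tags
    return re.sub(r"\s+", " ", text).strip()

print(visible_text("ch<!-- valid -->eck this <font color=white>research data</font>offer"))
```

A tokenizer that skips this step sees the fragments "ch" and "eck" plus the hidden "research data", which is exactly what the spammer wants.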

My research
• Developed and implemented a system for filtering unwanted mail using Google
• Can be used WITHOUT training

Classification of current spam

Thoughts
• Some users must click on those ads, or else there would be no spam (somebody IS interested in it after all)
• There may be more such users in the future as new regulations appear and spam becomes less of an annoyance and more of an ad
• Some users may like to receive SPAM-looking messages, for instance marketing reports, offers, etc., that look very much like spam

Two main observations I use
• Spam is USER-SPECIFIC
• Most spammers expect users to TAKE some ACTION upon reading spam; in other words, there has to be a FEEDBACK mechanism

Targeting the feedback mechanism
• How effective would spam be without an easy feedback mechanism?

URLs as a feedback mechanism
• Of ~1800 spam messages in the classical spam corpuses I have analyzed, ~95% contained URLs
• Of the remaining 5%, approximately 1/2 seemed to be damaged submissions (i.e. MIME conversion and other types of errors); the rest consisted of two types of letters:
  – Messages with 1-800 numbers and faxes (including Nigerian scam)
  – Religious letters

Basic Approach: URLSP
• The basic approach was to extract URLs, apply a user-specific whitelist based on a user's mailbox (masks such as .edu, cnn.com, etc.) and classify everything else as spam
• The first version I implemented has been in use at Tech since December '02
• Has actually been working quite well
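
The basic approach can be approximated as follows. This is a sketch under assumptions, not the URLSP implementation: the URL pattern, the mask-matching rule, the function names, and the handling of messages with no URL at all (returned as "unknown" here) are illustrative choices.

```python
import re
from urllib.parse import urlparse

URL_RE = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)

def extract_hosts(text):
    hosts = []
    for url in URL_RE.findall(text):
        try:
            host = urlparse(url).hostname
        except ValueError:  # malformed URL, e.g. unbalanced brackets
            continue
        if host:
            hosts.append(host.lower().rstrip("."))
    return hosts

def matches_mask(host, mask):
    mask = mask.lower()
    if mask.startswith("."):  # suffix mask such as ".edu"
        return host.endswith(mask)
    return host == mask or host.endswith("." + mask)  # domain mask

def classify(text, whitelist):
    hosts = extract_hosts(text)
    if not hosts:
        return "unknown"  # assumption: the slide does not cover this case
    if all(any(matches_mask(h, m) for m in whitelist) for h in hosts):
        return "legitimate"
    return "spam"

whitelist = [".edu", "cnn.com"]
print(classify("See http://www.cc.gatech.edu/news", whitelist))
print(classify("Buy at http://cheap-pills.example.com/now", whitelist))
```

Requiring every URL in a message to match the whitelist is the strict reading of "classify everything else as spam"; a looser policy would accept a message if any URL matches.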

Effective but rather naive
• First version effective but rather naive
• Granularity and false positives can be a problem

Next version: Classifying URLs
• CLASSIFY URLs using Google and Open Directory
• Use whitelists/blacklists of categories and URLs BASED on user mailbox and individual preferences

DMOZ/ODP

Example
• Based on files automatically generated from your mailbox, configure the system as follows (blacklist* files are omitted):
  whitelist.url:
    .edu, .mil, .gov, www.nmap.com, www.epic.org, www.cypherpunks.to, etc.
  whitelist.cat:
    Top/Computers/Security/Anti_Virus/Products
    Top/Computers/Security/Products_and_Tools/Cryptography/PGP
    Top/Computers/Security/Products_and_Tools/Password_Tools
    ...

URL Classifier: Categories Extracted from SPAM
• Examples of categories of URLs extracted from spam:
    Top/Business/Consumer_Goods_and_Services/Beauty/Cosmetics
    Top/Business/Employment/Careers
    Top/Business/Financial_Services/Mortgages
    Top/Business/Investing/Day_Trading/Brokerages
    Top/Business/Investing/Day_Trading/Education_and_Training
    Top/Business/Investing/News_and_Media/Newsletters/Stocks_and_Bonds
    Top/Business/Marketing_and_Advertising/Direct_Marketing/Mailing_Lists/MLM
    Top/Regional/North_America/Canada/Business_and_Economy/Employment/Job_Search
    Top/Shopping/Gifts/Personalized
    Top/Shopping/Home_and_Garden/Kitchen_and_Dining/Appliances/Parts
    ...

GTUC v1.0 (Basic)
• Register for a free account on a CoC-based filtering server
• Forward your mail to the server
• The mail will be automatically classified into three folders as it arrives
  – Inbox, Unknown, spam-can
• Read your mail with IMAP
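
How category whitelists and blacklists could map a message to one of the three folders is sketched below. This is not the GTUC code: the lookup table stands in for the Google/Open Directory query, the host names are made up, and the rule that a blacklisted category always wins is an assumption.

```python
def category_matches(category, prefixes):
    # A category matches a list entry if it equals it or lies beneath it.
    return any(category == p or category.startswith(p + "/") for p in prefixes)

def folder_for_urls(hosts, lookup, white_cats, black_cats):
    # lookup: hypothetical host -> directory category table (None if unlisted).
    cats = [lookup.get(h) for h in hosts]
    if any(c and category_matches(c, black_cats) for c in cats):
        return "spam-can"
    if cats and all(c and category_matches(c, white_cats) for c in cats):
        return "Inbox"
    return "Unknown"  # uncategorized hosts, or no URLs at all

lookup = {
    "av.example.com": "Top/Computers/Security/Anti_Virus/Products",
    "loans.example.com": "Top/Business/Financial_Services/Mortgages",
}
white = ["Top/Computers/Security"]
black = ["Top/Business/Financial_Services/Mortgages"]
print(folder_for_urls(["av.example.com"], lookup, white, black))
print(folder_for_urls(["loans.example.com"], lookup, white, black))
print(folder_for_urls(["new.example.com"], lookup, white, black))
```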

Spam of the future
• Innovative feedback mechanisms
• Appearance as close to legitimate e-mails as possible, e.g.
    >>> From: rcarlos@legitimate.com
    Hi, here is an interesting article. You should check it out -- net::“terminator_25”
    Roberto Carlos

Solution
• Current best -- combination of approaches
• Categorization and URL-based filtering can help
• Uncategorized URLs? Similarity + retrieval of HTML and categorization with token stats/heuristics