Filtering Spam With Justin Mason, SpamAssassin Project

Filtering Spam With Justin Mason, SpamAssassin Project & Deersoft http://SpamAssassin.org/

What Is Spam?
• Best description: "Unsolicited Bulk Email"
• In human terms: bulk e-mail you didn't want, and didn't ask for
• Mailing lists, newsletters, "latest offers": not spam, if you asked for them in the first place
• Name courtesy of Monty Python: “spam, spam and spam”

Why Bother Filtering Spam?
• Seems to be about 30% to 60% of mail traffic, and increasing
• Users are forced to waste time wading through their inbox
  – costs their employers money
• Impossible to unsubscribe
  – “unsubscribe” addresses work only 37% of the time, according to the FTC
• Legal retaliation not possible, yet
• Just plain irritating!

Spam Volume Is Increasing (data from Brightmail.com)

Filtering: Homebrew Blacklists
• First round of "spam filters": internal blacklists, maintained by in-house admin staff
• Match addresses, and delete those from known spammers
• Later, match "bad words" (Viagra, porn)
• Quite hard to configure; centralised; lots of work to keep up to date

Filtering: DNS Blacklists
• Identify spam source computers by IP address
• Allow mail system to look up a public database on the internet as mail arrives
• Block the message, if the sending computer's IP address is blacklisted
• Now at least 20 DNS blacklists, with varying reliability
• Many false positives, e.g. eircom.net's main mail server!
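The lookup described above can be sketched in a few lines. DNS blacklists are conventionally queried by reversing the octets of the connecting IPv4 address and appending the blacklist's zone; a listed address resolves and an unlisted one does not. The zone `dnsbl.example.org` and both function names are placeholders for illustration, not a real service or API.

```python
import ipaddress
import socket


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name to look up: reversed IPv4 octets, then the zone."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")  # validates the address
    return ".".join(reversed(octets)) + "." + zone


def is_listed(ip: str, zone: str) -> bool:
    """Return True if the blacklist zone has an entry for this IP.

    Needs network access. Any resolver failure is treated as "not listed",
    which is a simplification: a real mail system would distinguish a
    timeout from a definite negative answer.
    """
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except OSError:
        return False


# Hypothetical zone; substitute the blacklist your site actually uses.
print(dnsbl_query_name("192.0.2.99", "dnsbl.example.org"))
```

Because the check happens as mail arrives, the mail system can reject the message before accepting it, which is also why a wrongly listed server (the false-positive case above) loses all of its mail at once.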

SpamAssassin Concepts
• Zero-configuration where possible
• Lots of rules to determine if a mail is spam or not
  – "Fuzzy logic": rules are assigned scores, based on our confidence in their accuracy
  – These are combined to produce an overall score for each message
  – If over a user-defined threshold, the mail is judged as spam
• No one rule, alone, can mark a mail as spam
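A minimal sketch of the scoring idea, assuming regular-expression rules with fixed scores. The rule names, patterns, scores and the 5.0 threshold are made up for illustration and are not SpamAssassin's actual rule set. Every individual score sits well below the threshold, so no single rule can mark a mail as spam.

```python
import re

# Hypothetical rules: (name, pattern, score). Each score is below the threshold.
RULES = [
    ("SUBJ_ALL_CAPS", re.compile(r"^Subject: [^a-z\n]{10,}$", re.M), 1.5),
    ("MENTIONS_VIAGRA", re.compile(r"viagra", re.I), 2.5),
    ("CLICK_HERE", re.compile(r"click here", re.I), 1.0),
    ("GUARANTEED", re.compile(r"100% guaranteed", re.I), 1.0),
]


def score_message(text, rules=RULES):
    """Return (total score, names of the rules that fired)."""
    hits = [(name, score) for name, pattern, score in rules if pattern.search(text)]
    return sum(score for _, score in hits), [name for name, _ in hits]


def is_spam(text, threshold=5.0):
    """Judge a mail as spam once its combined score reaches the threshold.

    Using >= rather than > is an assumption of this sketch.
    """
    total, _ = score_message(text)
    return total >= threshold


spam = "Subject: BUY NOW CHEAP PILLS\n\nCheap Viagra! Click here, 100% guaranteed."
ham = "Subject: Lunch on Friday?\n\nClick here for the menu."
print(score_message(spam), is_spam(spam))
print(score_message(ham), is_spam(ham))
```

The legitimate message still trips one rule, but one low-scoring hit is nowhere near the threshold; only the accumulation of several hits tips a message over.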

SpamAssassin Concepts, pt. 2
• Combines many systems for a "broad-spectrum" approach:
  – Detect forged headers
  – Spam-tool signatures in headers
  – Text keyword scanner in the message body
  – DNS blacklists
  – Razor, DCC (Distributed Checksum Clearinghouse), Pyzor
• Spammers cannot aim to defeat just one system; the others will catch them out

Integration Into Mail Systems
• Wrote SpamAssassin with flexibility of integration in mind
• Many integrations have been written:
  – Integration into Mail Transfer Agents (sendmail, qmail, Exim, Postfix, Microsoft Exchange)
  – Integration into virus-scanner MTA plug-ins (MIMEDefang, amavisd-new)
  – IMAP/POP proxies and clients
  – Commercial plug-ins for Windows clients (Eudora, MS Outlook)
• And many more I don't know about!

Accuracy and False Positives
• The big issue with filtering to date:
  – not just “how much spam does it catch?”
  – but “how many legitimate mails get caught, too?”
• Many systems do not pay attention to this problem
  – Some blacklists even use "false positives" as a weapon against service providers selling to spammers
• FPs are much worse than spam getting through
  – much more inconvenient to user

Evolving a Better Filter
• SpamAssassin assigns scores using a genetic algorithm
  – Given a big collection of human-classified mail, determine what tests each mail triggers
  – Use this to "evolve" an efficient score set
  – Exactly the kind of problem a genetic algorithm is good at
  – Allows "shotgun" rules to be scored low, where they cannot do damage
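To make the idea concrete, here is a toy genetic algorithm that evolves a score set against a tiny hand-classified corpus. It is a sketch under stated assumptions, not the project's actual scoring tool: the corpus, the cost weighting (`FP_WEIGHT`), the population size and the mutation scheme are all invented for illustration.

```python
import random

THRESHOLD = 5.0  # fixed spam threshold the scores are tuned against
FP_WEIGHT = 5    # assumption: one false positive costs as much as 5 missed spams

# Toy corpus: (indices of the rules a mail triggers, human verdict is_spam).
# Rule 2 plays the "shotgun" rule: it fires on legitimate mail as well as spam.
CORPUS = [
    ({0, 1}, True), ({0, 1, 2}, True), ({0, 2}, True), ({1, 2}, True),
    ({2}, False), (set(), False), ({2}, False), ({1}, False),
]


def errors(scores, corpus):
    """Weighted error count of one score set over the classified corpus."""
    cost = 0
    for triggered, is_spam in corpus:
        flagged = sum(scores[i] for i in triggered) >= THRESHOLD
        if flagged and not is_spam:
            cost += FP_WEIGHT  # false positive: legitimate mail flagged
        elif is_spam and not flagged:
            cost += 1          # false negative: spam got through
    return cost


def evolve(corpus, n_rules, generations=100, pop_size=30, seed=0):
    """Return (best score set, best cost seen at the start of each generation)."""
    rng = random.Random(seed)
    population = [[rng.uniform(0, 4) for _ in range(n_rules)]
                  for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=lambda s: errors(s, corpus))
        history.append(errors(population[0], corpus))
        survivors = population[: pop_size // 2]  # the fitter half is kept as-is
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            child = [rng.choice(pair) for pair in zip(a, b)]   # uniform crossover
            i = rng.randrange(n_rules)
            child[i] = max(0.0, child[i] + rng.gauss(0, 0.5))  # mutate one score
            children.append(child)
        population = survivors + children
    best = min(population, key=lambda s: errors(s, corpus))
    return best, history


best, history = evolve(CORPUS, n_rules=3)
print([round(s, 2) for s in best], errors(best, CORPUS))
```

Because the fitter half of each generation survives unchanged, the best cost can never get worse from one generation to the next. A rule that fires on legitimate mail can only keep a score that does not push those mails over the threshold, which is the "scored low, where they cannot do damage" effect.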

False Positive Rate
• SpamAssassin is 98.5% accurate on our test corpora, with default settings
  – 0.6% false positives
  – 91% of all spam caught correctly
  – with network tests on, spam hit-rate probably increases to about 93-95%
• Highest rate available among present tools
• Tunable by the user: raising the threshold reduces FPs, lowering it catches more spam
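For readers unsure how these three figures relate, the sketch below computes a false-positive rate, a spam hit-rate and an overall accuracy from a list of scored, hand-classified messages, and shows the threshold trade-off. The eight sample scores are invented purely to exercise the arithmetic; they are not the test corpora behind the numbers above.

```python
def rates(scored, threshold):
    """scored: list of (score, is_spam). Return (fp_rate, hit_rate, accuracy)."""
    ham_scores = [s for s, is_spam in scored if not is_spam]
    spam_scores = [s for s, is_spam in scored if is_spam]
    false_pos = sum(s >= threshold for s in ham_scores)  # legitimate mail flagged
    caught = sum(s >= threshold for s in spam_scores)    # spam flagged
    fp_rate = false_pos / len(ham_scores)
    hit_rate = caught / len(spam_scores)
    accuracy = (caught + len(ham_scores) - false_pos) / len(scored)
    return fp_rate, hit_rate, accuracy


# Made-up sample: four legitimate mails and four spams with their scores.
SAMPLE = [
    (0.5, False), (1.2, False), (3.0, False), (5.5, False),
    (4.0, True), (6.0, True), (9.5, True), (12.0, True),
]

for threshold in (3.0, 5.0, 8.0):
    print(threshold, rates(SAMPLE, threshold))
```

On this sample, raising the threshold from 5.0 to 8.0 removes the one false positive but lets one more spam through, which is the trade-off the last bullet describes.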

Effect of the Threshold Setting

What To Do When You've Caught It
• Since classifiers are imperfect, blind deletion is bad
• Better to mark the mails, and allow user to check over them infrequently
• Also good to mark for legal reasons
  – In the UK, it may be illegal to hold mail (even spam) for more than 3 days
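Marking rather than deleting can be as simple as adding headers that the user's mail client filters on. The sketch below uses Python's standard `email` package; the header names follow the `X-Spam-*` convention, but the exact header layout, the function name and the 5.0 threshold are assumptions of this example rather than SpamAssassin's precise output.

```python
from email import message_from_string


def mark_message(raw, score, threshold=5.0):
    """Return the message with spam-marking headers added; nothing is deleted."""
    msg = message_from_string(raw)
    # Remove any copies already present so a sender cannot pre-forge them.
    del msg["X-Spam-Flag"]
    del msg["X-Spam-Status"]
    is_spam = score >= threshold
    verdict = "Yes" if is_spam else "No"
    msg["X-Spam-Status"] = f"{verdict}, score={score:.1f} required={threshold:.1f}"
    if is_spam:
        msg["X-Spam-Flag"] = "YES"  # a single header a mail client can filter on
    return msg.as_string()


raw = "From: someone@example.com\nSubject: hello\n\nMessage body.\n"
print(mark_message(raw, 7.2))
```

The message is still delivered; a client-side rule can file anything carrying the flag header into a folder the user skims occasionally, so a false positive is recoverable.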

Features For Large-Scale Use: "spamd"
• Client-server interface to SpamAssassin
• Pre-loads, so much faster for high volumes
• Can load user preferences from an SQL database
• Can load-balance, since it uses TCP/IP
• Deployed at several large organisations and ISPs: The Well, Salon.com, Panix, Transmeta, SourceForge, Stanford
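A rough sketch of what a spamd client does over TCP. The request and response layout below (a `CHECK` command line, a `Content-length` header, and a `Spam: True ; score / threshold` reply line, on port 783) reflects my reading of the spamd protocol notes and should be treated as an assumption; check it against the protocol document shipped with your version before relying on it. The function names are hypothetical.

```python
import re
import socket

# Assumed reply line, e.g. "Spam: True ; 15.0 / 5.0"
_SPAM_LINE = re.compile(
    r"^Spam:\s*(True|False|Yes|No)\s*;\s*(-?[\d.]+)\s*/\s*(-?[\d.]+)",
    re.I | re.M,
)


def build_request(message: bytes) -> bytes:
    """Frame a message as a CHECK request (assumed protocol layout)."""
    header = f"CHECK SPAMC/1.2\r\nContent-length: {len(message)}\r\n\r\n"
    return header.encode("ascii") + message


def parse_check_response(response: bytes):
    """Return (is_spam, score, threshold) from a CHECK reply."""
    match = _SPAM_LINE.search(response.decode("ascii", "replace"))
    if match is None:
        raise ValueError("no Spam: line in response")
    verdict, score, threshold = match.groups()
    return verdict.lower() in ("true", "yes"), float(score), float(threshold)


def check(message: bytes, host="localhost", port=783):
    """Send one message to a running spamd and return its verdict."""
    with socket.create_connection((host, port)) as sock:
        sock.sendall(build_request(message))
        sock.shutdown(socket.SHUT_WR)  # signal that the request is complete
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return parse_check_response(b"".join(chunks))
```

Since the client only needs a host and port, several spamd machines can sit behind one address, which is what makes the load-balancing mentioned above straightforward.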

Large-Scale Filtering For Your Network
• Different from filtering for yourself
• Many users get little spam
• Should use conservative settings
• Better to use “opt-out by default”
  – notify that spam filtering is available, and ask them if they want it

How Can Network Administrators Fight Spam?
• Scan for Open Relays & Proxies on your network
• Block proxy ports at the firewall
• Audit web servers for “FormMail” or other insecure web-to-mail scripts
• Spam traps reporting to network blacklists: Razor, DCC, Pyzor
• Run SpamAssassin, or SpamAssassin Pro!

How Do The Spammers Feel?
• Already hurting, according to CBS:
  – “[I’ve gone through] unbelievable hardships [to keep spamming]... My operating costs have gone up 1,000% this year, just so I can figure out how to get around all these filters”
• Spam relies on low overheads and extremely cheap delivery
• Disrupt the equation and they will give up!

Future Directions
• Learning filters (Bayesian probability etc.)
  – Learn automatically, to detect what "good" mail to your network looks like
• "Hash-cash"
  – Sending mail is currently more-or-less free
  – With hash-cash, the sender must spend CPU time for each recipient
  – SpamAssassin can provide "bonus points" for hash-cash users
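The hash-cash idea is a proof of work: the sender searches for a stamp whose hash starts with a required number of zero bits, which is costly to find but cheap to verify. The sketch below shows only the principle. The stamp layout (`recipient:counter`) and the 12-bit difficulty are simplifications chosen for this example, not the real hash-cash stamp format.

```python
import hashlib
from itertools import count


def leading_zero_bits(digest: bytes) -> int:
    """Count the zero bits at the start of a digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
        else:
            bits += 8 - byte.bit_length()
            break
    return bits


def mint(recipient: str, bits: int = 12) -> str:
    """Search for a stamp for this recipient; cost doubles with each extra bit."""
    for counter in count():
        stamp = f"{recipient}:{counter}"
        if leading_zero_bits(hashlib.sha1(stamp.encode()).digest()) >= bits:
            return stamp


def verify(stamp: str, recipient: str, bits: int = 12) -> bool:
    """One hash is enough to check a stamp, so the receiver's cost is tiny."""
    digest = hashlib.sha1(stamp.encode()).digest()
    return stamp.startswith(recipient + ":") and leading_zero_bits(digest) >= bits


stamp = mint("alice@example.org")
print(stamp, verify(stamp, "alice@example.org"))
```

Because the stamp is bound to one recipient, a sender mailing a million addresses has to repeat the search a million times, while a filter can verify each stamp with a single hash and award its "bonus points" accordingly.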

Fin
• http://spamassassin.org/
  – SpamAssassin for UNIX
  – (free software)
• http://www.deersoft.com/
  – SpamAssassin Pro: MS Outlook, Exchange
  – (commercial version)
  – (my employers!)