Fighting Spam with SpamBayes

Do you hate spam? I do. I receive hundreds of spam emails every day! I have tried several methods to stop the flood: Outlook rules, procmail regex filters, blacklists and whitelists, and others. None of them proved clever enough. After trying SpamBayes, I am convinced that Bayesian filters can kill spam.

I installed the SpamBayes plug-in for Outlook 2002. I fed it just 10 spam emails and 10 legitimate emails for training. After that short session, the filter was already working remarkably well, and I was finally able to disable Outlook’s clumsy built-in rules.

But there was still a problem. I had to download all messages and filter them locally. It would be far better to catch the spam on the server side before it ever reached my inbox. So I decided to set up a server-side filter. Fortunately, SpamBayes integrates nicely with procmail.

Our email system runs on the latest stable release of Postfix. I pay close attention to security, so SMTPS, POP3S, IMAPS, and HTTPS have all been configured for sending, receiving, and reading email. To enable SpamBayes, I simply added a few rules to the procmailrc file. Now every incoming message passes through SpamBayes first and gets delivered to the appropriate folder: obvious spam lands in “Spam Certain,” borderline messages go to “Spam Unsure,” and legitimate mail stays in the Inbox.
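The three-way split can be pictured as two cutoffs on the spam score that the filter assigns to each message. The sketch below is only an illustration in Python, not my actual procmail rules: the function name is made up, and the cutoff values 0.20 and 0.90 are assumptions (I believe they match the SpamBayes defaults, but check your own configuration).

```python
# Assumed cutoffs on a spam score between 0.0 (ham) and 1.0 (spam).
# These values are illustrative, not taken from my configuration.
HAM_CUTOFF = 0.20
SPAM_CUTOFF = 0.90


def folder_for_score(score, ham_cutoff=HAM_CUTOFF, spam_cutoff=SPAM_CUTOFF):
    """Map a spam score to one of the three delivery folders."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0.0 and 1.0")
    if score >= spam_cutoff:
        return "Spam Certain"   # confidently spam
    if score >= ham_cutoff:
        return "Spam Unsure"    # borderline, needs a human look
    return "Inbox"              # confidently legitimate


if __name__ == "__main__":
    for s in (0.03, 0.55, 0.99):
        print(s, "->", folder_for_score(s))
```

Widening the gap between the two cutoffs sends more mail to “Spam Unsure” and fewer messages to the wrong folder; narrowing it does the opposite.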

SpamBayes’ accuracy depends on having a sufficient training sample. I have to teach it what counts as spam and what does not. For this purpose, there are two special folders in my mailbox: “Filter Train/Ham” and “Filter Train/Spam.” I wrote a small script that trains SpamBayes from these two folders and empties them after a successful run. Whenever a message is misclassified, I simply copy it into the appropriate training folder. SpamBayes learns from the correction and handles similar messages correctly the next time.
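A rough idea of what such a training script does, sketched in Python. This is not my script: it assumes the training folders are stored in Maildir format, and train is a hypothetical callback standing in for whatever actually feeds a message to SpamBayes. The folder is emptied only after every message in it has been trained without an error.

```python
import mailbox


def train_and_empty(folder_path, is_spam, train):
    """Train on every message in a Maildir folder, then empty the folder.

    folder_path: path to an existing Maildir (assumed storage format).
    is_spam:     True for the spam training folder, False for the ham one.
    train:       hypothetical callable train(message, is_spam) that updates
                 the user's filter database.
    Returns the number of messages trained.
    """
    box = mailbox.Maildir(folder_path, create=False)
    keys = list(box.keys())

    # First pass: train on everything. If train() raises, nothing is deleted.
    for key in keys:
        train(box.get_message(key), is_spam)

    # Second pass: the run succeeded, so the folder can be emptied.
    for key in keys:
        box.discard(key)
    return len(keys)
```

A run for one user would call it twice, once for “Filter Train/Ham” with is_spam=False and once for “Filter Train/Spam” with is_spam=True.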

Everyone has a different idea of what constitutes spam, so we maintain individual spam databases for each user.

Now the people using our email system can pull only clean messages to their local mailboxes via POP3S, and train the filter through IMAPS or webmail. It saves time, bandwidth, and money.

Next, I am looking for a better anti-spam solution that works at a larger scale, for instance at OnlineNIC Inc., the company I currently work for. It hosts thousands of virtual hosts and mailboxes. The solution needs to be intelligent, effective, and customizable, and it must be able to filter both incoming and outgoing email.

Resources: SpamBayes; “Will Filters Kill Spam?”; OnlineNIC Inc.