Glog

How to Beat Bayes

Spam is an adaptive virus: we only see the successes, as more and more filtering wipes out the less adaptive versions. Lately, I've been seeing an increasing amount of spam that has passed through three layers of filtering, two of them involving Bayesian notions of word frequency. This new spam contains a bunch of randomly generated, word-length text strings. The subject lines have punctuation inserted in strange places so that the words stay legible to a person but don't "read" as words to a filter. (Of course, an easy parsing solution is to normalize words and then run filters against them.)
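To make that parenthetical concrete, here is a minimal sketch of the normalization step in Python. The function name `normalize_subject` and the rule it applies (drop any punctuation wedged directly between two word characters, then lowercase) are my own assumptions for illustration, not a description of what any particular filter does.

```python
import re

# Assumption: "strange" punctuation is any run of non-word, non-space
# characters sitting directly between two word characters.
_INTRUSIVE = re.compile(r"(?<=\w)[^\w\s]+(?=\w)")


def normalize_subject(subject: str) -> str:
    """Remove punctuation wedged inside words and lowercase the result.

    Hypothetical helper: it runs before tokenizing, so the filter sees
    "free" instead of "F.r.e.e".
    """
    return _INTRUSIVE.sub("", subject).lower()


print(normalize_subject("F.r.e.e m-o-n-e-y n,o,w"))  # free money now
```

The rule is deliberately crude: it also turns "don't" into "dont" and would mash a hostname like "www.example.com" into one token, so a real filter would probably want to keep the original tokens alongside the normalized ones.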

Obviously, this is the latest end-run around the latest anti-spam innovation. It shows that Bayesian filtering, while a wonderful idea, has its limits because of spammers' cleverness and adaptability.

Ultimately, these exercises show that no matter what algorithm we use, spam will still filter through. (I'm still seeing Nigerian variants, which amazes me.) The next approach is going to be digital certificate-based: you can't forge those, and you prevent non-trusted sources from connecting. If you put certificates on the mail servers -- and make sure that VeriSign isn't the only company controlling the issuing of these certificates, but that non-profits and other organizations can be root certificate authorities -- then only mail servers configured with them will be able to exchange email with other servers.

It'll be tricky, but I believe the next change in the net will come that way. Technology and legislation aren't stopping spam. Digital certificates could dramatically reduce it because of the ability to revoke certificates, eliminating an entire mail server from a system without requiring a blacklist. (Yeah, and then who decides to revoke certificates? And on and on.)

Bayesian filtering has become the hot new thing in fighting spam. But as Glenn Fleishman writes, the spammers are adapting.

GlennLog: "The next approach is going to be digital certificate-based: you can't forge those, and you prevent non-trusted sources from connecting." Ok, fine. Digital certificates are good. But how do you decide who's "trusted"? And why is that process any...

The strategy of filling up messages with random words is not really new, or effective. Every "Bayesian" classifier of my acquaintance looks at a subset of statistically interesting words in the message. Made-up words (being new to the corpus) are not interesting and thus don't figure into the calculation.
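A rough sketch of what "statistically interesting" can mean, loosely following the scheme in Paul Graham's essay: score each token by how far its spam probability sits from a neutral 0.5, keep the top few, and combine only those. The constants (0.4 for never-seen tokens, 15 tokens kept) are the ones I recall from that essay; the function name and the `token_probs` table are made up for illustration, and real classifiers differ in the details.

```python
from math import prod

UNKNOWN_PROB = 0.4  # assumption: near-neutral score for tokens new to the corpus
TOP_N = 15          # assumption: number of "interesting" tokens that get a vote


def spam_probability(tokens, token_probs):
    """Combine the spam probabilities of the most interesting tokens.

    token_probs is a hypothetical table mapping token -> P(spam | token),
    with every value strictly between 0 and 1.
    """
    unique = sorted(set(tokens))  # each token votes once; sorted for determinism
    probs = [token_probs.get(t, UNKNOWN_PROB) for t in unique]
    # "Interesting" means far from the neutral 0.5, in either direction.
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    top = probs[:TOP_N]
    spam = prod(top)
    ham = prod(1.0 - p for p in top)
    return spam / (spam + ham)
```

In this sketch a made-up word scores 0.4, only 0.1 away from neutral, so it is crowded out whenever the message has at least `TOP_N` known tokens carrying stronger evidence. In a very short message with few known words, the padding would still get a (weak) vote.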

In HTML email, it's possible to break up real words with HTML comments. Any decent filter will strip out HTML comments to glue bisected words back together for exactly this reason.
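For illustration, the comment-stripping step might look something like this in Python. It is a regex sketch under the assumption that comments are well-formed `<!-- ... -->` pairs; the function name is hypothetical, and a production filter would more likely lean on a real HTML parser.

```python
import re

# Non-greedy match so each comment is removed separately;
# DOTALL lets a comment span several lines.
_HTML_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)


def strip_html_comments(html: str) -> str:
    """Delete HTML comments so bisected words are glued back together."""
    return _HTML_COMMENT.sub("", html)


print(strip_html_comments("mort<!-- filler -->gage"))  # mortgage
```

Stripping has to happen before tokenizing; otherwise the filter sees "mort" and "gage" as two unfamiliar tokens instead of one familiar one.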

Mispunctuated or otherwise disfigured words are also not a viable spammer survival strategy. As Paul Graham put it [1], "'c0ck' is far more damning evidence than 'cock', and Bayesian filters know precisely how much more."

[1] http://www.paulgraham.com/spam.html

You've been pushing the digital-signature solution for quite some time [2], but I still don't see it in the cards. SpamSieve [3] is sufficiently effective that I don't worry about spam any more -- a few get through in a given week, while thousands don't.

[2] http://blog.glennf.com/2001/03/25.html
    http://blog.glennf.com/mtarchives/000557.html

[3] http://c-command.com/spamsieve/index.shtml

Even if I accept that current strategies are so flawed as to suggest a spam deluge in the offing, it's hard to get motivated about a solution that requires every email user, every sysadmin, and every trivial mail-sending perl script on the entire planet to upgrade to software that doesn't yet exist. The scope of your proposal is ambitious beyond precedent.

A permanent spam solution has to find its own tipping point, or it's dead on arrival. I'm responsible for two mail servers (one of them on the Xserve that sits directly above isbn.nu, oddly enough) and to endorse a top-down solution like a PKI is to dictate the software choices of the correspondents of my mail servers' users. I'd feel obliged to fire myself.