SpamTunnel

A statistical spam filter working as a desktop email proxy.

Back in university, once I had a decent command of the Java language, I decided to implement a larger-scale project to stimulate my learning and to combine some other ideas and technologies into an actually useful program. Given the times, it had to be something connected to the internet and involving statistics, as those were my areas of interest back then.

One way or another (probably influenced by the large amount of SPAM that was accumulating in my inbox) I decided to devise a method for filtering SPAM in the most convenient way possible.

The software that I set out to build was expected to do the following:

  • Run client side and sit between your email client and the server
  • Intercept and translate the whole POP3 protocol
  • Process incoming email and mark it as SPAM/non SPAM
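To give an idea of what sitting between the client and the server involves, here is a minimal sketch in Java of reading one message from the server side of such a proxy. It is not the original SpamTunnel code; the class name `Pop3MultiLine` and the method `readBody` are made up for this example. It relies only on the POP3 rule (RFC 1939) that a multi-line response ends with a line containing a single ".", and that lines starting with "." arrive with one extra "." prepended.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration, not the original SpamTunnel code.
class Pop3MultiLine {

    /**
     * Reads the body of a POP3 multi-line response, for example the message
     * that follows the "+OK" status line after a RETR command.
     * The body ends at a line containing only "."; a leading "." on any
     * other line is byte-stuffing added by the server and is removed.
     */
    static List<String> readBody(BufferedReader in) throws IOException {
        List<String> lines = new ArrayList<>();
        String line;
        while ((line = in.readLine()) != null) {
            if (line.equals(".")) {
                return lines; // terminator reached, message complete
            }
            lines.add(line.startsWith(".") ? line.substring(1) : line);
        }
        throw new IOException("connection closed before the terminating \".\" line");
    }
}
```

A proxy would call something like this after forwarding RETR to the real server, hand the collected lines to the filter, and then re-send them to the email client with the "." stuffing and terminator restored.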

The software was ultimately called SpamTunnel and consisted of several modules:

  1. The Communication module
  2. The statistics module (text processing)
  3. The learning module

The Communication module

This module implemented the whole POP3 protocol. All requests from your usual email client were proxied through this module and passed along to your real email server. Its purpose was to intercept the normal email flow and pass every email to the statistics module before feeding it to the email client. It was also this module's job to write the SPAM rating into the subject line or the headers of the message. The module was implemented from scratch; at the time I either did not believe in APIs or I really wanted to know the POP3 protocol by heart (which I do to this day).
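The header rewriting step could look roughly like the sketch below. It is an illustration only, not the original code: the class `SubjectTagger`, the `[SPAM]` style tag passed by the caller, and the `X-Spam-Score` header name are all assumptions made for the example. It expects the message as a list of lines, with the headers separated from the body by the first empty line.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration, not the original SpamTunnel code.
class SubjectTagger {

    /**
     * Returns a copy of the message with the tag prepended to the Subject
     * header and a score header (assumed name: X-Spam-Score) appended to
     * the header block. Lines after the first empty line are the body and
     * are copied unchanged.
     */
    static List<String> tag(List<String> messageLines, String tag, double score) {
        List<String> out = new ArrayList<>();
        boolean inHeaders = true;
        boolean subjectSeen = false;
        for (String line : messageLines) {
            if (inHeaders && line.isEmpty()) {
                // End of the header block: add whatever is still missing.
                if (!subjectSeen) {
                    out.add("Subject: " + tag);
                }
                out.add(String.format(Locale.ROOT, "X-Spam-Score: %.2f", score));
                inHeaders = false;
                out.add(line);
            } else if (inHeaders && !subjectSeen
                    && line.regionMatches(true, 0, "Subject:", 0, 8)) {
                // Header names are case-insensitive, hence regionMatches(true, ...).
                out.add("Subject: " + tag + " " + line.substring(8).trim());
                subjectSeen = true;
            } else {
                out.add(line);
            }
        }
        return out;
    }
}
```

Tagging the subject keeps the filter independent of the email client: any client can then move the marked messages with an ordinary subject rule.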

At a later stage the communication module also implemented the SMTP protocol in order to intercept outgoing email as well. Outgoing email was used by the learning module to train the database used by the statistics module.

The Statistics module

The purpose of this module was to classify each received email as SPAM or non SPAM. The statistics module relied on a database of words with weights indicating the likelihood of each word appearing in a SPAM email or in a non SPAM email (the database had a “GOOD” word table and a “BAD” word table). Every new email was processed word by word and compared against the two tables. The final score was decided using some form of average between the good and the bad score. Of course, the average could be tweaked to bias the result towards marking emails as SPAM or as non SPAM.
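The exact formula is not something I can reproduce here, so the following Java sketch only shows the general shape of such a scorer under stated assumptions. The class `WordTableScorer`, the `spamBias` parameter, and the particular weighted ratio are hypothetical choices for the example, not the original SpamTunnel algorithm.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical scorer for illustration, not the original SpamTunnel code.
class WordTableScorer {
    private final Map<String, Double> good; // word -> weight in non SPAM email
    private final Map<String, Double> bad;  // word -> weight in SPAM email
    private final double spamBias;          // assumed knob: 0.5 is neutral, higher favours SPAM

    WordTableScorer(Map<String, Double> good, Map<String, Double> bad, double spamBias) {
        this.good = good;
        this.bad = bad;
        this.spamBias = spamBias;
    }

    /**
     * Returns a value between 0 and 1, where higher means more SPAM-like.
     * Text with no known words gets the neutral value 0.5.
     */
    double score(String text) {
        double goodSum = 0.0;
        double badSum = 0.0;
        // Split on anything that is not a letter or a digit.
        for (String word : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (word.isEmpty()) {
                continue;
            }
            goodSum += good.getOrDefault(word, 0.0);
            badSum += bad.getOrDefault(word, 0.0);
        }
        double weightedBad = spamBias * badSum;
        double weightedGood = (1.0 - spamBias) * goodSum;
        double total = weightedBad + weightedGood;
        return total == 0.0 ? 0.5 : weightedBad / total;
    }
}
```

With a neutral bias of 0.5 a message whose GOOD weight outweighs its BAD weight scores below 0.5; raising the bias towards 1 makes the same message more likely to be marked as SPAM, which is the kind of tweak described above.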

At a later stage I changed the implementation of the GOOD/BAD tables to work with pairs of words instead of individual words. This increased the accuracy of the detection quite significantly.
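Switching from single words to pairs mostly changes how the text is split into table keys. A minimal sketch in Java, assuming adjacent words are joined with a single space (the class `WordPairs` and this key format are assumptions for the example, not the original code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration, not the original SpamTunnel code.
class WordPairs {

    /** Returns every pair of adjacent words in the text, in order, lower-cased. */
    static List<String> pairs(String text) {
        List<String> result = new ArrayList<>();
        String previous = null;
        // Split on anything that is not a letter or a digit.
        for (String word : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (word.isEmpty()) {
                continue;
            }
            if (previous != null) {
                result.add(previous + " " + word);
            }
            previous = word;
        }
        return result;
    }
}
```

For the text "Buy cheap pills now!" this yields the keys "buy cheap", "cheap pills" and "pills now". Pairs carry a little context that single words lose, at the cost of much larger tables.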

The learning module

The learning module was meant to train the system by creating the GOOD and BAD tables that the statistics module used. It expected the user to manually classify their email into GOOD and BAD in an initial phase. Email prepared in this way could be passed to the software either by placing it in special directories or by forwarding it to a special address that was intercepted by the learning module. Additionally, the learning module used each email classified by the statistics module to further improve the GOOD/BAD word tables.

The algorithm used by the learning module was quite primitive, being based only on the frequency of words in a given email.
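A frequency-based trainer of this kind can be sketched in a few lines of Java. This is an illustration under assumptions, not the original code: the class `FrequencyTrainer` is hypothetical, and it stores raw occurrence counts, whereas the real tables held weights whose exact derivation is not described here.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical trainer for illustration, not the original SpamTunnel code.
class FrequencyTrainer {
    final Map<String, Integer> good = new HashMap<>(); // counts from non SPAM email
    final Map<String, Integer> bad = new HashMap<>();  // counts from SPAM email

    /** Adds the word counts of one already classified email to the matching table. */
    void train(String text, boolean isSpam) {
        Map<String, Integer> table = isSpam ? bad : good;
        // Split on anything that is not a letter or a digit.
        for (String word : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (word.isEmpty()) {
                continue;
            }
            table.merge(word, 1, Integer::sum); // increment, starting from 1
        }
    }
}
```

Calling `train` once per email from the GOOD and BAD directories builds both tables; the counts would still need to be turned into weights before the statistics module could use them.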

Results

Java Hashtables turned out to be remarkably fast, probably 10x faster in string scenarios than any other Java class I was aware of at the time.

The software was quite accurate, identifying about 99.5% of the SPAM with very little legitimate email identified as SPAM (false positives), typically around 0.1%. Even when tuned to bring the false positives down to 0, the software still filtered effectively more than 80% of the time. Although it was somewhat cumbersome to use, I decided to release it on freshmeat as free software, as version 0.1. The software was downloaded a few hundred times, but I don’t really know whether anyone decided to keep it long term. I personally used it for experimental purposes for a few months only. The software was never meant to succeed, but it was successful in ways I never imagined.

What was interesting is that it got me in contact with some interesting people around the world, from university professors in the US to MIT Media Lab students inquiring about the inner workings of the software. I considered it a remarkable achievement at the time.

Interestingly enough, the software was adopted by a large number of software forums. Even now, around 7 years after the release date, Google has about 1000 results on the subject, long after the software became unavailable.

More interesting is that the software made its way into a few interesting papers:

Paul Graham and Bayesian filtering

Paul Graham published his seminal article “A Plan for Spam” around August 2002, so well before my piece of software. Please read it; it is beautifully written and explains in great detail a similar but better method for filtering SPAM. While my software used the same basic principles for filtering, my statistical and learning algorithms were far less advanced at the time they were implemented. I suspect I got to read the article only much later; however, I do not claim paternity over the idea.
