SPAM Filter

SPAM Filtering

I used the free program, Spamihilator, for email filtering with good results when I was with Earthlink but stopped using it after switching to Charter email because the SPAM level was very low. SPAM slowly increased on my Charter email accounts until, after a year, it exceeded my (relatively low) tolerance level. I re-installed Spamihilator in December 2005 and was pleasantly surprised by how much it had matured in the year I had not used it.

Spamihilator installation for OE is now automatic. I simply ran the installer and it figured out all it needed to know to insert itself as a proxy so it could filter all three of my POP3 email accounts for OE. Bayesian filtering of text in the message body is automatically included, as are a "SPAM words" filter and a whitelist of friends which can be automatically initialized from XP's address book. There are a number of plug-in filters which can be installed by simply checking them off on the Plug-Ins page of the Settings window. The only further thought required is to decide which plug-ins to add to achieve the desired level of filtering.

Spamihilator is a heuristic filter of sorts in that it allows the user to select from a list of over 30 plug-ins to build a unique SPAM filter suitable for their specific situation. The Bayesian filtering included in Spamihilator is fairly weak because it classifies only words in the message body, ignoring the header information.

Spamihilator is well behaved in that it passes messages from Whitelist friends directly to OE but copies all emails it classifies to its Training Area, where messages can be re-classified and/or submitted to the Bayes learning filter. False positives may be reclassified and copied to OE's inbox. Even if a false positive is sent to Spamihilator's Recycle Bin, it can still be restored and sent to OE's inbox, right up until its user-settable lifetime expires.

My SPAM Filtering Philosophy

Each user's SPAM is different so there isn't a one-size-fits-all choice for which Spamihilator plug-ins to use. I had about half a dozen particularly obnoxious SPAM types that would appear repeatedly in my email so my goal was to add filters until at least 95% of the SPAM was eliminated.

My thought was that most of my good email could easily be identified as being in my Friends Whitelist or containing a keyword from my WhiteStrings plug-in. Thus, mail I am (fairly) certain is not SPAM isn't subjected to SPAM filtering, so there is no chance of a false positive with these emails: Spamihilator transfers them to OE's Inbox immediately.

I added a few plug-in filters to eliminate obvious types of SPAM and then studied the SPAM that got through these simple filters. The outstanding thing about SPAM that eluded the simple filters was that it was mostly HTML, GIF, or Base64 encoded and/or it contained a peculiar value in one or more of the header lines, the lines of information about the message that the mail system uses for routing or that OE uses for display control. The interesting thing about these structural items is that they can't easily be spoofed because the mail system needs them.

Since Bayes statistical filtering became popular, SPAMmers use HTML, GIFs, JPGs, etc. more frequently and also add strings of random text to foil statistical filtering. The Bayes approach is mostly a semantic type of filtering in that it depends on word frequency in the overall message without consideration of the context or syntax. A blacklist is another frequently used filter, so SPAMmers often spoof header info and inject SPAM into the mail system via unorthodox methods. ISPs often add info to the header which helps detect these unorthodox methods, and this info may be useful in filtering.

That is, SPAMmers take measures to avoid common filtering techniques and these measures often cause peculiarities in the resulting SPAM. Certain peculiarities are seen in the structures required by the mail system and some of these are detectable.

I used Spamihilator's XHeader filter to detect the syntax of some SPAM characteristics and declare messages which use this syntax to be SPAM. My approach was to look at SPAM that was not caught and add an XHeader rule to catch that specific SPAM. As I added rules to XHeader a surprising result occurred: not only did these rules eliminate the SPAM which eluded the other filters I was using, these rules also detected SPAM which had previously been detected by other filters! With only 6 XHeader rules, the other filters were seldom catching SPAM because XHeader caught it first.

Using XHeader

The breakthrough in filtering for me was realizing that the XHeader filter could recognize syntax in a rudimentary way. Like many of the Spamihilator plug-ins, the description was terse and didn't indicate its real capability. XHeader seems to select lines in a message by the leading text on the line and then allows classifying messages based on whether the selected line has some specified characteristic (typically contains some text). By selecting a line based on the leading text before applying a rule, XHeader provides some simple syntax checking; i.e. finding a line beginning with "Received" that contains the phrase "localhost" denotes SPAM but "localhost" is ignored in lines that begin differently.
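
A minimal sketch of this select-then-test behavior, in Python. It illustrates the behavior as observed from the outside and is not XHeader's actual code; the names "Rule" and "is_spam" are made up for the example, and case-insensitive matching is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical rule: select lines by leading text, then test for a substring."""
    line_prefix: str  # leading text that selects a line, e.g. "Received"
    contains: str     # text that marks a selected line as SPAM


def is_spam(message: str, rules: list[Rule]) -> bool:
    """Return True if any selected line contains the text named in its rule."""
    for line in message.splitlines():
        lowered = line.lower()
        for rule in rules:
            # Assumption: matching ignores case.
            selected = lowered.startswith(rule.line_prefix.lower())
            if selected and rule.contains.lower() in lowered:
                return True
    return False


rules = [Rule("Received", "localhost")]

spam = "Received: from localhost (unknown [10.0.0.1])\nSubject: hello\n\nBuy now"
ham = "Received: from mail.example.com\nSubject: hello\n\nMy server runs on localhost"

print(is_spam(spam, rules))  # True: a Received line contains "localhost"
print(is_spam(ham, rules))   # False: "localhost" appears only in a body line
```

The second message shows the point of selecting the line first: the same word is harmless when it appears on a line that begins differently.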

The rules added to XHeader to eliminate SPAM that eluded my original filters are:

In adding rules to XHeader I copy strings from SPAM messages and paste them into XHeader rules -- this avoids errors from typos (which are difficult to find when a rule doesn't work). XHeader does not indicate which rule caused it to identify a message as SPAM so one must get the rules right without this type of helpful feedback.

The first rule I used caught one common symptom of injecting mail using a spoofed address: the accepting machine tags it with "from unknown". This is not foolproof - I saw a couple of false positives, but they were from peculiar sources, so the rule worked well for several months. When I changed ISPs, the new mail system tagged all mail as "from unknown", so this rule no longer worked. I changed the rule to detect "localhost" -- not as commonly seen, but no false positives so far.

The "iso-8859" rule catches SPAM where the encoding used makes the Subject line difficult to filter for those who use a list of SPAM words via OE's rules. It is hard to say whether it is still needed: it was added before the HTML filtering rules and may now be superfluous.

The "image/gif" rule catches some unusual SPAM that uses an inline GIF rather than HTML and similarly, the "base64" rule catches some unusual SPAM.

The "multipart/alternative" rule catches some SPAM that seems to use forwarding to somehow hide the "text/html" from XHeader.

My new ISP performs an SPF check and inserts the result into the header, so I added a rule to catch mail that fails this check. I am still watching this one to see how it does.

SPAM evolves so things could change but for now this simple filter setup does what I need and very seldom produces a false positive. I keep other filters in place as backup, of course. The XHeader rules I use have some overkill built in so SPAM messages often violate several XHeader rules simultaneously, making the rules less sensitive to minor changes in SPAM tactics.

This filtering will produce false positives if people not in your Friends whitelist send Rich Text Format (HTML) messages that don't include Whitestring words; so far, this hasn't happened to me but your situation may be different so use caution for a while if you try these rules. Add rules one at a time and use them for a bit before adding the next rule - check carefully for false positives while trying new rules, of course.

XHeader Problems

XHeader has not been thoroughly debugged, so it has some peculiarities in its description and in its operation. The author, Bernhard Maehr of Linz, Austria, isn't actively supporting this valuable filter, so it is what it is and users must accept that. While I can think of several enhancements, listing them is only an exercise in frustration without the author to support the plug-in.

The XHeader plug-in has a very terse description that does not do it justice: "Filters Mail depending on X-Header lines". By experimenting with XHeader I found that it is far more capable than the description implies. Initially, I assumed it would only filter header items that begin with "X-" but tried other things to see what would happen. My experiments showed that it selected lines based on the first characters on the line and that the lines selected need not be in the header, where I took header to mean prior to the "Subject" line. XHeader seems to select based on the first characters of lines located anywhere in the message, except that some messages which may be spoofed as Forwarded seem to avoid selection of lines within the Forward section. XHeader seems to treat lines which begin with blanks as an extension of the previous line which began with a non-blank. These opinions about how XHeader parses a message are based on interpretation of observed behavior and are subject to misunderstanding on my part - however, the success of the rules above supports these conclusions.
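
The continuation-line behavior matches the way mail headers are normally folded: a header line that starts with a space or tab continues the previous line. A minimal Python sketch of that unfolding step follows. It reflects my reading of the observed behavior rather than XHeader's code, and the function name "unfold_lines" is hypothetical.

```python
def unfold_lines(message: str) -> list[str]:
    """Join lines that begin with a space or tab onto the preceding line."""
    logical: list[str] = []
    for line in message.splitlines():
        if line[:1] in (" ", "\t") and logical:
            # Continuation: append to the previous logical line.
            logical[-1] += " " + line.strip()
        else:
            logical.append(line)
    return logical


raw = (
    "Received: from mail.example.com\n"
    "\tby localhost with SMTP\n"
    "Subject: hello\n"
)
for line in unfold_lines(raw):
    print(line)
# Received: from mail.example.com by localhost with SMTP
# Subject: hello
```

Once the lines are joined this way, a rule that selects on "Received" also sees text that was physically on the following indented line.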

One XHeader problem discussed in the Spamihilator English Forum (Spamihilator's author, Michel Kramer, is from Germany) is that if "Subject contains Re:" is part of a rule, XHeader will declare all messages where the Subject contains Re to be SPAM. I ran into this with one of my early rules but later found that much broader rules worked well and also reduced the number of rules necessary.

A more serious problem is that, unlike other Spamihilator plug-ins, XHeader does not indicate which rule it used to decide that an email is SPAM. This makes debugging of new rules more difficult because the email must be opened and scanned for items by hand. This also makes it difficult to determine which rules are most important vs which rules are no longer needed -- as implied above, further experiments may show some of the 6 rules are not necessary.

Finally: XHeader occasionally crashed on startup, apparently based on the number or size of rules. It crashed several times when I tried 8 rules but worked fine with the rules above (other users report no problem with 40 rules). The Hercule filter caught the SPAM on the occasions when XHeader crashed. XHeader crashes about every 6 weeks on my system and sometimes takes Spami out too; occasionally XP dies when this happens so this is not a benign occurrence.

Spamihilator Filters and Miscellaneous Details

The list of Spamihilator filters and their priority in my system is shown in this screenshot:

XHeader catches the majority of the SPAM I receive; other filters, especially Hercule, provide backup in case XHeader crashes. It is critical that filters used to identify non-SPAM appear before XHeader. The Friends list is first by default in Spami, so the critical thing is that Whitestrings precedes XHeader; otherwise many false positives will occur. Addressee is my first filter only because it is simple, so it is very fast, and it has never produced a false positive for me. YMMV.
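
A minimal Python sketch of why the order matters. It assumes that filters run in priority order and that the first filter to give a verdict decides the outcome, which is how Spamihilator appears to behave. The two filter functions, the keyword "sundial" and the "text/html" rule are hypothetical stand-ins, not the real plug-ins or my real settings.

```python
from typing import Callable, Optional

# A filter returns "ham", "spam", or None when it has no opinion.
Filter = Callable[[str], Optional[str]]


def whitestrings(message: str) -> Optional[str]:
    # Hypothetical keyword standing in for a Whitestrings entry.
    return "ham" if "sundial" in message.lower() else None


def xheader(message: str) -> Optional[str]:
    # Hypothetical stand-in for an XHeader rule on the Content-Type line.
    for line in message.splitlines():
        lowered = line.lower()
        if lowered.startswith("content-type") and "text/html" in lowered:
            return "spam"
    return None


def classify(message: str, filters: list[Filter]) -> str:
    """Run filters in priority order; the first verdict wins."""
    for check in filters:
        verdict = check(message)
        if verdict is not None:
            return verdict
    return "ham"  # nothing objected, so deliver the message


msg = "Content-Type: text/html\nSubject: question about your sundial page\n"
print(classify(msg, [whitestrings, xheader]))  # ham
print(classify(msg, [xheader, whitestrings]))  # spam (a false positive)
```

The same message is delivered or discarded depending only on which filter sees it first, which is why the non-SPAM filters must come ahead of the broad XHeader rules.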

My Whitestrings filter has only 10 words and passes all the email from people who write to me about my web sites - I was surprised how well this worked. I also added about the same number of words to the SPAM Words list to catch a few SPAMs that snuck through in the early going. I doubt these are needed now that XHeader is working so well.

One item which would be very helpful in evaluating filter changes is a more convenient way in Spamihilator of accumulating sample SPAM, then re-submitting these as a group to test a new filter configuration. While this is possible with a virtual POP3 server, it is awkward enough that I avoid it as much as possible.

Overall, Spamihilator is a great program that allows users to quickly and easily tailor its filtering to pass email from friends and remove the specific SPAM they receive. The XHeader filter is an exceptionally powerful SPAM filter whose capability is often not appreciated.
