State of the antispam regime

One of the advantages of hanging out in The Answer Gang is being privileged to hear when one of the Gangsters goes off on a rant (which are usually lots of fun), or, as often happens in the case of Rick Moen, fires off a mini-tutorial on some interesting subject. Recently, in response to a comment of mine, Rick posted a reply that we'd like to share with you, our readers; not only was it a bit long for our Mailbag, but it also deserved the status of a full article. Many thanks to Jimmy O'Regan for formatting the content; as with so many jobs around LG, "many hands make light work".
-- Ben Okopnik

Contents

State of the antispam regime
Description of the toolset
The downside of blackholing

State of the antispam regime

Ben wrote:

As to spam, you may or may not have noticed, but the incidence of it has gone down (my estimate) by ~99% since Rick Moen took over administering the list. It's been more than a year, and I don't think I've seen a dozen spams get through during that time.

Herewith, a brief report on the state of system spam-rejection.

As a reminder, TAG's setup is challenging, in that list policy precludes requiring subscription before posting. Thus, a prime antispam tool isn't available.

The most-effective antispam tool is use of Exim 4.60 (w/sa-exim 4.2)'s "callout" interface during incoming SMTP sessions, used to verify that the connecting MTA is RFC-compliant (e.g., accepts DSNs and mail to the postmaster and abuse accounts). This blocks an immense percentage of spam, and 55x-rejects those. After that, the mail is subjected to SpamAssassin 3.1 testing: High spamicity mail is 55x-rejected, very high-spamicity mail is 45x-rejected (i.e., teergrubing, which attempts to punish particularly flagrant malefactors by tempting them to continually reattempt delivery).

I use some but not all of the optional enhancements provided by J.P. Boggis's "EximConfig" package (http://www.jcdigita.com/eximconfig/), of which I have v. 2.0 installed:

The setup does not (yet) use greylisting, which might be beneficial.

I've been reluctant to make my MTA depend on MySQL.
The setup does not (yet) attempt to SA-test mail with full unescaping and Base64 decoding of the message body text, which might be beneficial.

I've been reluctant to make my MTA depend on embedded Perl and MySQL.
The setup does not (yet) attempt flood protection / duplicate messages / repeat failed deliveries.

I've been reluctant to make my MTA depend on MySQL.
The setup does not (yet) attempt direct detection of malware as distinct from other types of spam.

I've been reluctant to make my MTA depend on detectors of MS-Windows malware, which after all isn't actually harmful, only annoying -- and it seems lame to have to run a virus checker on a Linux box, even ClamAV.

Spam addressed to TAG since March 1 (to date) that evaded my filters was as follows:

One in Polish.
A lame bit of Windows malware (zipped) with a quasi-nonsense cover message.
A piece of Russian/Cyrillic spam
Five 419 (advance-fee) frauds
Four phishing frauds.
An old-fashioned UCE

In each case, I blackholed the delivering IP upon receipt -- but that's ultimately a poor solution: You end up endlessly playing whack-a-mole. Still, it makes sure there are no repeats from that IP.

I may finally overcome my reluctance to enable those above-cited optional EximConfig features, persuant to my long-deferred server migration to slightly less antique hardware, Real Soon Now. Also, I could attempt to update Exim and SA's rulesets, to help them pattern-match better on 419 and phishing frauds. (I gather that financial fraud spam has been really taking off.)

The point of the recounting, above, is to highlight where the major weaknesses in the current system lie: non-English-language spam, and financial fraud spam.

I try to be really, really careful in any experimentation with my MTA, since mail is very important to me and my users, and since I can't afford to lavish unplanned debugging time fixing a malfunctioning mail system. Thus, even though I know I really should upgrade to EximConfig 2.2, sa-exim 4.2.1, and SpamAssassin 3.1.1 -- and put in place some custom third-party rulesets for the latter (http://wiki.apache.org/spamassassin/CustomRulesets), I've been slow to do so for fear of breakage and extra time commitment.

Description of the toolset

[Description of the existing toolset: Exim 4.60, sa-exim 4.2, EximConfig 2.0, and SpamAssassin 3.1 -- along with its design goals, current weaknesses, and reasons why I've been reluctant to implement some optional extensions and in general am very careful about breakage.]

Warning: Following paragraphs include opinions, which you are welcome to either adopt, take home, and admire, or scowl at and hurl imprecations towards, as local cerebral wiring policy dictates. Readers looking for guaranteed objective truth should stick to mathematics -- and even then stay far away from anything touched by Kurt Gödel.

Spam defence is endlessly controversial for lots of reasons, including inherent drawbacks (false positives, false negatives, other types of collateral damage) in even the best regimes. No matter what tactics you use, you annoy someone -- and mailing lists (being glorified mail-forwarding devices) have proven to be Ground Zero for the spam problem and resulting controversies.

The "luv-main" mailing list, of the Linux Users of Victoria (Australia) has recently erupted in one such donnybrook, where the overwhelming majority (and pretty much all of the more-technical members) wish to convert the mailing list to publicly searchable archives, while a vocal minority stand in the way, protesting their right to continue hiding their mailing addresses from spammers. Many innocent electrons have been killed in the resulting discussion, but at least it was established to the satisfaction of most that the only reasons hide-from-spammers tactics "work for many people" (as a proponent put it) are short usage periods and dumb luck: Over time, any address used for mail will become discovered by spammers through any of several diverse means, including exposure on other people's virus-infected Windows boxes.

LUV's likely compromise solution will be creation of a fully public "luv" mailing list alongside the obscured "luv-main" one, with the expectation that the latter ghetto will wither and die. (Naturally, the minority aren't happy with that proposal, either.)

On the other side of the Pacific Ocean, the Silicon Valley Linux User Group's mailing lists are gradually becoming more spammy, despite using (modulo versions) exactly the same MTA and spam-rejection software I use, because SVLUG's Linux server has been completely unmaintained for 2+ years.

I'm not unsympathetic towards hide-from-spammers people: As they often point out, most use mail facilities over which they have no administrative control -- often, in fact, their work mailboxes. A deluge of spam would make their lives miserable and might even interfere with their professional lives. This perceived loss-of-control threat increases their stress levels immensely, and impels them to make sometimes emotional demands on listadmins and others. (I frequently get mail saying "I [/someone else] inadvertantly revealed my private e-mail address on mailing list $FOO. Would you mind please removing it from the public archive?" I always do help such people. Even though I don't share their general approach, it's the kind thing to do.)

In any event, I'd been pondering both the slow deterioration of SVLUG's spam-defences (from my front-row seat as its mailing list moderator) and my own system's occasional slippage on TAG mail -- e.g., the aforementioned twelve spams over the last six weeks: 5 advance-fee frauds, 4 phishing frauds, 1 in Russian, 1 Windows malware, and 1 UCE. As with SVLUG's system, I realised that I could probably do better, after a bit of updating.

So, a couple of days ago, I did some. You'll maybe have noticed that we've had absolute, blissful silence on the spam front, since then -- which might be coincidence, or maybe not.

I figured out that one easy path to the low-hanging fruit was: beef up the SpamAssassin rulesets. In an ideal world, this would not be my preferred approach: SA is a very beefy and slow Perl app, and those same heuristics, implemented as improvements to the Exim4 front-end rulesets would, mutatis mutandis, be much more desirable. However, those aren't at hand, while there's a veritable bazaar in third-party ruleset files for SA, right here:

http://wiki.apache.org/spamassassin/CustomRulesets

The ones I dropped into /etc/mail/spamassassin, a couple of days ago, were as follows:

70_zmi_german.cf     Catches German language spam.
Chinese_rules.cf     Rules to catch spams written in Chinese.
mime_validate.cf     Finds MIME errors common in mails sent by bulk mailers.
blacklist_ro.cf      Catches spams written in Romanian or by Romanian spammers.
evilnumbers.cf       Phone #s, PO boxes, & street addresses harvested from spam.
chickenpox.cf        Looks for words broken up by extraneous symbols.
french_rules.cf      Catches spams written in French.
malware.cf           Detects URLs known to point to malware.

There's a standard cronjob (Rules du Jour) to keep these and others updated; I haven't implemented that yet, as I'm still testing.

Initially, I also installed this one:

sa-blacklist:  a large set of blacklist entries of domains and IP addresses.

This turned out to an 11MB(!) ruleset file. For a Perl script. I rather recklessly did give it a try, resulting in the kernel out-of-memory killer going on a shooting spree (on my antique 256MB RAM PIII server) about five minutes later.

In addition, also quite usefully, I'm sure (if not more so than the SA improvements), I dropped in updated nine updated ACL files for the J.P. Boggis's EximConfig suite of Exim4 rules (and other things): http://www.jcdigita.com/eximconfig/#ACLs

...which brings us to the present, and the (for now) absence of new spam arrivals. Moral of the story: It's possible (at least, if you're an MTA operator) to have a credible, livable alternative to the venerable "hide from spammers" stance: Use better technology, apply it intelligently, and be aware that some ongoing maintenance is required.

Personally, I would regard it as beneath my dignity to do otherwise. Internet presence is my community's core competency, so I'll be damned if I'm going to surrender even an inch of it.

One caveat: Silence isn't necessarily blissful. There will always be collateral damage, and one must keep an eye out for addresses, IPs, hostnames, etc. that should be whitelisted.

The downside of blackholing

Jimmy Regan replied to TAG querent Marcin Niewalda:

> Myślę, że to pomyłka: pan napisał do listy adresowego magazynu
> internetowego. Dlatego że nasz magazyn jest napisany po angielsku,
> przetłumaczyłem e-mail Pana. Adres, którego Pan szukał, jest
> Delaveaux@heagmedianet.de ale myślę, że ten pan mowi tylko po
> angielsku i po niemiecku; a nie wiem, czy ten adres jest nadal
> aktualny.

Separately, I had commented:

> In each case, I blackholed the delivering IP upon receipt -- but
> that's ultimately a poor solution:  You end up endlessly playing
> whack-a-mole.  Still, it makes sure there are no repeats from that IP.

And there's another reason why it's a poor solution: Some days, you get trigger-happy, e.g., when an innocent, on-topic query arrives in Polish -- which is a problem if, unlike Jimmy, one is Polish-challenged.

Having seen the original query and (erroneously) concluded that it was spam, I absent-mindedly added the delivering host's IP to /etc/exim4/eximconfig/reject/ip via the vim instance I leave open all the time editing that file. It's a straight listing of "Individual full sender IP addresses to reject", one per line. Today, several days later, I had long ago lost track of which line it was. Fortunately, in this one instance, my error remains fixable:

[rick at linuxmafia]
~ $ dig -t mx okiem.pl +short
10 mail.okiem.pl.
[rick at linuxmafia]
~ $ dig mail.okiem.pl +short
72.232.62.58

Accordingly, I've just now un-blackholed poor Marcin's ISP mail server IP. But in many cases, I'd either not realise I'd misjudged, or be unable to re-find which IP it was.

Anyway, the larger point I wanted to make is that what I said in the three-line quotation, above, wasn't exactly right -- because I actually do try to be mindful of grey areas, and look closely at what the IP really is, before consigning it to permanent oblivion. (One doesn't want to rely on Received headers' hostnames in so doing, except for ones supplied by my own receiving MTA doing reverse-resolves: Spammers lie; "dig -x" is often one's friend.)

For example, if the prior-hop IP on a 419-fraud spam corresponds to hostname mx105.exampleisp.com, then blackholing it would be dumb: For one thing, impliedly exampleisp.com has at least 104 other mail exchangers. For another, I'm not necessarily eager to pronounce anathema on exampleisp.com just because of one 419 fraudmail. That could happen to almost anyone operating an MTA. It could happen to me (but only until I tracked down the user on my system who did it and... reasoned with him).

I'm reminded of a passage in my friend Karsten Self's (CC'd) excellent recent analysis paper "CIDR House Rules: Use of BGP Router Data to Identify and Address Sources of Internet Abuse"[1]:

While blocklisting is one possible option, I'd very much like to see the discussion move beyond that point. A preferred approach is what I term "proportionate response". First: you'll likely want rules to expedite known-trusted mail, or high priority mail from remote organizational sites, peers, clients, vendors, or other established relationships. Secondly, many peers will either have small overall volumes, or not have a clearly identifiable nature. This leaves the set of networks that are both high-volume and overwhelmingly spammy in nature. Of course, any such implementation would have to be evaluated in a business and organizational context.

In proportionate response, a certain level of abuse would be met by a proportionate level of response. For example, a network from which 90% of email was found to be spam, 90% of traffic originating from that network would be denied or dropped, either at the service (protocol) or IP level, at random. If done at the SMTP transaction level, either as a timeout (without 250 OK) or non-permanent rejection, this would mean legitimate mail still has a fighting chance to get through. A 90% reject rate would allow half of mail through on 5 retries, for a typical 2 hour delay. A spam server without retry rules would fail delivery of 90% of its mail; with retries, it would suffer large mail spools and possible other resource starvation. The site implementing such a policy will receive immediate benefit to itself. Widespread adoption is not necessary to be locally beneficial. As multiple and large sites adopt such measures, impacts on abuse-tolerant networks would be significant. The approach is to be both non-invasive and non-retaliatory. You are not taking any action that in any way directly changes or affects a remote system: but are subjecting it to a denial of interest. As a proportionate response, reject rates could vary with total traffic volume, abusive traffic percentage, and severity of abuse, as suited specific needs. Fine levels of control are therefore possible; operators are not reduced to all-or-nothing responses to abuse.

I don't yet have the toolsets to implement Karsten's excellent advice, though I admire its judicious approach. Lacking those, I mostly rely on the previously described RFC-compliance checking (implemented via Exim 4.x callouts), SPF-checking, tarpitting, and intentionally very sparse use of other lossy filtering. Playing whack-a-mole on spam-source IPs is mostly a losing game, with too much collateral damage.

Main exceptions are: (1) Some IPs whose owning domains / companies I've just seen much too often in that context, and accordingly have classed as evil. The French division of European broadband ISP Wanadoo is the example that most comes to mind: After seeing way, way too much blatant spam from IPs respolving to *.wanadoo.fr hostnames, I just started adding them as they spammed me to /etc/exim4/eximconfig/reject/ip. Legitimate French Wanadoo customers will increasiingly be SOL in sending mail to any address on my machine, which is a little unfair but life's imperfect.

It's basically a moral judgement on my part that Wanadoo should do much better, and that therefore it and its users can go to Hell. This isn't very nice of me, and possibly isn't a wise long-term measure, but it feels good. ;->

(2) IPs whose surrounding facts make them seem like IPs from pools of virus-infected Windows desktop machines being abused to crank out virusgrams, UCE, etc. One can reasonably guess that nothing from those IPs will ever be legitimate, for at least 3-5 year values of "ever".

It would be good if entreis in /etc/exim4/eximconfig/reject/ip entries (and similar files for related purposes) were at least date-stamped and would time out: They aren't and don't. My maintaining such a file manually is not very satisfactory and has obvious drawbacks. Over time, I hope to phase it out -- and maybe even pull Wanadoo out of its oubliette.

[1] http://linuxmafia.com/~karsten/cidr-house-rules.pdf Recommended.

Abstract: "BGP router data may be used to identify contiguous regions of network space from which significant abuse is observed. Experience suggests a strong power-law relationship in ranking such sources. Applying this knowledge in abuse countermeasures may markedly reduce filtering overhead while minimizing inadvertant blocking and increasing total costs to abuse-tolerant networks."

Talkback: Discuss this article with The Answer Gang

[BIO] Rick has run freely-redistributable Unixen since 1992, having been roped in by first 386BSD, then Linux. Having found that either one sucked less, he blew away his last non-Unix box (OS/2 Warp) in 1996. He specialises in clue acquisition and delivery (documentation & training), system administration, security, WAN/LAN design and administration, and support. He helped plan the LINC Expo (which evolved into the first LinuxWorld Conference and Expo, in San Jose), Windows Refund Day, and several other rabble-rousing Linux community events in the San Francisco Bay Area. He's written and edited for IDG/LinuxWorld, SSC, and the USENIX Association; and spoken at LinuxWorld Conference and Expo and numerous user groups.

His first computer was his dad's slide rule, followed by visitor access to a card-walloping IBM mainframe at Stanford (1969). A glutton for punishment, he then moved on (during high school, 1970s) to early HP timeshared systems, People's Computer Company's PDP8s, and various of those they'll-never-fly-Orville microcomputers at the storied Homebrew Computer Club -- then more Big Blue computing horrors at college alleviated by bits of primeval BSD during UC Berkeley summer sessions, and so on. He's thus better qualified than most, to know just how much better off we are now.

When not playing Silicon Valley dot-com roulette, he enjoys long-distance bicycling, helping run science fiction conventions, and concentrating on becoming an uncarved block.

Copyright © 2006, Rick Moen. Released under the Open Publication license unless otherwise noted in the body of the article. Linux Gazette is not produced, sponsored, or endorsed by its prior host, SSC, Inc.

Published in Issue 127 of Linux Gazette, June 2006

<-- prev | next -->