SpamAssassin/DCC integration

Craig R Hughes craig@hughes-family.org
Wed May 1 22:26:11 UTC 2002


Vernon Schryver wrote:

VS> > On 05/01/02 Craig R Hughes uttered the following other thing:
VS> > > Hearing good reports from cutting edge SpamAssassin users who've installed
VS> > > recent CVS builds and activated the new DCC/SA integration stuff.
VS> > >
VS> > > And already we're seeing the first person who's concerned with DCC performance,
VS> > > and wants to run their own local dccd.  So here's my question: how easy would it
VS> > > be to re-implement the DCC client side in perl (haven't looked at the DCC source
VS> > > yet), so that SpamAssassin doesn't have to fork a dcc client process for each
VS> > > message it processes, and doesn't have to create a copy of the message text to
VS> > > pipe into the dcc process, etc.  It'd be neat (and probably a heck of a lot
VS> > > faster) to do the DCC client side stuff in perl, and just call it directly from
VS> > > the main SA code instead of forking and piping.  Anyone looked at doing this
VS> > > before?
VS>
VS> The performance problem in that particular case cannot be related to
VS> forking or anything else that might be changed on the client side,
VS> because the volume of mail involved is not large.  I also doubt that
VS> running a DCC server locally would help this particular client speed
VS> problem.  If it is painful to do the equivalent of at most 2 DNS
VS> transactions per message, then shoveling 6-10 MByte/day of the full
VS> spam database over the wire is likely to be too painful to imagine.

I understand that the current performance issue is a network latency issue.
However, now that DCC is in SA, you may shortly have several hundred ISPs trying
to use DCC -- current SA users who upgrade to SA 2.30 (next release, circa June
1st) plus new SA users.  This is a substantial volume of email; not
sure what volumes DCC is seeing today without all these high-volume SA users.
Some of these people push in the million-emails-per-day ballpark.  They will
want their own DCC servers locally of course, but they'll also care a lot about
squeezing performance from the setup.  What I'm trying to do is see if we can
get pro-active about this imminent problem.

VS> If done more often than once an hour, the DCC client transaction is one
VS> UDP round trip or about the same as one DNS transaction where the right
VS> DNS server is already known.  An SMTP server typically does at least 3
VS> DNS transactions, one for the reverse DNS lookup of the SMTP client IP
VS> address, a second for the forward DNS lookup of the reverse name, a third
VS> for the domain name in the Mail_From command, and additional DNS lookups
VS> for each DNS blacklist.  Each DNS lookup involves at least one UDP round
VS> trip, and up to 3 or even more UDP round trips if the system must ask a
VS> root DNS server, a TLD server (e.g. .com), and then the server for example.com.
VS> If done less often than once an hour, the DCC client transaction is one
VS> extra DCC round trip plus some fuzz to measure the RTT to other servers.
VS> `cdcc info` will tell you how slow the DCC transaction is.
VS> I guess `time nslookup` or `time dig` would approximate the speed of DNS.

These ISPs, I imagine, run their own caching DNS servers locally and already have
the DNS lookup bandwidth built into their capacities.  The difference between
what they're doing now (assuming they're running SA 2.20) and running 2.30 with
DCC turned on is that they will have to fork a dccproc process for every message
they process, and SA will have to create an in-memory copy of each message to
pass to the separate-memory-space dccproc process.  Potentially
considerably less resource-intensive in such a setup would be to pass the
existing copy of the message that SA is using to a perl implementation of
dccproc which is in the same perl VM -- no copying the message unnecessarily,
and no forking overhead.

VS> Rewriting the code from C to Perl would certainly not be my first thought
VS> for improving CPU or disk speed.  I also bet forking and exec'ing a C
VS> program is probably a lot faster than forking and exec'ing the Perl (or
VS> any) interpreter to run a Perl version of the DCC client code.

The perl VM is already loaded though, since the message is being passed through
SA.  I agree the forking overhead is probably pretty low -- the bigger concern
is the overhead of copying and piping the message between processes.

VS> To directly answer the question, porting the DCC client code to Perl
VS> would be non-trivial.

Ok, I figured that was probably the case given the arguments against porting
made above :)

VS> > What would probably be better would be to just make a "library" version
VS> > of the dcc client... and then you can wrap that library for whichever
VS> > scripting language you might want.

Yeah, I guess I'll take a look at that.  Something I'm working on though is
getting SA to work nicely on Windows -- not sure how easy it'd be to port the
DCC code to Windows, given that I still haven't been able to get it compiling
and linking right under Mac OS X, and I had one or two minor issues before I
could get it to compile cleanly on my linux box (which is a pretty nonstandard
distro).

VS> > Actually, dcc is already mostly a library... it would just involve some
VS> > documentation of the library API...
VS>
VS> The big cost is not the documenting but the freezing of the interface.
VS> For example, I've had to whack at the whitelist library code to
VS> accommodate the fancier locking needed for per-user dccm whitelists.
VS> (Each whitelist can be used by multiple processes, and each process
VS> can involve threads.  The hash file for each whitelist must be maintained
VS> automatically, and without stalling more threads or processes than
VS> absolutely necessary.... In other words, maybe there are good reasons
VS> why the sendmail automatic /etc/mail/alias updating is being deprecated.)

I think what SA would be really happy with is just two functions:
dcclookup(message) and dccsubmit(message), or maybe just one function,
dcclookupandsubmit(message), since we already handle whitelisting in SA anyway.
So we'd really only need to freeze a pretty straightforward sub-section of the
overall library API, which probably wouldn't require a lot of changes.

VS> >                                      remember that dcc actually does all
VS> > of the work on the client side, the server only sends and receives
VS> > checksum information, which means a version written in perl or whatever
VS> > would have to re-write each of the checksum routines, etc.
VS>
VS> The checksum routines would be a pain.  Another pain would be all of
VS> the code that maintains the shared map file of DCC server IP addresses,
VS> IDs, passwords, and round trip times.

Yeah.  I'm guessing the latter is probably more of a pain and that the
checksumming stuff is more straightforward?  As far as the shared map file, IDs,
passwords, round trips, etc -- that's less of an issue in the environment I'm
describing, where you're going to be talking to a local DCC server, and you
really don't care (within SA) about talking to the fastest one, there are no
password issues, etc.  If the local server is sync'ing to the outside world's
DCC servers, well it's still the same old C code all other DCC servers are
using, so there's no problem there.

I'm ultimately thinking about how this would work for Hotmail, assuming they
wanted to use SA with DCC turned on.  They process about 80,000 messages *per
second* at peak times, and the less extra hardware they'd have to purchase to
handle SA and DCC, the better.

C



