Looking for critique of idea for local integration of DCC and SA

Dale_Whiteaker-Lewis@Dell.com Dale_Whiteaker-Lewis@Dell.com
Thu Aug 29 20:55:23 UTC 2002


	It's obvious that I need to give a better description of the
situation as I've configured it; I apologize for not doing that in the
first place.  I have used and plan to use dccm simply to mark the headers of
bulk messages.  Every message is then fed to a separate Milter (MIMEDefang),
which runs all messages under a certain size (and matching a few other
criteria) through SpamAssassin.  A local SpamAssassin rule (in addition to
the 400+ canned rules) uses the DCC X- headers as input and adds to the SA
score for messages marked as bulk by DCC.  This prevents bulkness from being
the sole measure of spamishness, which is a hard-and-fast requirement at my
location.
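
As a rough illustration of the scoring arrangement described above (not the actual local SpamAssassin rule), the Python sketch below adds a fixed number of points when a DCC header marks a message as bulk, so that bulkness alone cannot push a message over the threshold. The header pattern, the "bulk" keyword test, the weight, and the cutoff are all assumptions made for the example.

```python
import re
from email import message_from_string

# Assumption: dccm adds a header named like "X-DCC-<brand>-Metrics" and its
# value contains the word "bulk" when a checksum count exceeded the threshold.
DCC_HEADER = re.compile(r"^X-DCC-.*-Metrics$", re.IGNORECASE)
DCC_BULK_POINTS = 2.0  # hypothetical weight given to the DCC verdict
SPAM_THRESHOLD = 5.0   # hypothetical overall cutoff


def dcc_says_bulk(msg):
    """Return True if any DCC metrics header marks the message as bulk."""
    for name, value in msg.items():
        if DCC_HEADER.match(name) and re.search(r"\bbulk\b", value):
            return True
    return False


def combined_score(msg, other_rules_score):
    """Add the DCC points to the score produced by all other rules."""
    bonus = DCC_BULK_POINTS if dcc_says_bulk(msg) else 0.0
    return other_rules_score + bonus


def is_spam(msg, other_rules_score):
    """Bulkness contributes to the verdict but never decides it alone."""
    return combined_score(msg, other_rules_score) >= SPAM_THRESHOLD
```

Because DCC_BULK_POINTS is kept below SPAM_THRESHOLD, a bulk message that trips no other rule is still delivered.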
	If the MIMEDefang/SA combination rejects the message as spam, all
original recipients are removed by MIMEDefang (and turned into X- headers of
their own, a privacy concern at this point) and a single recipient is
substituted, which is mailertable'd to forward to a cluster of spam
quarantine boxes.  Procmail is used on those boxes to sort the messages onto
disk based on the first few characters of the name of the first original
recipient.  This may seem cryptic, but it results in the successful diversion
of all mail scored as spam by DCC, MIMEDefang and SpamAssassin to a central
quarantine.  The procmaillog is then used as an index of the quarantine
area, for searching purposes.
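
The sorting step could be sketched roughly as follows. This is a Python stand-in for the procmail recipe, not the recipe itself; the quarantine root, the prefix length, and the character sanitising are hypothetical choices.

```python
import re
from pathlib import PurePosixPath

QUARANTINE_ROOT = PurePosixPath("/var/quarantine")  # hypothetical location
PREFIX_LEN = 2  # hypothetical value for "the first few characters"


def quarantine_dir(original_recipients):
    """Pick a directory from the name of the first original recipient."""
    if not original_recipients:
        raise ValueError("message has no original recipients")
    local_part = original_recipients[0].split("@", 1)[0].lower()
    # Replace anything unusual so the name is safe to use as a path component.
    safe = re.sub(r"[^a-z0-9]", "_", local_part)
    prefix = safe[:PREFIX_LEN] or "_"
    return QUARANTINE_ROOT / prefix
```

For example, a message whose first original recipient was Dale_W@example.com would be filed under /var/quarantine/da.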
	My idea below does not change anything about this system except the
procmail process near the end of this Rube Goldberg machine: it would have
access to the actual hash value that exceeded the threshold set in the DCC
client, and would use that hash value as the filename for the quarantined file.
The procmaillog file will preserve the message's original recipients (due to
a local modification I made) and the message's sender, but an entry may point
to the same file as hundreds of other procmaillog entries.  This ultimately
saves a great deal of disk space.  This is the modification I was trying to
describe below.
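
A minimal sketch of the proposed storage scheme, in Python rather than procmail. Since it is not established here that dccm can expose the checksum that exceeded the threshold, a SHA-256 digest of the body stands in for it; that digest only matches byte-identical bodies, whereas a DCC fuzzy checksum would also match near-identical ones. The function name, the paths, and the tab-separated index format are assumptions.

```python
import hashlib
import os


def quarantine(body, sender, recipients, store_dir, index_path):
    """Store one copy per distinct body and log every arrival in an index."""
    # Stand-in for the DCC checksum: an exact-match digest of the body bytes.
    digest = hashlib.sha256(body).hexdigest()
    os.makedirs(store_dir, exist_ok=True)
    path = os.path.join(store_dir, digest)
    # Write the body only the first time this digest is seen.
    if not os.path.exists(path):
        with open(path, "wb") as f:
            f.write(body)
    # The index plays the role of procmaillog: one line per arrival, each
    # pointing at the shared file by its digest.
    with open(index_path, "a", encoding="utf-8") as log:
        log.write(f"{digest}\t{sender}\t{','.join(recipients)}\n")
    return path
```

The sketch keeps the first copy rather than overwriting it on every arrival. With an exact digest the two are equivalent; with a fuzzy checksum they are not, because the stored copy would then be whichever near-identical variant was kept.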

I hope that is a more trenchant (if not more concise) telling of my tale.


Dale Whiteaker-Lewis
Network Engineer
Dell Computer

-----Original Message-----
From: Vernon Schryver [mailto:vjs@calcite.rhyolite.com] 
Sent: Thursday, August 29, 2002 11:29 AM
To: Dale_Whiteaker-Lewis@Dell.com; dcc@calcite.rhyolite.com
Subject: Re: Looking for critique of idea for local integration of DCC and
SA


> From: Dale_Whiteaker-Lewis@Dell.com

> ...
> 	If, upon classifying a message as "bulk", DCC (through dccm) were to
> mark the headers with the actual hash that exceeded the threshold (not 
> sure that's feasible), the hash itself could be used as the filename 
> in quarantine.  This would have the advantage of continually 
> overwriting a single copy of the bulk message, rather than 
> quarantining thousands of near-identical copies.  Why would I go to 
> these lengths?  If a message were seen as bulk, yet was business 
> critical, a single copy of it would exist in the quarantine and could 
> be searched for and retrieved using data in the procmaillog file.  
> This occurs to me as one way to provide most of the benefit of DCC to 
> my network infrastructure with the assurance that no data would be 
> lost.  Messages that did not exceed any threshold would be stored 
> individually.

Unless the majority of your mail is spam, wouldn't that last sentence imply
that most of your message storage is spent on legitimate mail? Why store
messages that do not exceed a bulk threshold instead of delivering them (and
so storing them in mailboxes)?

Dccproc is likely to be significantly more expensive than dccm. I wouldn't
be surprised if a busy SMTP server would need to be extremely muscular to
apply SpamAssassin to every message.

The dccm log files contain at most the first 30 KBytes of the body. A server
dealing with 1,000,000 messages/day would need at most about 30 GBytes of
storage per day.  So why not just store everything using dccproc or dccm log
files? (Should there be yet another option to dccm to remove that 30K log
limit?) (dccproc records the entire message.)
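
The storage estimate above can be checked with a few lines of Python. The sketch assumes decimal units (1 KByte = 1000 bytes, 1 GByte = 1000^3 bytes) and counts only the logged body, ignoring headers and any other per-file overhead.

```python
MESSAGES_PER_DAY = 1_000_000
LOG_BODY_LIMIT_BYTES = 30 * 1000  # "30 KBytes", taken as decimal units

# Worst case: every message body reaches the 30 KByte log limit.
worst_case_bytes = MESSAGES_PER_DAY * LOG_BODY_LIMIT_BYTES
worst_case_gbytes = worst_case_bytes / 1000**3

print(worst_case_gbytes)  # 30.0
```

Most bodies will be shorter than the limit, so the real figure would normally fall below this bound.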

If you did store a single copy of bulk mail, how would you deal with the privacy
concerns of letting people know who else got the message? Sometimes the
"blind" part of "bcc" is important to people. (That is why dccm creates
separate per-user log files for each addressee of a single message with many
RCPTs, and why sendmail does not put the envelope addressees in Received
headers when there is more than one.)

Where would you record all of the addressees for the single copy of a
message that arrives in separate SMTP transactions?


Vernon Schryver    vjs@rhyolite.com




