Dale_Whiteaker-Lewis@Dell.com
Thu Aug 29 20:55:23 UTC 2002
It's obvious that I need to give a better description of the situation as I've configured it; I apologize that I did not do that in the first place.

I have used and plan to use dccm simply to mark the headers of bulk messages. Every message is then fed to a separate Milter (MIMEDefang), which runs all messages under a certain size (and matching a few other criteria) through SpamAssassin. A local SpamAssassin rule (in addition to the 400+ canned rules) uses the DCC X- headers as input and adds to the SA score for messages marked as bulk by DCC. This prevents bulkness from being the sole measure of spamishness, which is a hard and fast requirement at my location.

If the MIMEDefang/SA combination rejects the message as spam, MIMEDefang removes all original recipients (turning them into X- headers of their own, a privacy concern at this point) and substitutes a single recipient, which is mailertable'd to be forwarded to a cluster of spam quarantine boxes. Procmail is used on those boxes to sort the messages onto disk based on the first few characters of the name of the first original recipient. This may seem cryptic, but it results in the successful diversion of all mail scored by DCC, MIMEDefang and SpamAssassin to a central quarantine. The procmaillog is then used as an index of the quarantine area, for searching purposes.

My idea below changes nothing about this system except the procmail process near the end of this Rube Goldberg machine: it would have access to the actual hash value that exceeded the threshold set in the DCC client, and would use that hash value as the filename for the quarantined file. The procmaillog file will preserve the message's original recipients (due to a local modification I made) and the message's sender, but may point to the same file as hundreds of other procmaillog entries. This ultimately saves a great deal of disk space. This is the modification I was trying to describe below.
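
To make the hash-as-filename idea concrete, here is a minimal sketch in Python rather than procmail. It rests on assumptions that are not part of the setup described above: the header name `X-Bulk-Hash` is hypothetical (dccm is not known to emit such a header, which is the point of the question), the `index.log` file merely stands in for the procmaillog, and the SHA-256 fallback is an exact digest of the body, not a DCC fuzzy checksum, so it only collapses byte-identical bodies.

```python
import hashlib
from email import message_from_bytes
from pathlib import Path

# Hypothetical header name; nothing in the thread says dccm emits it.
HASH_HEADER = "X-Bulk-Hash"


def quarantine_name(raw: bytes) -> str:
    """Pick the quarantine filename for a raw message."""
    marked = message_from_bytes(raw).get(HASH_HEADER)
    if marked:
        # Keep only hex digits so the header value is safe as a filename.
        cleaned = "".join(c for c in str(marked).lower() if c in "0123456789abcdef")
        if cleaned:
            return cleaned
    # Fallback (assumption): exact SHA-256 of the body, a stand-in for the
    # checksum that exceeded the DCC threshold.
    _, _, body = raw.replace(b"\r\n", b"\n").partition(b"\n\n")
    return hashlib.sha256(body).hexdigest()


def quarantine(raw: bytes, recipients: list[str], root: Path) -> Path:
    """Store one copy per hash and log every delivery in an index file."""
    root.mkdir(parents=True, exist_ok=True)
    path = root / quarantine_name(raw)
    path.write_bytes(raw)  # later copies overwrite the same file
    sender = str(message_from_bytes(raw).get("From", "unknown"))
    # Stand-in for the procmaillog: one line per quarantined delivery.
    with open(root / "index.log", "a", encoding="utf-8") as log:
        log.write(f"{path.name}\t{sender}\t{','.join(recipients)}\n")
    return path
```

Two deliveries with the same body end up as one file and two index lines, which is where the disk saving would come from. The index then holds every recipient of that single copy, so it carries the same privacy exposure as the X- recipient headers.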
I hope that is a more trenchant (if not more concise) telling of my tale.

Dale Whiteaker-Lewis
Network Engineer
Dell Computer

-----Original Message-----
From: Vernon Schryver [mailto:vjs@calcite.rhyolite.com]
Sent: Thursday, August 29, 2002 11:29 AM
To: Dale_Whiteaker-Lewis@Dell.com; dcc@calcite.rhyolite.com
Subject: Re: Looking for critique of idea for local integration of DCC and SA

> From: Dale_Whiteaker-Lewis@Dell.com

> ...
> If, upon classifying a message as "bulk", DCC (through dccm) were to
> mark the headers with the actual hash that exceeded the threshold (not
> sure that's feasible), the hash itself could be used as the filename
> in quarantine. This would have the advantage of continually
> overwriting a single copy of the bulk message, rather than
> quarantining thousands of near-identical copies. Why would I go to
> these lengths? If a message were seen as bulk, yet was business
> critical, a single copy of it would exist in the quarantine and could
> be searched for and retrieved using data in the procmaillog file.
> This occurs to me as one way to provide most of the benefit of DCC to
> my network infrastructure with the assurance that no data would be
> lost. Messages that did not exceed any threshold would be stored
> individually.

Unless the majority of your mail is spam, wouldn't that last sentence imply that most of your message storage is spent on legitimate mail? Why store messages that do not exceed a bulk threshold instead of delivering them (and so storing them in mailboxes)?

Dccproc is likely to be significantly more expensive than dccm. I wouldn't be surprised if a busy SMTP server would need to be extremely muscular to apply SpamAssassin to every message.

The dccm log files contain at most the first 30 KBytes of the body. A server dealing with 1,000,000 messages/day would need fewer than 30 GBytes of storage per day. So why not just store everything using dccproc or dccm log files?
(Should there be yet another option to dccm to remove that 30K log limit?) (dccproc records the entire message.)

If you did store a single copy of bulk mail how do you deal with the privacy concerns of letting people know who else got the message? Sometimes the "blind" part of "bcc" is important to people. (That is why dccm creates separate per-user log files for each addressee of a single message with many RCPTs, and why sendmail does not put the envelope addressees in Received headers when there are more than one.) Where would you record all of the addressees for the single copy of a message that arrives in separate SMTP transactions?

Vernon Schryver    vjs@rhyolite.com
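
The storage figure in the reply above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes every message has one log file that fills the 30 KByte body cap, reads "KByte" as 1,000 bytes, and ignores headers and any other log overhead. The function name is made up for the illustration.

```python
def daily_log_upper_bound(messages_per_day: int, cap_bytes: int) -> int:
    """Worst-case bytes of logged message bodies per day.

    Assumes one log file per message and that every body reaches the cap.
    """
    return messages_per_day * cap_bytes


# 1,000,000 messages/day, each logged body capped at 30 KBytes
# (decimal kilobytes assumed).
worst_case = daily_log_upper_bound(1_000_000, 30_000)
print(f"{worst_case / 10**9:.0f} GBytes/day at most")
```

Because most bodies are shorter than the cap, the real figure would fall below this bound, in line with "fewer than 30 GBytes". Per-addressee log files for messages with many RCPTs would push it back up.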