Idea for fuzzy checksum

Vernon Schryver
Mon Apr 5 21:59:44 UTC 2004

> From: Andreas Schmitz 

> ...
> 1. save all words of past emails and how often they occur in spam or
> non-spam on every dcc-client (like it is done with bayesian filters)
> 2. then if new mail arrives at the client, go with a window of maybe
> 10 words over the body of the mail
> 3. calculate for every single window (15 words) the probability to be
> spam (with bayesian chain rule)
> 4. then generate a checksum of the window with the highest probability
> to be spam and report it to the dcc-server
> 5. get the current counter back and make your decision...

How do you avoid the false positives that are otherwise inevitable
for Bayesian spam filters?  Two messages containing the same 15-word
phrase can be quite distinct.  The obvious example is a message
reporting spam and prefixed with "please stop sending this junk"
or "please terminate your user that sent this garbage."
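
To make that objection concrete, here is a rough sketch of the proposed
scheme as I read it.  Everything in it is an assumption for illustration:
the function name window_checksum, the add-one smoothing, the use of SHA-1,
and the word counts are all made up, and none of it is DCC code.  It scores
each window by summed log odds, checksums the highest scoring window, and
shows that a complaint quoting the spam yields the same checksum as the
spam itself.

```python
import hashlib
import math


def window_checksum(body, spam_counts, ham_counts, size=15):
    """Checksum of the window of `size` words that looks most like spam.

    Hypothetical sketch: words are scored with add-one smoothed log odds,
    and a window's score is the sum of its word scores.
    """
    words = body.lower().split()

    def log_odds(word):
        spam = spam_counts.get(word, 0) + 1
        ham = ham_counts.get(word, 0) + 1
        return math.log(spam / ham)

    # A body shorter than `size` words yields one short window.
    last_start = max(1, len(words) - size + 1)
    windows = [words[i:i + size] for i in range(last_start)]
    best = max(windows, key=lambda win: sum(log_odds(w) for w in win))
    return hashlib.sha1(" ".join(best).encode("utf-8")).hexdigest()


# Made-up per-client word counts.
spam_counts = {"click": 50, "here": 40, "to": 30, "buy": 60,
               "cheap": 70, "pills": 80, "now": 20}
ham_counts = {"please": 40, "stop": 10, "sending": 10, "this": 50,
              "junk": 5, "to": 30, "here": 10, "now": 40}

spam = "click here to buy cheap pills now"
complaint = "please stop sending this junk " + spam

# Both messages are reduced to the same window, so the server cannot
# tell the complaint from the spam it quotes.
same = (window_checksum(spam, spam_counts, ham_counts, size=4)
        == window_checksum(complaint, spam_counts, ham_counts, size=4))
print(same)
```

The window size is cut to 4 words only to keep the example short.  The
same collision happens with 15 words whenever the quoted spam is at least
that long.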

> ...
> The idea behind that is, that they can add a million random words, but
> they can't say 'click here to buy' in a million different phrases.

Are you trying to detect spam with nasty phrases or mail that is
bulk?  The DCC only tries to detect bulk mail.  Bayesian spam filters
try to detect spam with nasty words or phrases.  I doubt the two
goals are compatible.

If the idea is to produce a distributed Bayesian filter, then I think
it would be better to start with that as an explicit goal.  Perhaps
you could build a system that would share daily summaries of the
words with the highest positive and negative counts.

However, I do not see how you avoid vulnerabilities to data poisoning
by spammers.  Whether you share checksums of windows, word counts, or
anything with equivalent information, why can't spammers contribute
GBytes of bogus values?  Even if the distributed system counted only
signs of spam and not signs of legitimate mail, why couldn't spammers
contribute lots of reports of words, phrases, windows, or whatever
that would make practically any legitimate message appear to be spam?
Worse, how do you handle well-meaning anti-spammers who misconfigure
their systems to report everything as spam?

The DCC defends against those tactics or mistakes by examining entire
(non-binary) messages and by saying "Go ahead.  The only mail you
can mark as bulk is mail that you have seen and so sent or received.
Unless it is bulk, no one else will see it and so you can't do
enough harm to worry about."
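
The difference can be shown with a toy model.  This is only a sketch
under loud assumptions: the class names WholeMessageCounter and
SharedWordCounter are invented, SHA-1 over the lowercased,
whitespace-normalized body stands in for a whole-message checksum, and
real DCC checksums and protocol details are not modeled.  A bogus report
against whole-message checksums touches only that exact message, while
the same report against shared word counts taints unrelated mail.

```python
import hashlib
from collections import Counter


def message_checksum(body):
    # Stand-in for a whole-message checksum (an assumption, not DCC's).
    normalized = " ".join(body.lower().split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


class WholeMessageCounter:
    """Toy server that counts reports per whole-message checksum."""

    def __init__(self):
        self.counts = Counter()

    def report(self, body, n=1):
        self.counts[message_checksum(body)] += n

    def lookup(self, body):
        return self.counts[message_checksum(body)]


class SharedWordCounter:
    """Toy server that counts 'spam' reports per word."""

    def __init__(self):
        self.spam_words = Counter()

    def report(self, body, n=1):
        for word in set(body.lower().split()):
            self.spam_words[word] += n

    def spam_hits(self, body):
        words = set(body.lower().split())
        return sum(1 for word in words if self.spam_words[word] > 0)


legit = "meeting moved to 3pm tomorrow, see you there"
bogus = "meeting tomorrow moved see you there to"  # attacker's report

whole = WholeMessageCounter()
whole.report(bogus, 1000000)
print(whole.lookup(legit))     # the legitimate message is untouched

words = SharedWordCounter()
words.report(bogus, 1000000)
print(words.spam_hits(legit))  # most of its words now look like spam
```

In the first case the attacker has to have the exact message in hand to
hurt it, which is the point of the argument above.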

Vernon Schryver
