Idea for fuzzy checksum

Andreas Schmitz andi@upjohns.org
Mon Apr 5 17:31:07 UTC 2004


Hi everyone,
I am new to the whole spam thing and to the list, but I have an idea for
a new fuzzy checksum. I checked the archive and couldn't find anything
like it; please correct me if someone has already had this idea.

I saw that some people are complaining about false negatives caused by
the large amount of personalization (mostly random words) in spam. I
looked around but couldn't find anything about the implementation of
the current fuzzy checksums (only the request not to discuss it in
public), which leads me to suspect that they are tied too closely to
fixed characteristics of a spam message body.

Then I had the idea of generating a new fuzzy checksum with the help of
Bayes' rule, so the checksum would be based on statistics and on the
current behaviour of spam.

What do you think about the following idea:

1. save all words of past emails and how often they occur in spam or
non-spam on every DCC client (as is done with Bayesian filters)

2. when new mail arrives at the client, slide a window of maybe 15
words over the body of the mail

3. calculate for every single window (15 words) the probability of
being spam (with the Bayesian chain rule)

4. generate a checksum of the window with the highest spam probability
and report it to the DCC server (a rough sketch in code follows this
list)

5. get the current counter back and make your decision...
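
To make steps 2-4 more concrete, here is a rough sketch in Python. It
assumes the word statistics from step 1 already exist as a simple
dictionary of per-word spam probabilities; the names (word_prob,
spam_window_checksum) and the md5 digest are just my own placeholders,
not anything from the real DCC code:

import hashlib
import re

WINDOW = 15  # window size from step 2

def bayes_combine(probs):
    # Bayesian chain rule (Graham-style combining):
    #   P = (p1*...*pn) / (p1*...*pn + (1-p1)*...*(1-pn))
    prod_spam = 1.0
    prod_ham = 1.0
    for p in probs:
        prod_spam *= p
        prod_ham *= (1.0 - p)
    if prod_spam + prod_ham == 0.0:
        return 0.5
    return prod_spam / (prod_spam + prod_ham)

def spam_window_checksum(body, word_prob, window=WINDOW):
    # Step 2: slide a window over the words of the body.
    words = re.findall(r"[a-z0-9'$-]+", body.lower())
    best_prob, best_window = -1.0, []
    for i in range(max(1, len(words) - window + 1)):
        chunk = words[i:i + window]
        # Step 3: score the window; unknown words count as a neutral 0.5.
        prob = bayes_combine([word_prob.get(w, 0.5) for w in chunk])
        if prob > best_prob:
            best_prob, best_window = prob, chunk
    # Step 4: checksum the most spam-like window.  Reporting it to the
    # DCC server (and step 5) is not shown here.
    digest = hashlib.md5(" ".join(best_window).encode()).hexdigest()
    return digest, best_prob

For example, with a toy table like
word_prob = {"click": 0.95, "here": 0.6, "buy": 0.97, "meeting": 0.05}
the checksum is taken from the window around "click here to buy", no
matter which random words surround it.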

I think this would be a checksum that evolves with spam. The only
problem I see is that the checksum is generated from such a small
phrase. This phrase would always be marked as spam and you couldn't use
it in legitimate mail (until it has not been reported to any DCC server
for some period of time and is automatically deleted). But this is the
case with current Bayesian filters too (you can't write anything
legitimate about v....., I don't even want to spell it here). To cut
down the occurrence of legitimate phrases in the DCC database, you
could set a minimum for the phrase's spam probability. If the best
window is under the minimum, you could shrink the window to 14 words
and do the whole thing again. But I don't know whether that is a good
idea; maybe it would only lead to a global Bayesian database of single
words (bad or good?).
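
That fallback could look roughly like this, reusing the
spam_window_checksum sketch from above (the minimum of 0.9 and the
lower bound of 10 words are made-up values, not a recommendation):

def checksum_with_minimum(body, word_prob, minimum=0.9,
                          window=15, smallest=10):
    # Shrink the window by one word and rescore until the best window
    # reaches the minimum spam probability, or give up.
    while window >= smallest:
        digest, prob = spam_window_checksum(body, word_prob, window)
        if prob >= minimum:
            return digest, prob
        window -= 1
    return None  # nothing spam-like enough to report to the DCC server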

The idea behind this is that spammers can add a million random words,
but they can't say 'click here to buy' in a million different phrases.

OK, I see that it would build a global database of many forbidden
phrases, but you still have your whitelist... and maybe it is not a
bad idea.

What do you think of that? I am looking forward to some comments...
and I know there is something I forgot ;)

Greetings, Andreas




