fuzzy check sums

Michael Grant mg-dcc1@grant.org
Sun Mar 17 07:52:10 UTC 2002


I'm new to this list.  I must admit that I've had the idea of using
fuzzy checksums to spot spam for years.  Recently I started working
on something to do this; then, a couple of days ago, a friend pointed
me at the DCC project.  Oh well, it figures: someone was bound to have
had the same idea!

I have made some interesting headway on my own fuzzy functions.  I had
a brief look at fuz1 and fuz2 in the source.  fuz1 seems to be based
on MD5.  I was never able to get enough fuzz out of MD5 myself, even
when computing MD5 sums per line and such.
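
To illustrate why per-line MD5 gives so little fuzz, here is a minimal
Python sketch.  It is not DCC's fuz1 code; the function names
line_digests and shared_fraction and the sample messages are made up
for the illustration.  Any edit to a line, however small, produces an
unrelated digest, so two near-identical messages can share no line
digests at all.

```python
import hashlib


def line_digests(message: str) -> list[str]:
    """Return the MD5 hex digest of every non-blank line."""
    return [
        hashlib.md5(line.encode("utf-8")).hexdigest()
        for line in message.splitlines()
        if line.strip()
    ]


def shared_fraction(a: str, b: str) -> float:
    """Fraction of line positions whose digests are identical."""
    da, db = line_digests(a), line_digests(b)
    if not da or not db:
        return 0.0
    same = sum(1 for x, y in zip(da, db) if x == y)
    return same / max(len(da), len(db))


if __name__ == "__main__":
    original = "Buy cheap pills now\nClick here to order\n"
    # Hypothetical variant: one character or one word added per line.
    variant = "Buy cheap pills now!\nClick here to order, Bob\n"
    print(shared_fraction(original, original))
    print(shared_fraction(original, variant))
```

A spammer only has to touch each line once to defeat this kind of
exact matching, which is the lack of fuzz described above.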

What I found worked surprisingly well was simply to convert the
space-separated words on each line of a message to numbers and take
the root mean square of those numbers, line by line.  I'm happy to
share the code.  Should I post it here or what?
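
A rough Python sketch of the idea described above might look like the
following.  It is not the code offered here, and several details are
assumptions: the message does not say how a word is converted to a
number, so the sketch sums the character codes; the tolerance used to
call two lines "close" and the names word_value, line_rms, signature
and similarity are likewise invented for the illustration.

```python
import math


def word_value(word: str) -> int:
    # Assumption: a word's number is the sum of its character codes.
    return sum(ord(c) for c in word)


def line_rms(line: str) -> float:
    """Root mean square of the word values on one line (0.0 if blank)."""
    values = [word_value(w) for w in line.split()]
    if not values:
        return 0.0
    return math.sqrt(sum(v * v for v in values) / len(values))


def signature(message: str) -> list[float]:
    """One RMS value per non-blank line of the message."""
    return [line_rms(l) for l in message.splitlines() if l.split()]


def similarity(a: str, b: str, tolerance: float = 5.0) -> float:
    """Fraction of line positions whose RMS values differ by at most
    `tolerance` (an arbitrary, untuned threshold)."""
    sa, sb = signature(a), signature(b)
    if not sa or not sb:
        return 0.0
    close = sum(1 for x, y in zip(sa, sb) if abs(x - y) <= tolerance)
    return close / max(len(sa), len(sb))


if __name__ == "__main__":
    spam = "Buy cheap pills now\nClick here to order today\n"
    shuffled = "cheap pills Buy now\nClick here to order today\n"
    # Word order does not affect the RMS, so this prints 1.0.
    print(similarity(spam, shuffled))
```

Unlike a cryptographic hash, the RMS is unchanged by reordering the
words on a line and moves only gradually as words are edited, so
near-duplicate lines can still be matched within a tolerance.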

I also ran some tests against my old email to see how many false
positives I would get.  For me, it was about 1 in 150,000, and I have
to say that the one message flagged did closely resemble one of the
spams in my spam file.

Michael Grant


