FuzzyOcr 3.5.1 released

John Scully jscully@isipi.com
Mon Jan 8 02:14:59 UTC 2007


I wonder if Vernon Schryver at Rhyolite could tie FuzzyOcr into the DCC 
(Distributed Checksum Clearinghouses) project.  We operate one of the 
several hundred nodes in the DCC network, and it has been a great tool for 
spam control.  For anyone who is not familiar with it, DCC is a network of 
public and private servers that exchange floods of millions of bulk mail 
"fingerprints" based both on "spamminess" and on the sheer bulk of the 
mailings.  More information is at www.rhyolite.com

The advantage is that the DCC servers keep their checksum DB in memory and 
are lightning fast.  The OCR check would be a lot more intensive than the 
current conversion of a mail body into a set of checksums...but it would 
allow the network of servers to exchange the fingerprints of spam images.

To give you an idea, here are the key stats from our DCC server: 
22,057,457 checksums in memory, using a little over 1.1G of RAM.  We 
receive about 4,000 reports per minute from the network and send about 200 
per minute from emails we process.
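
As a rough back-of-the-envelope check on those figures, here is a small 
sketch in Python.  It assumes "1.1G" means 1.1 * 10^9 bytes, which the stats 
themselves do not confirm, and the function names are only illustrative.

```python
# Figures quoted from the DCC server stats in this message.
CHECKSUMS_IN_MEMORY = 22_057_457
REPORTS_IN_PER_MINUTE = 4_000
REPORTS_OUT_PER_MINUTE = 200

# Assumption: "1.1G" is read as 1.1 * 10**9 bytes (not GiB).
RAM_BYTES = 1.1e9


def bytes_per_checksum(ram_bytes, checksums):
    """Average memory footprint per stored checksum."""
    return ram_bytes / checksums


def per_day(per_minute):
    """Scale a per-minute rate up to a full day."""
    return per_minute * 60 * 24


if __name__ == "__main__":
    print(f"bytes per checksum: {bytes_per_checksum(RAM_BYTES, CHECKSUMS_IN_MEMORY):.1f}")
    print(f"reports received per day: {per_day(REPORTS_IN_PER_MINUTE):,}")
    print(f"reports sent per day: {per_day(REPORTS_OUT_PER_MINUTE):,}")
```

Under that assumption this works out to roughly 50 bytes per checksum, about 
5.76 million reports received per day, and 288,000 sent.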

Of course, you only need to run your own DCC server if you process well 
over 100,000 emails per day.

John Scully
isupportisp.com


----- Original Message ----- 
From: "Andy Dills" <andy@xecu.net>
To: "decoder" <decoder@own-hero.net>
Cc: <devel-spam@lists.own-hero.net>; <users@spamassassin.apache.org>
Sent: Sunday, January 07, 2007 5:42 PM
Subject: Re: FuzzyOcr 3.5.1 released


>
> On Sun, 7 Jan 2007, Andy Dills wrote:
>
>> On Sun, 7 Jan 2007, decoder wrote:
>>
>> > -----BEGIN PGP SIGNED MESSAGE-----
>> > Hash: SHA1
>> >
>> >
>> > Hello all,
>> >
>> >
>> > since 3.5.0 RC1 was released, we fixed many bugs, thanks to the many
>> > testers and bug reporters :) so big thanks.
>>
>>
>> I have something I'm curious about, having run FuzzyOcr in a medium size
>> (3-400k messages per day) mail cluster for about a week now.
>>
>> Why do you do database maintenance with every unmatched check?
>>
>> From Hashing.pm:
>>
>>         unless ($match) {
>>             my $then = time - ($conf->{focr_db_max_days}*86400);
>> --->        $sql = qq(select * from $db.$dbfile order by $dbfile.check);
>>             my $sth  = $ddb->prepare($sql); $sth->execute;
>>             while (my @row = $sth->fetchrow_array) {
>>                 my $hash2 = $row[1] || "0:0:0:0";
>>                 $hash2 .= "::$row[0]";
>>                 if (within_threshold($digest,$hash2)) {
>>                     $txt   = 'Approx';
>>                     $key   = $row[0];
>>                     $next  = $row[5] + 1;
>>                     $when  = $row[7] || $now;
>>                     $ret   = $dbfile eq $conf->{focr_mysql_hash} ? $row[8] : $row[5];
>>                     $dinfo = $row[9] || '';
>>                     infolog("Found[$dbfile]: Score='$row[8]' Info: '$row[9]'");
>>                     last;
>>                 }
>>             }
>>             # Expire old records...
>> --->        $sql = qq(delete from $db.$dbfile where $dbfile.check < $then);
>>             debuglog($sql,2);
>>             $ddb->do($sql);
>>         }
>>
>>
>> Those two queries are extremely expensive in a larger environment...I have
>> commented this code segment out on our cluster, and have written a quick
>> maintenance script that runs once per day...dropped the response time from
>> 2-3s to .01-.05s on queries, and eliminated the suddenly large
>> and customer-annoying mailqueues.
>
> Sorry to follow up to my own post, but now that I read this segment a
> little closer I realize that I'm basically commenting out the matching
> capability of the Hashing mechanism, eliminating all value of the Hashing
> in the first place.
>
> So...I guess my point is, unless there is a better way of determining the
> match than checking every single hash in the database (hoping that you
> find one that is close enough along the way), it's more efficient (in
> larger environments at least) to just scan each mail message without
> hashing enabled.
>
> Thoughts?
>
> Andy
>
> ---
> Andy Dills
> Xecunet, Inc.
> www.xecu.net
> 301-682-9972
> ---
>
> 
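
One way to read Andy's two observations above: the per-message DELETE is the 
part that can be moved out safely, while the approximate-match scan has to 
stay if hashing is to be of any use.  Here is a minimal sketch of that split 
in Python (not the Perl of Hashing.pm), using an in-memory SQLite table as a 
stand-in for the MySQL one.  The table layout, the column names, and the 
close_enough() comparison are all assumptions for illustration, since the 
real within_threshold() is not shown in the excerpt.

```python
import sqlite3
import time

SECONDS_PER_DAY = 86400


def make_db():
    """Create a hypothetical hash table; this layout is not FuzzyOcr's schema."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE hashes (id TEXT PRIMARY KEY, hash TEXT, "
        "score REAL, checked INTEGER)"
    )
    # Index the timestamp so the expiry DELETE does not need a full scan.
    db.execute("CREATE INDEX idx_checked ON hashes (checked)")
    return db


def close_enough(a, b, tolerance=2):
    """Hypothetical stand-in for within_threshold(): every colon-separated
    component must differ by at most `tolerance`."""
    pa = [int(x) for x in a.split(":")]
    pb = [int(x) for x in b.split(":")]
    return len(pa) == len(pb) and all(abs(x - y) <= tolerance for x, y in zip(pa, pb))


def find_approx(db, digest):
    """Per-message path: scan for a near match, but never delete here."""
    rows = db.execute("SELECT id, hash, score FROM hashes ORDER BY checked")
    for row_id, stored, score in rows:
        if close_enough(digest, stored):
            return row_id, score
    return None


def expire_old(db, max_days, now=None):
    """Maintenance path: meant to run from a daily job, not per message."""
    now = time.time() if now is None else now
    cutoff = int(now - max_days * SECONDS_PER_DAY)
    cur = db.execute("DELETE FROM hashes WHERE checked < ?", (cutoff,))
    db.commit()
    return cur.rowcount  # number of expired rows removed
```

Note that this only removes the DELETE from the per-message path.  The lookup 
in find_approx() is still linear in the size of the table, which is the 
remaining cost Andy is asking about.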




