[ SPAM ] Re: Good starting numbers for spamassassins dcc

Michał Grzędzicki lazy@iq.pl
Sat May 2 16:30:10 UTC 2009


On 2009-05-02, at 15:39, Vernon Schryver wrote:

>> From: Michał Grzędzicki <lazy@iq.pl>
>
>> By default SpamAssassin uses 99999 as dcc_body/fuz1/fuz2_max, which is the
>> same as DCC's "many".
>> This is only 1/6th of the messages.
>
> Are you referring to the difference between 19% tagged as "many" and
> the 51% with bulky counts according to the graphs for your server at
> https://www.rhyolite.com/dcc/private/.... ?
Yes, so trapped spam is "many" and likely spam is something around 10?


> That is a low value.  Are you doing DCC filtering after other filters?
Only some RBL blackholing + SPF is done before.
I'm talking only about messages that get the DCC_CHECK score for
having "many" occurrences, and that is only 17 000 out of 102 000.
This is caused by SpamAssassin always requiring "many" reports.

>
>> I'm planning to add variable scoring to SpamAssassin's DCC.pm to make it
>> more useful (now only messages with "many" reports are flagged).
>> I'm thinking about 40 reports getting 1/10 of the base score, up to 10 000
>> reports (or "many", where does it start?) getting the whole base score.
>> 500 reports may be treated as likely spam with 1/2 of the base score; in
>> between, maybe use 2 linear functions or one of higher order.
>>
>> The base score should be around 4/5 of the mark-as-spam score.
>>
>> What would be good thresholds for very unlikely spam, likely spam, and
>> surely spam?
>
> I doubt that would help.  The DCC detects bulk email.  Spam is unsolicited
> bulk email.  Mail messages that have been seen 100 or 10,000 times are
> equally bulky, and neither is more likely to be spam.  Contrast Amazon
> online order confirmations with Amazon advertisements.  Both are very
> bulky, but only some of the Amazon advertisements are spam.
>
> That is why I have always said the best way to use DCC is with per-user
> whitelists.  Each user's whitelist indicates which streams of bulk mail
> are solicited.
>
>
> I think the SpamAssassin threshold of "many"/99999 is far too high.
> The SpamAssassin conversion of "many" to 99999 is a kludge that should
> not have been coded.
> Instead, SpamAssassin should look for "bulk" in the X-DCC header
> and the dccifd or dccproc thresholds should tell dccifd or dccproc
> whether to add "bulk".  See DCCM_REJECT_AT and DCCM_REJECT_AT
> in /var/dcc/dcc_conf.  See also -c and -t in the dccproc and dccifd
> man pages; -t could be added to DCCIFD_ARGS

You are right; for now I will lower the requirements in SpamAssassin.
Maybe in the future I will add some "bonus" points for popular spams with
more than 1000 reports, to make sure they are flagged as spam even if other
filters didn't catch them.
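
A minimal sketch of the variable-scoring idea quoted earlier (40 reports at 1/10 of the base score, 500 at 1/2, 10 000 or "many" at the full score, joined by two linear segments). It is written in Python for illustration only and is not how DCC.pm works; the function name `dcc_score` and the default base score of 4.0 (4/5 of an assumed spam threshold of 5.0) are assumptions.

```python
# Breakpoints from the quoted proposal: (report count, fraction of base score).
BREAKPOINTS = [(40, 0.1), (500, 0.5), (10_000, 1.0)]


def dcc_score(reports, base_score=4.0):
    """Scale base_score by a piecewise-linear function of the DCC report count.

    `reports` is an integer count or the string "many".
    base_score=4.0 is an assumed value, not taken from SpamAssassin.
    """
    if reports == "many":
        return base_score
    n = int(reports)
    if n < BREAKPOINTS[0][0]:
        return 0.0          # below the first breakpoint: no score at all
    if n >= BREAKPOINTS[-1][0]:
        return base_score   # at or above the last breakpoint: full score
    # Find the segment that contains n and interpolate linearly inside it.
    for (x0, y0), (x1, y1) in zip(BREAKPOINTS, BREAKPOINTS[1:]):
        if x0 <= n <= x1:
            fraction = y0 + (y1 - y0) * (n - x0) / (x1 - x0)
            return base_score * fraction
```

With these breakpoints a message seen 270 times lands halfway along the first segment, so it would get 3/10 of the base score.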

DCC.pm checks for "X-DCC: bulk" only if it has been added upstream.
dcc_conf mentions 50 as the bulk mail count, so I will start with
something around 200,
with a lower score value (most of the spams which get through already
have some points from other checks).
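
As a rough illustration of checking counts against a threshold of around 200, here is a Python sketch that pulls the Body/Fuz1/Fuz2 counts out of an X-DCC header value. The header layout in the example (`Body=... Fuz1=... Fuz2=...`) is an assumption about what dccproc or dccifd emits, and the helper names `dcc_counts` and `is_bulky` are hypothetical; this is not the DCC.pm code.

```python
import re


def dcc_counts(header_value):
    """Return {checksum name: count} from an X-DCC header value.

    "many" is mapped to infinity so it exceeds any numeric threshold.
    """
    counts = {}
    for name, value in re.findall(r"\b(Body|Fuz1|Fuz2)=(\d+|many)\b", header_value):
        counts[name.lower()] = float("inf") if value == "many" else int(value)
    return counts


def is_bulky(header_value, threshold=200):
    """True if any checksum count reaches the threshold (200 is the value proposed above)."""
    return any(count >= threshold for count in dcc_counts(header_value).values())


# Hypothetical header value, for illustration only.
example = "mail.example.com 1234; Body=3 Fuz1=250 Fuz2=many"
print(dcc_counts(example), is_bulky(example))
```

A single threshold is used for all three checksums here; separate thresholds per checksum would only need a dictionary in place of the `threshold` argument.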


Whitelisted emails aren't scanned at all, so I don't have to worry about
that at the DCC level.


>
>
>> I'm guessing this is the right ordering of the body, fuz1, and fuz2
>> checksums, with body getting the most reports and fuz2 the fewest.
>> Is this right?
>
> If I understand the question, no.  All of the checksums are computed
> on all mail messages, but only reports of the most bulky checksums are
> flooded among DCC servers.  Body checksums are not at all fuzzy, and
> so minimal personalizations can make each copy of spam have differing
> DCC body checksums.

I guess I wasn't clear about that, sorry.
I hope this time it will be clearer.
Are fuz1 and fuz2 computed from the same parts of the email, e.g. sender,
subject, X-Client + body, or does fuz2 take more headers? Then very
similar spams could have the same body hash and the same fuz1 but a
different fuz2, because fuz2 takes into account the X-Client header, which
differs between the 2 spams. Or maybe they take the same subset of the
email, headers + body, but use a different fuzzing algorithm (like omitting
whitespace, ignoring case, etc., to ignore minor differences in spams)?

If they use the same subset of headers + body, there's no point in
differentiating thresholds for fuz1 and fuz2, and if fuz2 takes in more
data, it should have a smaller threshold than fuz1.


-- 
Michał Grzędzicki



