SPAM statistic

Vernon Schryver vjs@calcite.rhyolite.com
Mon Nov 10 18:43:31 UTC 2003


> From: Leandro Santi <lesanti@uolsinectis.com.ar>

> > The trick to a mailbox that is 99.9% spam free (spam leaked/total spam)
> > is using several independent filters.  For example, 4 independent
> > filters each 85% effective are better than 99.9% effective overall.
>
> (as long as the four filters are serially-connected and a single spam-type
> outcome is enough to mark the message as bad). This configuration requires 
> each one of the filters to have low false-positive rates (e.g. the DCC :).

Yes, the false positive rate (rejected good mail/total good mail) of
a series of filters connected this way is roughly the sum of the
individual false positive rates when those rates are small, and, if
the filters are independent, the false negative rate is the product
of the individual false negative rates.
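
As a rough sketch of that arithmetic, assuming the filters err
independently: the function name and the 1% false positive rate below
are made up for illustration, while the 85% effectiveness figure comes
from the quoted example.

```python
from math import prod

def any_filter_rejects(fp_rates, fn_rates):
    """Combined rates when mail is rejected if ANY filter calls it spam.

    Assumes the filters make their mistakes independently.
    """
    # Good mail survives only if every filter passes it.
    fp = 1 - prod(1 - p for p in fp_rates)
    # Spam leaks only if every filter misses it.
    fn = prod(fn_rates)
    return fp, fn

# Four filters, each missing 15% of spam (85% effective).
# The 1% false positive rate per filter is a hypothetical value.
fp, fn = any_filter_rejects([0.01] * 4, [0.15] * 4)
print(f"false positive rate: {fp:.4f}")
print(f"false negative rate: {fn:.6f}")
```

With these inputs the combined false positive rate comes out at about
3.94%, slightly under the simple sum of 4%, and the false negative
rate at about 0.05%, which is where "better than 99.9% effective"
comes from.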

You can turn it around, rejecting mail only when every filter calls it
spam, and make the overall false positive rate the product of the
individual rates and the false negative rate roughly the sum.
That's a reasonable way to use systems with high false positive rates.
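
A similar sketch for the turned-around arrangement, again assuming
independent filters. The function name and the sample rates (10% false
positives, 5% false negatives per filter) are hypothetical values
chosen only to show the arithmetic.

```python
from math import prod

def all_filters_must_reject(fp_rates, fn_rates):
    """Combined rates when mail is rejected only if EVERY filter calls it spam.

    Assumes the filters make their mistakes independently.
    """
    # Good mail is rejected only if every filter wrongly flags it.
    fp = prod(fp_rates)
    # Spam leaks if at least one filter misses it.
    fn = 1 - prod(1 - q for q in fn_rates)
    return fp, fn

# Two filters with high (hypothetical) false positive rates.
fp, fn = all_filters_must_reject([0.10, 0.10], [0.05, 0.05])
print(f"false positive rate: {fp:.4f}")
print(f"false negative rate: {fn:.4f}")
```

Here two filters that each reject 10% of good mail reject only about
1% when both must agree, at the cost of the false negative rate rising
from 5% to about 9.75%.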

SpamAssassin can be viewed as a framework for combining filters both ways.


> But I guess that some other filters aren't good enough to be used in
> serial configurations, you can't trust any of them individually because of 
> the higher false-positive rates.

Some systems claim to have very low false positive rates.  That's
often through creative accounting.  Many people refuse to count mail
as a false positive or negative when they have examined it and
declared it to be one or the other in order to train their filters.
There are also people who use some DNS blacklists and count as false
positives or negatives only mail that was handled incorrectly based on
its IP address alone; they use the circular definition of spam as any
and all mail from listed IP addresses.

On the other hand, some systems, including some DNS blacklists, do have
honestly low false positive rates.


Vernon Schryver    vjs@rhyolite.com


