Problem on dcc 1.3.30 - Continue Not Asking DCC...

Vernon Schryver vjs@calcite.rhyolite.com
Thu Mar 9 16:58:33 UTC 2006


> From: Breno Moiana 

> We have a DCC server set up on an email provider, handling around 3 
> million email messages a day.

That volume could easily justify a second DCC server.  The DCC client
code prefers the fastest working known DCC server.  When the currently
chosen server stops working, it tries another.
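
As a rough illustration of that selection rule (not the actual DCC client
code), the sketch below picks the answering server with the lowest measured
round-trip time and falls back to the next one when it stops answering.
The `Server` class, the `pick_server` function, and the server labels are
hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Server:
    name: str                 # hypothetical label for a DCC server
    rtt_ms: Optional[float]   # last measured round-trip time; None = not answering


def pick_server(servers):
    """Return the answering server with the lowest RTT, or None if all are down."""
    working = [s for s in servers if s.rtt_ms is not None]
    return min(working, key=lambda s: s.rtt_ms) if working else None


servers = [Server("local-a", 0.8), Server("local-b", 1.5)]
first = pick_server(servers)    # the faster of the two working servers
servers[0].rtt_ms = None        # assume local-a stops answering
second = pick_server(servers)   # the client falls back to local-b
```

With only one server in the list, the fallback returns nothing, which is
the situation in which the client has no one left to ask.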

> Without any apparent reason, something happens to DCC that makes it stop 
> responding. Here is the log from the beginning of the problem:
>
> : Mar  9 09:29:38 dcc dccifd[4782]: no DCC answer from 127.0.0.1,6277 
> after 18264 ms
> : Mar  9 09:29:38 dcc dccifd[4782]: continue not asking DCC 64 seconds 
> after failure

The "continue not asking" messages mean that dccifd, dccproc, or dccm
has seen consecutive failures while trying to talk to the DCC server
and so is passing all mail.  In many situations, it is better to fail
by passing all mail than to block all mail.
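
A minimal sketch of that fail-open idea follows.  It is an assumption-laden
illustration, not the logic in dccifd, dccproc, or dccm: the failure
threshold, the fixed 64-second pause (borrowed from the log line above),
and the names `FailOpenGate` and `check_message` are all made up for
this example.

```python
class FailOpenGate:
    """Stop asking the server for a while after consecutive failures."""

    def __init__(self, max_failures=2, backoff_s=64.0):
        self.max_failures = max_failures   # assumed threshold
        self.backoff_s = backoff_s         # assumed fixed pause in seconds
        self.failures = 0
        self.skip_until = 0.0

    def should_ask(self, now):
        return now >= self.skip_until

    def record(self, ok, now):
        if ok:
            self.failures = 0
            return
        self.failures += 1
        if self.failures >= self.max_failures:
            # "continue not asking" until the pause has elapsed
            self.skip_until = now + self.backoff_s


def check_message(gate, ask_server, now):
    """Return "checked" if the server answered, else "passed" (fail open)."""
    if not gate.should_ask(now):
        return "passed"          # server is not asked at all
    ok = ask_server()
    gate.record(ok, now)
    return "checked" if ok else "passed"
```

While the gate is closed, every message is passed without a query, which
is why a long server outage lets unchecked mail through.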


All UNIX flavors I've looked closely at for dccd performance deal poorly
with large mmap() files.  None of them seem to properly page or
swap-to-file as they should for mmap() files.  Solaris is not good but
least bad.  Linux is worst.  I've watched Linux grind to a halt as it
apparently slops the entire dccd database from swap space on the disk
to the filesystem, also on the disk.  FreeBSD is between the extremes.
It sometimes decides to push the entire database from RAM to the file
in a single effort.  When you're talking about GBytes, the rest of the
system gets very slow or even stops for tens of seconds.

Dccd has lots of code that periodically tries to encourage the operating
system to flush parts of the database to the file.  I've never found a
combination that really works on any UNIX flavor.  Msync() generally
seems to do nothing.  Madvise() seems to be useless.  Fsync() after
every operation would probably prevent the hiccups, but would make every
operation take tens of milliseconds instead of fractions of a millisecond.
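
For readers unfamiliar with the mechanism, here is a small sketch of
flushing a memory-mapped file one aligned chunk at a time.  It is written
in Python, whose `mmap.flush()` wraps msync() on UNIX; it is not taken
from dccd, and the helper names `flush_in_chunks` and `demo` are
hypothetical.  It shows only how such a periodic partial flush can be
requested, not that the operating system will page sensibly afterwards.

```python
import mmap
import tempfile


def flush_in_chunks(mm, chunk_bytes):
    """Flush the mapping one aligned chunk at a time; return the chunk count."""
    gran = mmap.ALLOCATIONGRANULARITY
    # flush() offsets must be aligned, so round the chunk size up
    # to a multiple of the allocation granularity.
    chunk = max(gran, (chunk_bytes + gran - 1) // gran * gran)
    count = 0
    for offset in range(0, len(mm), chunk):
        mm.flush(offset, min(chunk, len(mm) - offset))
        count += 1
    return count


def demo():
    """Map a small temporary file, modify it, and flush it in four chunks."""
    size = 4 * mmap.ALLOCATIONGRANULARITY
    with tempfile.TemporaryFile() as f:
        f.truncate(size)                       # extend the file to the mapped size
        with mmap.mmap(f.fileno(), size) as mm:
            mm[0:5] = b"hello"                 # dirty the first page
            chunks = flush_in_chunks(mm, mmap.ALLOCATIONGRANULARITY)
        f.seek(0)
        return chunks, f.read(5)
```

Calling `demo()` returns the number of chunks flushed and the first bytes
read back from the file.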

> Please notice that the RTT to the server remains low all the time, at 
> around 50ms.

50 ms is a fairly large RTT for a local server.


> Not always, when I manually run the cron-dccd script, the errors stop:

> : Mar  8 17:54:22 dcc dccd[4748]: 1.3.30 database /var/dcc/dcc_db 
> reopened with 2016 MByte window

Could the database be growing larger than 2 GByte, and then Linux going
into its crazy mode of swapping the mmap() dcc_db and dcc_db.hash files
to swap space?  I ask because dbclean run by the cron script will 
shrink the file.


> Any help will be greatly appreciated, as we are falling into RBLs every 
> other day, due to the eventual lack of DCC service (we allow email to 
> pass when the DCC doesn't respond)

If it is better for the DCC client to fail by blocking mail, then
you could add -x to DCCM_ARGS or DCCIFD_ARGS in /var/dcc/dcc_conf.
That has two effects.  First, it turns off the "continue not asking"
mechanism so that the DCC client asks every time.  Second, it causes
dccm or dccifd (when dccifd is in proxy mode such as a postfix
before-queue filter) to tell the local MTA to give the distant client
MTA or mail sender a 4yz try-again failure.

Perhaps the best thing to do is to run 2 local DCC servers, each
flooding the other.  Each should run the cron job (and so dbclean)
at different times, and perhaps more than once per day.  Each 
should be known in /var/dcc/map files on DCC client systems.


Vernon Schryver    vjs@rhyolite.com


