dccm causing tempfails

Vernon Schryver vjs@calcite.rhyolite.com
Fri Sep 1 16:30:33 UTC 2006

> From: Bart Dumon <bartdu@bsp.scarlet.be>

> yes, i'm getting these:
> Sep  1 17:00:17 rago dccm[17715]: 23 too many simultaneous mail messages

Those are the "smoking gun."

> and these occur too from time to time:
> Sep  1 16:47:47 pisa dccm[31761]: DCC, mi_rd_cmd: read returned -1: Connection reset by peer

Those suggest that by the time dccm was ready to talk some more to sendmail
about a mail message, sendmail had given up and shut down the connection.

Are you using TCP or UNIX domain sockets for the milter connection
between sendmail and dccm?  The default is a UNIX domain socket, but
if sendmail and dccm are on separate computers, TCP must be used.

The "too many simultaneous mail messages" problem can be caused by:

  - default value of `dccm -j` too low 
      The default value is based on
         the apparent limit on the number of file descriptors (FDs),
            which comes from
               the limit on select() FD sets, FD_SETSIZE, unless the
                  local UNIX flavor supports poll().  That can be seen
                  by running the ./configure script or by
                      /var/dcc/libexec/updatedcc -v
               the getrlimit(RLIMIT_NOFILE) value, or the current limit
                  on the number of open files per job
         5 FDs per job, 32 FDs for libc overhead such as DNS,
            and 20 FDs for per-user whitelists
         a default upper limit of 200 jobs

      The value of -j actually used can be seen by adding -d to DCCM_ARGS
      in /var/dcc/dcc_conf and restarting dccm with
      /var/dcc/libexec/start-dccm.

   - dccm running too slowly
       This can be caused by
	  dccd answering too slowly
	  slow or lossy network between dccm and dccd
	  dccm slowed by waiting for -B answers (does not apply without -B)
	  other jobs on the computer running dccm
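
The -j default arithmetic in the first item above can be sketched roughly
as follows.  The constants (5, 32, 20, 200) come from the description
above, but the exact formula is an assumption made for illustration, and
estimate_default_jobs() is a hypothetical helper, not part of DCC.  The
value dccm really uses should be checked with -d as described above.

```python
import resource

FDS_PER_JOB = 5          # from the description above: 5 FDs per job
LIBC_OVERHEAD_FDS = 32   # libc overhead such as DNS
WHITELIST_FDS = 20       # per-user whitelists
MAX_DEFAULT_JOBS = 200   # default upper limit on jobs


def estimate_default_jobs(fd_limit):
    """Estimate the default -j value for a given FD limit.

    ASSUMPTION: spare FDs divided by FDs per job, capped at 200.
    The real dccm computation may differ in detail.
    """
    spare = fd_limit - LIBC_OVERHEAD_FDS - WHITELIST_FDS
    return max(0, min(MAX_DEFAULT_JOBS, spare // FDS_PER_JOB))


if __name__ == "__main__":
    # Soft limit on open files for this process (UNIX only).
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        jobs = MAX_DEFAULT_JOBS
    else:
        jobs = estimate_default_jobs(soft)
    print(f"soft RLIMIT_NOFILE={soft}, estimated default -j={jobs}")
```

Under this assumed formula, a 256 FD limit gives about 40 jobs and a
1024 FD limit gives about 194, so a small FD limit alone could explain
"too many simultaneous mail messages" on a busy system.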

The easiest change is to add a -jX setting to DCCM_ARGS in /var/dcc/dcc_conf
and restart dccm with /var/dcc/libexec/start-dccm.  However, that is rarely
needed and might only cover up the real problem.

Whether dccd is too slow for any reason can be seen with
    cdcc info
on the system running dccm.  The average round trip time (RTT) for
a local DCC server should be a very small number of milliseconds.

Judging from 
    cdcc "host dcc1.scarlet.be; stats"
part of the problem might be the network between dccm and dccd.
Those status counts include "22050 retransmitted".  Those are instances
when dccd realized that the client (probably dccm) was retransmitting
a request.  On a LAN without hardware problems (e.g. bad 10/100/FDX/HDX
negotiation or bad cables), there should be fewer retransmissions.
When I looked, dccd was claiming an average service time of "0 ms delay"
so at least at that time, dccd was responding entirely fast enough.

Another possibility occurs to me.  You have a single DCC server.
When it is busy running the nightly dbclean cron job, dccd will be
slow.  More than that, it will claim an artificially inflated queue
service time to encourage clients to switch to another DCC server.
The public DCC servers should be 100, 1000, or more times slower than
a local DCC server.  Internet delays alone are likely to make
them respond after 100s of milliseconds.
If the public servers are mentioned in /var/dcc/map on your client
systems, your DCC clients might switch to the public DCC servers.
If you did not include an RTT bias when you configured your DCC clients
to use your DCC server, your clients might not switch back as quickly 
as they should.

Vernon Schryver    vjs@rhyolite.com
