Vernon Schryver vjs@calcite.rhyolite.com
Thu Nov 23 14:44:00 UTC 2006

> From: Daniel Gehriger 

> > What version and flavor of UNIX-like operating system are you using?
> > The main dccifd or dccm process uses waitpid(WNOHANG) to know when
> > helper processes have gone away.  That works well on FreeBSD, but
> > maybe not on other flavors.
> > 
> > After every failure to receive an answer from a helper, zombie
> > helpers are reaped.
> > After 5 failures to receive an answer, all of the helpers are terminated
> > and restarted.
> > 
> > There may be a race in restarting idle helpers.  I'll look into it.
> I'm using SuSE Linux 9.1, so that should be fine.

Actually, I'd bet the opposite.  The rate of ill-considered changes
and flat-out bugs appearing in Linux seems to be increasing.

>                                                   I agree that this 
> looks like a race condition.

That is not exactly what I meant to say.  The race I suspect but have
not verified might happen 50% of the time to the first 5 mail messages
that arrive after at least 60 but fewer than 65 seconds of inactivity.
If waitpid() is working as well as it does on FreeBSD and Solaris, the
last helper will be reaped within 5 seconds.
That possible race does not fit with the mention of failures continuing
for hours.
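
For readers unfamiliar with the mechanism, here is a rough sketch of
non-blocking reaping with waitpid(WNOHANG).  It is an illustration
only, not the dccifd or dccm source; the names reap_zombies() and
spawn_helper() and the use of a set of pids are assumptions made
for the example.

```python
import os
import time


def spawn_helper(lifetime):
    """Fork a stand-in helper that exits after `lifetime` seconds (hypothetical)."""
    pid = os.fork()
    if pid == 0:
        # Child: pretend to work, then exit without running parent cleanup.
        time.sleep(lifetime)
        os._exit(0)
    return pid


def reap_zombies(helpers):
    """Reap helpers that have exited, without blocking on live ones.

    `helpers` is a set of child pids; reaped pids are removed from it
    and returned as a list.
    """
    reaped = []
    for pid in list(helpers):
        try:
            # WNOHANG makes waitpid() return (0, 0) if the child is still running.
            done, _status = os.waitpid(pid, os.WNOHANG)
        except ChildProcessError:
            # Not our child any more (already reaped); just forget it.
            done = pid
        if done == pid:
            helpers.discard(pid)
            reaped.append(pid)
    return reaped
```

Calling reap_zombies() periodically, or after each failed request,
keeps exited helpers from lingering as zombies while leaving running
helpers alone.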

A related problem with the description is that after 5 failures, all of
the helpers are killed and a new batch is started, so failures should not
be able to continue for hours.
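
The policy described above can be sketched roughly as follows.  This
is a toy model, not the real code: the HelperPool class, its counters,
and the choice to reset the failure count after a restart are all
assumptions made for illustration.  Only the threshold of 5 comes from
the description.

```python
MAX_FAILURES = 5  # threshold taken from the description above


class HelperPool:
    """Toy model of the failure policy; it does not manage real processes."""

    def __init__(self):
        self.failures = 0    # failures since the last restart (assumed)
        self.reaps = 0       # how many times zombies were reaped
        self.generation = 0  # how many times the whole batch was restarted

    def reap_zombies(self):
        # Stand-in for a waitpid(WNOHANG) pass over the helpers.
        self.reaps += 1

    def restart_all(self):
        # Stand-in for killing every helper and starting a new batch.
        self.generation += 1
        self.failures = 0

    def note_failure(self):
        """Record one failure to receive an answer from a helper."""
        self.failures += 1
        self.reap_zombies()
        if self.failures >= MAX_FAILURES:
            self.restart_all()
```

Under this model a wedged batch of helpers survives at most 5 failed
requests before it is replaced.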

thanks for reporting and diagnosing the problem,
Vernon Schryver    vjs@rhyolite.com

