Problem on dcc 1.3.30 - Continue Not Asking DCC...

Gary Mills mills@cc.umanitoba.ca
Fri Mar 10 02:36:56 UTC 2006


On Thu, Mar 09, 2006 at 10:07:44AM -0700, Vernon Schryver wrote:
> > From: Gary Mills 
> >
> > I've been seeing this problem for a long time, but it doesn't happen
> > every day.  It did this morning, though.  It seems to begin with
> > database cleaning:
> 
> Are both of your servers known to dccifd and with the same (or no) RTT
> adjustments?  If so, that should not happen.  Dccifd should switch to
> the other server when one fails to answer instead of giving up asking
> for exponentially increasing durations.

Yes, there's no RTT adjustment, at least not one that I've set
explicitly.  I do notice that `cdcc info' shows peculiar RTT values
some times.  Here's what it says just now:

  # 03/09/06 20:25:43 CST  /usr/local/dcc/map
  # Re-resolve names after 21:54:36
  # 2290.71 ms threshold, 4260.90 ms average    2 total, 2 working servers
  IPv6 off
  
  electra.cc.umanitoba.ca,-    RTT+0 ms    32769
  # *130.179.16.23,-                                              UofM ID 1032
  #     100% of 32 requests ok    0.48+0 ms RTT            0 ms queue wait
  
  zaurak.cc.umanitoba.ca,-    RTT+0 ms    32769
  #  130.179.16.67,-                                              UofM ID 1033
  #     protocol version 6
  #     100% of 32 requests ok  190.71+0+2000 ms RTT        0 ms queue wait
  
  ################
  # 03/09/06 20:25:43 CST  GreyList /usr/local/dcc/map
  # Re-resolve names after 21:54:36
  # 2104.02 ms threshold, 189.73 ms average    2 total, 2 working servers
  
  electra.cc.umanitoba.ca,-    Greylist RTT+0 ms    32769
  # *130.179.16.23,6276                                                ID 1032
  #     100% of 32 requests ok    0.38+0 ms RTT            0 ms queue wait
  
  zaurak.cc.umanitoba.ca,-    Greylist RTT+0 ms    32769
  #  130.179.16.67,6276                                                ID 9002
  #     protocol version 6
  #     100% of 32 requests ok    4.02+0+2000 ms RTT        0 ms queue wait

The other one did its database cleaning between 02:35:03 and 02:36:53.
The one local to the mail server did it between 03:49:17 and 03:54:34.
The first `no DCC answer' and `continue not asking' messages were at
03:54:23, towards the end of the cleaning interval.  Why didn't dccm
switch to the other dccd?

It's curious also that the last `continue not asking' message said
`2042 seconds after failure'.  That's 34 minutes, even though the
database cleaning took only five minutes.

> Is it possible that time appeared to dccifd to jump?  For example, if
> Solaris stops everthing to shuffle pages of dcc_db and dcc_db.hash from
> RAM to the disk, dccifd might think that it had worked hard without
> any answers from dccd.

I don't think so.  There were no other errors around the time of
the incident.

> >   Mar  9 03:54:36 electra dccm[4639]: [ID 702911 mail.error] fcntl(F_SETLKW F_WRLCK resolve lock /usr/local/dcc/map 3): Deadlock situation detected/avoided
> 
> I think that is an independent problem, albeit perhaps triggered by
> slowness.  I thought I had fixed it in 1.3.30, but evidently not.  I
> can't reproduce it and so have been just guessing.  Oh, well, back to
> code reading and theorizing.

I wonder if it's related.  Could a mutex lock cause the connection
to fail?

-- 
-Gary Mills-    -Unix Support-    -U of M Academic Computing and Networking-



More information about the DCC mailing list

Contact vjs@rhyolite.com by mail or use the form.