dccifd failover under load

Vernon Schryver vjs@calcite.rhyolite.com
Wed May 3 14:00:41 UTC 2006


> From: Ian Brewer <Ian_Brewer@vwiz.co.uk>

> I am using version 1.3.31 of the client and server code.

As I pointed out to you on February 16 when you asked here about running
two servers on a single computer, running private servers not connected
to the public network of DCC servers violates the current DCC license.
See the LICENSE file in your copy of the source or
http://www.rhyolite.com/anti-spam/dcc/dcc-tree/LICENSE

Am I confused, or am I right that your servers are not connected to the
global network of DCC servers?  Why should I expend billable hours for
someone who is violating the free license on my code, or time that I
could be spending on improvements that might be used by other people
who share their DCC checksums with everyone else?

Are you doing as you said, and running 2 servers on a single computer?
If so, do you have enough RAM so that both can be resident?  Are you
avoiding the stalls when Linux stops everything to slop the database
between swap space and the files?
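
As a rough way to answer the RAM question, here is a minimal sketch that
compares the size of the database files with physical memory.  The
paths in DB_FILES and the 80% headroom figure are assumptions made for
illustration, not values taken from the DCC documentation; with two
servers on one computer, list the files of both.

```python
import os

# Assumed paths for illustration; adjust to the actual DCC home directories.
DB_FILES = ["/var/dcc/dcc_db", "/var/dcc/dcc_db.hash"]

def physical_ram_bytes():
    """Total physical memory as reported by sysconf (Linux and most Unixes)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")

def fits_in_ram(db_paths, phys_bytes, headroom=0.8):
    """Return (total_bytes, fits).

    fits is True when the files that exist add up to no more than
    headroom * phys_bytes.  The headroom value is an arbitrary assumption.
    """
    total = sum(os.path.getsize(p) for p in db_paths if os.path.exists(p))
    return total, total <= phys_bytes * headroom

if __name__ == "__main__":
    total, fits = fits_in_ram(DB_FILES, physical_ram_bytes())
    print(f"database files: {total} bytes, fit in RAM: {fits}")
```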


> /var/log/maillog on the client has the following entries:
> May  3 08:48:18 client dccifd[28141]: no DCC answer from server1 after 
> 6062 ms
> May  3 08:48:18 client dccifd[28141]: no DCC answer from server1 after 
> 6032 ms
>
> It looks to me that there are queued requests to server1 even though we 
> are using server2 happily. When the client times out these requests, the 
> switching code get confused somehow and I get a 64 second timeout.

The "75% of 32 requests ok" reports about server2 while you were using
server1 (as indicated by the asterisk (*)) show that something is wrong
with server2.  When you shut down server1 and switch to server2, whatever
is causing those problems may have effects.  For example, when server2
doesn't answer as quickly as its RTT says it should, then the clients
will probe server1.

The "skipping asking DCC server 64 seconds more" complaints from cdcc
show that cdcc, not just dccifd, is unable to reach either server.
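
To see which server the clients are actually failing to reach, and when,
it can help to tally the timeout complaints per server.  The following
is a minimal sketch that assumes each complaint appears on a single line
of the log in the form quoted above ("no DCC answer from NAME after N
ms"); the function name is hypothetical.

```python
import re
from collections import defaultdict

# Matches the dccifd complaint quoted above, e.g.
#   "... dccifd[28141]: no DCC answer from server1 after 6062 ms"
TIMEOUT_RE = re.compile(r"no DCC answer from (\S+) after\s+(\d+) ms")

def tally_timeouts(log_text):
    """Return {server: (number_of_timeouts, longest_wait_ms)}."""
    waits = defaultdict(list)
    for match in TIMEOUT_RE.finditer(log_text):
        waits[match.group(1)].append(int(match.group(2)))
    return {server: (len(ms), max(ms)) for server, ms in waits.items()}

if __name__ == "__main__":
    import sys
    # Usage: python tally.py < /var/log/maillog
    for server, (count, worst) in sorted(tally_timeouts(sys.stdin.read()).items()):
        print(f"{server}: {count} timeouts, longest {worst} ms")
```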

You don't say how you shut down the servers, whether you use `kill -9`
or something that lets dccd sync the database.  Judging from the
cdcc output, you are restarting the servers after you stop them.
If you use something brutal like `kill -9`, dbclean will have to run
and soak up lots of disk bandwidth, CPU cycles, and RAM.
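
The difference between a brutal stop and a polite one can be sketched
as follows: send SIGTERM, wait for the process to exit, and report
failure instead of escalating to `kill -9`.  This assumes that dccd
treats SIGTERM as a request for a clean exit, which should be checked
against the dccd documentation; if the DCC distribution provides its own
way to stop the server, prefer that.  The function name and timeout
values are hypothetical.

```python
import os
import signal
import time

def stop_gracefully(pid, timeout=30.0, poll=0.5):
    """Send SIGTERM to pid and wait up to timeout seconds for it to exit.

    Returns True if the process is gone, False if it is still present.
    It never sends SIGKILL; the caller decides what to do on False.
    Raises PermissionError if the process belongs to another user.
    """
    os.kill(pid, signal.SIGTERM)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            os.kill(pid, 0)  # signal 0 only tests whether the pid exists
        except ProcessLookupError:
            return True
        time.sleep(poll)
    return False
```

A process that has exited but has not yet been reaped by its parent
still counts as present, so this is meant for daemons such as dccd
rather than for children of the calling script.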

Consider the effects on all processes of closing a large file.  If both
servers are on a single computer, closing one large file can make
everything stop.  (Yes, I saw the two net-10 IP addresses.  Perhaps you
are using aliases on network interfaces to run two servers on a single
computer.)


Vernon Schryver    vjs@rhyolite.com


