dccifd failover under load

Ian Brewer Ian_Brewer@vwiz.co.uk
Wed May 3 08:23:57 UTC 2006


I am using version 1.3.31 of the client and server code. The only thing
I have changed is DCCD_ARGS, which I set to "-R 4000,50,300,0.1" so that
the servers accept the request rate.

I have attached the output of cdcc info while stopping "server1".

Some interesting times are:

08:48:11 *server1 100% of 32 requests ok    0.55 ms RTT
           server2  75% of 32 requests ok    0.63 ms RTT

08:48:12  server1 100% of  4 requests ok   10.65 ms RTT
          *server2 100% of 32 requests ok    0.95 ms RTT

08:48:17  server1 100% of  4 requests ok   10.65 ms RTT
          *server2 100% of 32 requests ok    0.54 ms RTT

# skipping asking DCC server 64 seconds more
           server1  20% of 20 requests ok 3969.25 ms RTT
          *server2 100% of 32 requests ok    1.07 ms RTT
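
For anyone who wants to follow the switch mechanically, here is a rough
sketch that picks the starred (currently selected) server out of summary
lines like the ones above. It assumes the abbreviated layout I pasted
here, not the full `cdcc info` output; the regex and the names LINE_RE
and parse_line are my own and are not part of DCC.

```python
import re

# Matches the abbreviated summary lines quoted in this message, e.g.
#   "08:48:12  server1 100% of  4 requests ok   10.65 ms RTT"
# The layout is an assumption based on that excerpt only.
LINE_RE = re.compile(
    r"\s*(?:\d{2}:\d{2}:\d{2})?\s*(?P<star>\*?)(?P<name>\S+)\s+"
    r"(?P<pct>\d+)% of\s+(?P<count>\d+) requests ok\s+(?P<rtt>[\d.]+) ms RTT"
)


def parse_line(line):
    """Return a dict for one summary line, or None if it does not match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    return {
        "selected": m.group("star") == "*",  # '*' marks the server in use
        "server": m.group("name"),
        "ok_percent": int(m.group("pct")),
        "requests": int(m.group("count")),
        "rtt_ms": float(m.group("rtt")),
    }


snapshot = [
    "# skipping asking DCC server 64 seconds more",
    "           server1  20% of 20 requests ok 3969.25 ms RTT",
    "          *server2 100% of 32 requests ok    1.07 ms RTT",
]
parsed = [parse_line(line) for line in snapshot]
selected = [row["server"] for row in parsed if row and row["selected"]]
print(selected)
```

Lines that do not fit the pattern, such as the "skipping asking" notice,
come back as None and are skipped.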

/var/log/maillog on the client has the following entries:
May  3 08:48:18 client dccifd[28141]: no DCC answer from server1 after 6062 ms
May  3 08:48:18 client dccifd[28141]: no DCC answer from server1 after 6032 ms
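
To line these timeouts up against the cdcc info timestamps, a small
sketch like the following can pull the server name and wait time out of
the log. It assumes the message wording shown above; TIMEOUT_RE and
find_timeouts are names I made up for illustration, not anything
shipped with DCC.

```python
import re

# Wording taken from the maillog entries above. "after\s+" also tolerates
# a line break between "after" and the number, as in wrapped mail text.
TIMEOUT_RE = re.compile(
    r"dccifd\[\d+\]: no DCC answer from (?P<server>\S+) after\s+(?P<ms>\d+) ms"
)


def find_timeouts(log_text):
    """Return a list of (server, milliseconds) for each timeout message."""
    return [
        (m.group("server"), int(m.group("ms")))
        for m in TIMEOUT_RE.finditer(log_text)
    ]


sample = (
    "May  3 08:48:18 client dccifd[28141]: no DCC answer from server1 after 6062 ms\n"
    "May  3 08:48:18 client dccifd[28141]: no DCC answer from server1 after 6032 ms\n"
)
timeouts = find_timeouts(sample)
print(timeouts)
```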

It looks to me as though requests are still queued for server1 even
though we are happily using server2. When the client times out these
requests, the switching code gets confused somehow and I get a 64 second
timeout.

Do you think this is a Linux-only issue? If it would help, I could
install a FreeBSD client and retest.


Vernon Schryver wrote:
>>From: Ian Brewer 
>>currently being used by the client, the second server is immediately 
>>picked (according to cdcc info), but the fail_more() code also fires and 
>>gives me 64 seconds worth of "Continue not asking" messages. Shouldn't 
>>the second server be available for use straight away?
> The second server should be available immediately.
>>It's almost as if any queued requests are counted as failed and causing 
>>the fail_more code to go off, even though a new working server has been 
> If somehow a bunch of requests do in fact fail, then something like
> that should happen.
>>I have included a modified dccif-test program that can be used to show 
>>the problem. I run the program, then "watch -n 1 'cdcc info'" on another 
>>  terminal while shutting the in-use server down.
> I doubt that `watch` on a FreeBSD system does what you intended, so I
> used `repeat 100 sh foo`, where foo was a shell script consisting of
> `cdcc info; sleep 1`
> I see no problems.  The count from your program spirals up and pauses
> when I run `cdcc "id 101; stop"`.  Then it resumes, albeit at a slower
> pace because the other servers are distant.
> What version of the DCC code are you using?  The current version is either
> 1.3.31 or 2.3.31 depending on whether you are using the commercial version.
> I've continued to work on the server switching machinery since the earliest
> releases.
> Exactly (other than passwords) what do you see from `cdcc info`?
> Vernon Schryver    vjs@rhyolite.com
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: cdccoutput
URL: <http://www.rhyolite.com/pipermail/dcc/attachments/20060503/fa7ababd/attachment.ksh>