Problem on dcc 1.3.30 - Continue Not Asking DCC...

Breno Moiana breno@haxent.com.br
Fri Mar 10 13:38:00 UTC 2006


Hello, Vernon. 

At this very moment, the server is unresponsive. I will try to provide diagnostics info, and then I will do the things you suggested (update, remove the anon, etc):

1. The DCC server

1.1
# cdcc "host localhost; stats"
localhost 127.0.0.1,512
        server-ID 1004  09:42:39
    version 1.3.30  tracing ANON CLNT
43646973 hash entries 27081311 used 1586469396 DB bytes
   50 ms delay  7035 NOPs  1233 ADMN  1 query  3 clients since 14:56:13
5769835 reports       1>10       1>100       1>1000       1 many
        answers 3571133>10 3202243>100 2526188>1000  549682 many
    0 bad IDs  407 passwds  37 error responses  5929 retransmitted
    0 answers rate-limited   0 anonymous           0 rejected reports
    flood on   3 streams   2 out active 2 in 94327213 total flooded in
74212662 accepted 1342906 stale 18755163 dup  16297 white    0 delete
72019640 reports added between Feb 21 10:53:16.629997 BRT and Mar 10 09:42:39

1.2
# cdcc info
Notice that other servers have their RTTs adjusted to +1000ms, while my own is adjusted to -1000ms. I don't think that is relevant to this case at the point we are now though.

(...)
dcc5.dcc-servers.net,-      RTT+1000 ms  anon
#  80.69.8.186,-                                                  MC ID 1128
#     protocol version 4
#     100% of 32 requests ok  294.47+1000+2000 ms RTT      101 ms queue wait
#  212.203.14.116,-                                        EATSERVER ID 1166
#      97% of 32 requests ok  337.92+1000 ms RTT        89 ms queue wait

127.0.0.1,-                 RTT-1000 ms  anon
# *127.0.0.1,-                                               OiComBR ID 1004
#     100% of 32 requests ok   50.18-1000 ms RTT        50 ms queue wait

################
# 03/10/06 09:46:44 BRT  GreyList /var/dcc/map
# Re-resolve names after 11:31:23
# 1 total, 0 working servers

127.0.0.1,-                 Greylist 32768 4228787756x304
#  127.0.0.1,6276
#      not answering
(EOF)



2. The server
2.1 top:
Below you can see a sample from "top" while the server is unresponsive; the values float around these numbers, and dccifd and mysql take turns on who is on top, but never deviating much from this pattern.
Everything else uses 0 %CPU and 0 %MEM.

top - 09:40:05 up 30 days, 18:44,  4 users,  load average: 0.27, 0.22, 0.21
Tasks:  70 total,   1 running,  69 sleeping,   0 stopped,   0 zombie
Cpu(s):  4.1% us,  0.8% sy,  0.0% ni, 94.2% id,  0.6% wa,  0.1% hi,  0.2% si
Mem:   3607316k total,  3583480k used,    23836k free,    14984k buffers
Swap:  2096472k total,      576k used,  2095896k free,  2923016k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 6390 root      15   0 71128  10m 1356 S 11.0  0.3  84:45.62 dccifd
10528 mysql     16   0  700m 490m 4024 S  8.3 13.9   1371:37 mysqld
 6289 root      15   0 2021m 1.4g 1.4g S  1.3 41.1   2:03.66 dccd
(...)

2.2 vmstat:
# vmstat

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 0  0    576  23588  15500 2923020    0    0   118    36    0     4  3  1 94  3


2.3 ps:
This might be interesting. /etc/init.d/dcc start loads three processes: dccd, and two dccifd:

USER       PID %CPU %MEM   VSZ  RSS TTY      STAT START   TIME COMMAND
(...)
root      6289  0.1 41.7 2069512 1504976 ?   Ss   Mar09   2:16 /var/dcc/libexec/dccd -i1004 -n OiComBR -R 4000,5000,3000,1
root      6389  0.0  0.0  6048  304 ?        Ss   Mar09   0:00 /var/dcc/libexec/dccifd -tCMN,,300 -wwhiteclnt -Uuserdirs -A -p 2
root      6390  7.6  0.2 71128 10252 ?       Sl   Mar09  88:54 /var/dcc/libexec/dccifd -tCMN,,300 -wwhiteclnt -Uuserdirs -A -p 2
(...)

See the two dccifd processes? Only the second one seems to be actually working, and all the countdown entries on the log are from that process:
(...)
Mar 10 10:18:43 dcc last message repeated 11 times
Mar 10 10:18:44 dcc dccifd[6390]: continue not asking DCC 1758 seconds after failure
Mar 10 10:18:44 dcc last message repeated 24 times
Mar 10 10:18:45 dcc dccifd[6390]: continue not asking DCC 1757 seconds after failure
(...)


*** "cdcc rtt" resetted the counter and the server got back up on its feet, at least this time. It is working now.
Here are some comparative results:


# cdcc "host localhost; stats"
localhost 127.0.0.1,512
        server-ID 1004  09:53:54
    version 1.3.30  tracing ANON CLNT
43646973 hash entries 27110548 used 1590558552 DB bytes
   52 ms delay  7039 NOPs  1236 ADMN  1 query  3 clients since 14:56:13
5770823 reports       1>10       1>100       1>1000       1 many
        answers 3571543>10 3202632>100 2526293>1000  549708 many
    0 bad IDs  407 passwds  37 error responses  5929 retransmitted
    0 answers rate-limited   0 anonymous           0 rejected reports
    flood on   3 streams   2 out active 2 in 94380628 total flooded in
74257864 accepted 1342906 stale 18763369 dup  16304 white    0 delete
72065891 reports added between Feb 21 10:53:16.629997 BRT and Mar 10 09:53:54


# vmstat
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 0  0    576  23820  17120 2921660    0    0   118    36    1     4  3  1 94  3


Top behaves a little differently now that the server is working: dccifd remains on top for most of the time, and reaches higher peaks, going all the way up to 20%CPU (I've seen 35% peaks)

top - 09:57:24 up 30 days, 19:01,  4 users,  load average: 0.37, 0.43, 0.33
Tasks:  70 total,   1 running,  69 sleeping,   0 stopped,   0 zombie
Cpu(s):  6.1% us,  0.9% sy,  0.0% ni, 88.0% id,  4.5% wa,  0.0% hi,  0.5% si
Mem:   3607316k total,  3584100k used,    23216k free,    17556k buffers
Swap:  2096472k total,      576k used,  2095896k free,  2920444k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 6390 root      15   0 71128  10m 1356 S 17.6  0.3  86:50.70 dccifd
10528 mysql     16   0  700m 490m 4024 S 10.7 13.9   1373:01 mysqld
 4734 root      16   0  6208 1004  772 R  0.3  0.0   0:00.08 top
(...)


Now on to the step-by-step answer of your e-mail... :)



Vernon Schryver wrote:

>>From: Breno Moiana 
>>    
>>
>>1. At least one time, the server stopped responding at the time of the 
>>database cleanup. However, it remained that way for about 8 hours. Then 
>>    
>>
>
>
>What was the state of the server for those 8 hours?  The command
>`cdcc "host localhost; stats"` should have given some clues.
>  
>
I don't knowfor sure, those 8 hours were from 00:00 to 08:00am. 
According to the logs, it didn't get back to a responsive state during 
that time (there were no significant time windows without error 
messages), so I believe the situation would have been very much like the 
one I described earlier in this email.

>] >That volume could easily justify a second DCC server.  The DCC client
>
>] We have thought about it, and now that you mentioned it as a possible 
>] solution, we will look carefully into it as a solution.
>] I know this might be a stupid question, but how can I verify the need 
>] for a secondary server? I mean, the CPU is constantly idle, and nearly 
>] all my memory is being used for cache... where is the bottleneck? 
>
>My best guess is that Linux is doing bad things, but that's only a guess.
>It would be good to check the state of the system and dccd when dccd
>is not answering.  Besides `cdcc stats`, I would check the system
>with ps, vmstat, top, and so forth.
>  
>
Done. the info is there, I appreciate any comments on those. I didn't 
notice anomalies, but I have been looking at these for so long that 
everything is kind of blurred already. Maybe somebody who hasn't been 
living with the problem for the last weeks will come and tell me "hey, 
this is wrong!". I would love a magic answer :D

>] all my memory is being used for cache... where is the bottleneck? Would 
>] a second server help me even if I have a lot of unused hardware on this 
>] server already?
>
>The second server would be need only when the first server is not answering.
>  
>
Oh, okay. I thought you meant that with 3 million messages, I would be 
overloading something that should be split between two servers for 
balancing purposes, not only for redundancy and service availability.

>] An option I can think of is to install VMWare on this machine, and make 
>] two servers in this hardware. Should this work ?
>
>I doubt that would be a good thing.  Can dccd even run in Linux
>over VMWare?  If all of RAM is tied up by one dccd instance, running
>a second instance would not help.  If it could help, then it would be
>better to just run two DCC servers, one on port 6277 and the other on
>some other port.   I don't recommend doing that, although for various
>reasons I run 2 and sometimes 3 copies of dccd on my personal FreeBSD box.
>  
>
Well, I was thinking in a VMWare GSX server, which seems to be pretty 
reliable for server virtualization. Running two linux servers in one 
box, I would not only save the expense of new hardware (already got the 
VMW license), but could even set up different configurations on each one 
(maybe leave one with some other kernel version), to see which one 
stands out as more stable.

>] 127.0.0.1,-                 RTT-1000 ms  anon
>] # *127.0.0.1,-                                               OiComBR ID 1004
>] #     100% of 32 requests ok   51.57-1000 ms RTT        50 ms queue wait
>
>As I said, 50 ms is not good.  However, that `cdcc info` output says
>that the server is not really that slow.  Your client does not have a
>valid DCC client-ID and password but is instead anonymous.  Your
>server is imposing the default anonymous client delay of 50 ms.  You
>should assign a client-ID such as 32768 in /var/dcc/ids on the server,
>and then tell your DCC clients to use it with the command
>  
>
I will change that now and see if it helps.

>    cdcc "add  127.0.0.1   RTT-1000 ms   32768  secret"
>
>where "secret" is password for 32768 in /var/dcc/ids.
>
>After switching to a non-anonymous client-ID, you will only want to
>send mail with the results of `cdcc info` run as an ordinary user and
>not root.  When run as root, `cdcc info` and `cdcc rtt` show secret
>passwords.
>  
>
Okay, point taken. Will do.

>] I don't think so... right now, the database is at 1.35GB, and it is not 
>
>I hope you are using `cdcc stats` to measure the size of the database.
>Many UNIX varients do not do the right thing to dates or even sizes 
>in the inode of files changed with mmap().
>  
>
Not really, I used ls -l for that. However, I was looking at it now, and 
cdcc stats also points to db size of around 1.4GB.

>] About the "continue not asking" mechanism, I noticed that sometimes the 
>] system just gets out of its idleness and gets back to a responsive 
>] status, without any interference. 
>
>When the "continue not asking" mechanism counts down, the client tries
>again.  If things fail again, the count is doubled (unless it is already
>at the maximum) and resumes counting down.
>  
>
Yes, I had the impression this was happenning. My point wasn't the 
countdown per se, but the fact that it both stops and gets back to 
responding without any interference.

>]                                   Another thing is that even when I 
>] stop/start the service, it keeps counting from when it was. Can I reset 
>] the counter? Some command to tell the server: "Hey, try it now, let's 
>] see if what I did worked for you".
>
>The counting is done in the DCC client, not the server.   THe server 
>might be down entirely and so cannot be trusted to maintain the count.
>I think many times `cdcc rtt` will reset the client.
>  
>
While I am writing this email, the server stopped responding twice 
already. On both times, running "cdcc rtt" successfully brought the 
service back up.

>} What is that 2016 MByte window? Can it be enhanced? Should it? I seem to 
>} have plenty of memory:
>
>The "window" is the total size of the mmap() pool of buffers mapped
>onto the file.
>Perhaps 2 GByte is too large for your version or installation of Linux.  
>Do you have plenty of swap space?
>
>It might be interesting to try rebuilding with a smaller "window" with
>
>   /var/dcc/libexec/updatedcc -c --with-max-db-mem=1500
>
>If that does not help, remember to restore default by
>   
>   /var/dcc/libexec/updatedcc -c --without-max-db-mem
>  
>
Doing that as soon as I send this e-mail.

>
>Vernon Schryver    vjs@rhyolite.com
>_______________________________________________
>DCC mailing list      DCC@rhyolite.com
>http://www.rhyolite.com/mailman/listinfo/dcc
>
>  
>
Thanks one more time for the help, Vernon. I will be rebuilding dcc with the smaller window and setting the clients to not use anon connections. Let's hope it helps.

I also hope that the info I gathered here is useful to help tracing back to the problem, so we can find and eliminate the problem.

Thanks very much for the attention!

Best Regards,

Breno Moiana
=====================
Haxent Consulting




More information about the DCC mailing list

Contact vjs@rhyolite.com by mail or use the form.