Problem on dcc 1.3.30 - Continue Not Asking DCC...

Breno Moiana breno@haxent.com.br
Thu Mar 9 19:06:35 UTC 2006


Okay, new information.

My service stopped responding again. Here is what I gathered:

I left a "while true; do ls -l dcc_db; done" loop running. The file size 
was stable and then jumped:

---cut
1387266048 Mar  9 14:57 dcc_db
1387266048 Mar  9 14:57 dcc_db
1387266048 Mar  9 14:57 dcc_db
1403781120 Mar  9 15:25 dcc_db
1403781120 Mar  9 15:25 dcc_db
1403781120 Mar  9 15:25 dcc_db
---/cut
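
For anyone who wants to repeat this observation with timestamps, here is a minimal sketch in Python of the same idea: poll the file size and report only the changes. The function names `size_changes` and `watch` are made up for illustration and are not part of DCC. The sample sizes are the ones from the ls output above.

```python
import os
import time


def size_changes(samples):
    """Yield (previous, current, delta) for each change in a sequence of sizes."""
    previous = None
    for current in samples:
        if previous is not None and current != previous:
            yield previous, current, current - previous
        previous = current


def watch(path, interval=1.0, rounds=None):
    """Poll `path` and print a timestamped line whenever its size changes."""
    previous = None
    count = 0
    while rounds is None or count < rounds:
        current = os.stat(path).st_size
        if previous is not None and current != previous:
            stamp = time.strftime("%b %d %H:%M:%S")
            print(f"{stamp} {path}: {previous} -> {current} "
                  f"({current - previous:+d} bytes)")
        previous = current
        count += 1
        time.sleep(interval)


if __name__ == "__main__":
    # Sizes copied from the ls output above.
    observed = [1387266048] * 3 + [1403781120] * 3
    for old, new, delta in size_changes(observed):
        print(f"{old} -> {new}: grew by {delta} bytes ({delta / 2**20:.2f} MiB)")
```

On the sizes above this reports a single jump of 16515072 bytes, which is 15.75 MiB. Calling `watch("/var/dcc/dcc_db")` would do the live polling instead.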

At that precise moment, here is what showed up in the log (notice that 
the previous entry is from about 25 minutes earlier; the server was 
working during that time):

---cut
Mar  9 14:59:24 dcc dccd[6289]: 1.3.30 database /var/dcc/dcc_db reopened with 2016 MByte window
Mar  9 15:25:19 dcc dccifd[6390]: missing message body
Mar  9 15:28:18 dcc dccifd[6390]: missing message body
Mar  9 15:29:55 dcc dccifd[6390]: no DCC answer from 127.0.0.1,6277 after 16027 ms
Mar  9 15:29:55 dcc dccifd[6390]: continue not asking DCC 64 seconds after failure
Mar  9 15:29:55 dcc last message repeated 9 times
Mar  9 15:29:56 dcc dccifd[6390]: continue not asking DCC 63 seconds after failure
Mar  9 15:29:56 dcc last message repeated 5 times
---/cut
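
To line these events up against the file size changes, a rough Python sketch like the following could pull the timeouts and the "continue not asking" entries out of the syslog file. It assumes each entry sits on a single line, as in the log file itself rather than as wrapped by a mail client, and it only matches the two message texts seen above. The name `summarize` is hypothetical, and the number in the "continue not asking" message is recorded as reported, without assuming what dccifd means by it.

```python
import re

# Patterns for the two dccifd messages quoted above.
NO_ANSWER = re.compile(r"no DCC answer from (\S+) after (\d+) ms")
NOT_ASKING = re.compile(r"continue not asking DCC (\d+) seconds after failure")


def summarize(lines):
    """Return (timeouts, not_asking) found in unwrapped syslog lines."""
    timeouts = []    # (timestamp, server, wait in ms)
    not_asking = []  # (timestamp, seconds value as reported)
    for line in lines:
        stamp = line[:15]  # classic syslog timestamp, e.g. "Mar  9 15:29:55"
        match = NO_ANSWER.search(line)
        if match:
            timeouts.append((stamp, match.group(1), int(match.group(2))))
            continue
        match = NOT_ASKING.search(line)
        if match:
            not_asking.append((stamp, int(match.group(1))))
    return timeouts, not_asking


if __name__ == "__main__":
    sample = [
        "Mar  9 15:25:19 dcc dccifd[6390]: missing message body",
        "Mar  9 15:29:55 dcc dccifd[6390]: no DCC answer from 127.0.0.1,6277 after 16027 ms",
        "Mar  9 15:29:55 dcc dccifd[6390]: continue not asking DCC 64 seconds after failure",
        "Mar  9 15:29:56 dcc dccifd[6390]: continue not asking DCC 63 seconds after failure",
    ]
    timeouts, not_asking = summarize(sample)
    print(timeouts)
    print(not_asking)
```

Note that "last message repeated N times" lines are ignored, so the counts understate how many messages were actually passed without a DCC check.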


Then I ran /etc/cron.daily/cron-dccd. Afterwards the db size is still the 
larger of the two values above:
 1403781120 Mar  9 15:50 dcc_db


and it is working again, after this:

---cut
Mar  9 15:52:25 dcc last message repeated 6 times
Mar  9 15:52:26 dcc dccifd[6390]: continue not asking DCC 800 seconds after failure
Mar  9 15:52:26 dcc last message repeated 4 times
Mar  9 15:52:27 dcc dccd[6289]: 1.3.30 database /var/dcc/dcc_db reopened with 2016 MByte window
---/cut

So the file size doesn't seem to be the villain here.
What is that 2016 MByte window? Can it be enlarged? Should it be? I seem 
to have plenty of memory:

---cut
# free
             total       used       free     shared    buffers     cached
Mem:       3607316    3582912      24404          0       5204    2939816
-/+ buffers/cache:     637892    2969424
Swap:      2096472        576    2095896
---/cut
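
As a sanity check on those numbers, here is a small Python sketch of the arithmetic. It assumes that "MByte" in the dccd message means 2**20 bytes; how dccd actually sizes its window is not known here, and only dcc_db is compared, not the dcc_db.hash file mentioned further down. The helper names are made up for illustration.

```python
MIB = 2**20  # assumption: "MByte" in the dccd message means 2**20 bytes


def reclaimable_kb(free_kb, buffers_kb, cached_kb):
    """Memory the kernel could hand out: free plus buffers plus page cache."""
    return free_kb + buffers_kb + cached_kb


def window_headroom_mib(db_bytes, window_mib):
    """How far the database file is below the reported window, in MiB."""
    return window_mib - db_bytes / MIB


if __name__ == "__main__":
    # Figures copied from the free output and the log above.
    print(reclaimable_kb(24404, 5204, 2939816))
    print(window_headroom_mib(1403781120, 2016))
```

The first figure comes out as 2969424 kB, matching the free column of the "-/+ buffers/cache" row, and the second as 677.25, so under these assumptions the larger dcc_db size (1338.75 MiB) is still well inside the 2016 MByte window.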

Things are working again for now. I am watching to see for how long, 
and what happens in the meantime.

Best Regards,

Breno Moiana.
================
Haxent Consulting




Breno Moiana wrote:

> Hello, Vernon.  Thanks for the quick reply!
>
> About your considerations:
>
>
>
> Vernon Schryver wrote:
>
>>> From: Breno Moiana   
>>
>>
>>  
>>
>>> We have a DCC server set up on an email provider, handling around 3 
>>> million email messages a day.
>>>   
>>
>>
>> That volume could easily justify a second DCC server.  The DCC client
>> code prefers the fastest working known DCC server.  When the currently
>> chosen server stops working, it tries another.
>>  
>>
> We have thought about it, and now that you mention it as a possible 
> solution, we will look into it carefully.
> I know this might be a stupid question, but how can I verify the need 
> for a secondary server? I mean, the CPU is constantly idle, and nearly 
> all my memory is being used for cache... where is the bottleneck? 
> Would a second server help me even if I have a lot of unused hardware 
> on this server already?
>
> An option I can think of is to install VMware on this machine, and 
> run two servers on this hardware. Would this work?
>
>
>>> Without any apparent reason, something happens to DCC that makes it 
>>> stop responding. Here is the log from the beginning of the problem:
>>>
>>> : Mar  9 09:29:38 dcc dccifd[4782]: no DCC answer from 127.0.0.1,6277 after 18264 ms
>>> : Mar  9 09:29:38 dcc dccifd[4782]: continue not asking DCC 64 seconds after failure
>>>   
>>
>>
>> The "continue not asking" messages mean that dccifd, dccproc, or dccm
>> has seen consecutive failures while trying to talk to the DCC server
>> and so is passing all mail.  In many situations, it is better to fail
>> by passing all mail than to block all mail.
>>  
>>
> I completely agree. That's why I have been falling into RBLs. I think 
> that getting into the occasional spamcop list is better than not 
> delivering mail. Besides, we do have other filters in place, so most 
> of our spam is still filtered out.
>
>> All UNIX flavors I've looked closely at for dccd performance deal poorly
>> with large mmap() files.  None of them seem to properly page or
>> swap-to-file as they should for mmap() files.  Solaris is not good but
>> least bad.  Linux is worst.  I've watched Linux grind to a halt as it
>> apparently slops the entire dccd database from swap space on the disk
>> to the filesystem, also on the disk.  FreeBSD is between the extremes.
>> It sometimes decides to push the entire database from RAM to the file
>> in a single effort.  When you're talking about GBytes, the rest of the
>> system gets very slow or even stops for tens of seconds.
>>
>> Dccd has lots of code that periodically tries to encourage the operating
>> system to flush parts of the database to the file.  I've never found a
>> combination that really works on any UNIX flavor.  Msync() generally
>> seems to do nothing.  Madvise() seems to be useless.  Fsync() after
>> every operation would probably prevent the hiccups, but would make every
>> operation take 10s instead of fractions of milliseconds.
>>  
>>
> I didn't experience any noticeable system performance issues on this 
> machine so far, and I have been very focused on it for the last two 
> weeks.  The database size is well under 2GB; right now it is 
> 1.35GB. I don't know what other information I could add to help the 
> diagnosis on this point.
>
>>> Please notice that the RTT to the server remains low all the time, 
>>> at around 50ms.
>>>   
>>
>>
>> 50 ms is a fairly large RTT for a local server.
>>  
>>
> Right now, it is working, and cdcc info gives me:
>
> ---cut
> 127.0.0.1,-                 RTT-1000 ms  anon
> # *127.0.0.1,-                                               OiComBR ID 1004
> #     100% of 32 requests ok   51.57-1000 ms RTT        50 ms queue wait
> ---/cut
>
>>> Sometimes, but not always, the errors stop when I manually run the 
>>> cron-dccd script:
>>>
>>> Mar  8 17:54:22 dcc dccd[4748]: 1.3.30 database /var/dcc/dcc_db reopened with 2016 MByte window
>>>   
>>
>>
>> Could the database be growing larger than 2 GByte, and then Linux going
>> into its crazy mode of swapping the mmap() dcc_db and dcc_db.hash files
>> to swap space?  I ask because dbclean run by the cron script will 
>> shrink the file.
>>  
>>
> I don't think so... right now, the database is at 1.35GB, and it is 
> not growing, at least not for the last half hour. I ran the cron 
> script a couple of hours ago, not sure if that should allow it to work 
> without increasing the filesize though.
>
>>> Any help will be greatly appreciated, as we are falling into RBLs 
>>> every other day, due to the eventual lack of DCC service (we allow 
>>> email to pass when the DCC doesn't respond)
>>>   
>>
>>
>> If it is better for the DCC client to fail by blocking mail, then
>> you could add -x to DCCM_ARGS or DCCIFD_ARGS in /var/dcc/dcc_conf.
>> That has two effects.  It turns off the "continue not asking" mechanism
>> so that the DCC client asks every time.  Second, it causes dccm or
>> dccifd (when dccifd is in proxy mode such as a postfix before-queue
>> filter) to tell the local MTA to give the distant client MTA or mail
>> sender a 4yz try-again failure.
>>
>> Perhaps the best thing to do is to run 2 local DCC servers, each
>> flooding the other.  Each should run the cron job (and so dbclean)
>> at different times, and perhaps more than once per day.  Each should 
>> be known in /var/dcc/map files on DCC client systems.
>>
>>
>> Vernon Schryver    vjs@rhyolite.com
>> _______________________________________________
>> DCC mailing list      DCC@rhyolite.com
>> http://www.rhyolite.com/mailman/listinfo/dcc
>>
> Well, we still think that letting email pass when it fails is the 
> lesser of evils.
>
> About the "continue not asking" mechanism, I noticed that sometimes 
> the system just gets out of its idleness and gets back to a responsive 
> status, without any interference. Another thing is that even when I 
> stop/start the service, it keeps counting from where it was. Can I 
> reset the counter? Some command to tell the server: "Hey, try it now, 
> let's see if what I did worked for you".
>
> I am not sure about the second server. Granted, redundancy is always 
> welcome, and it would be nice to have it on DCC as well. However, we 
> have already had 5 million emails a day running here without problems. 
> The server also doesn't necessarily stop responding at peak times, 
> which is what I would expect if load on the dcc process were the 
> problem, and most of the server hardware is not being used.
>
> We are considering a second server, but I am not sure if that will 
> solve the problem, or only hide its effect.
>
> Thanks once more for the attention!
>
> Best Regards,
>
> Breno Moiana.
> ===============
> Haxent Consulting
>
>



