dccm running out of file descriptors

Gary Mills mills@cc.UManitoba.CA
Sat Jan 31 23:46:43 UTC 2004


On Fri, Jan 30, 2004 at 07:59:45AM -0700, Vernon Schryver wrote:
> > From: Gary Mills <mills@cc.UManitoba.CA>
> 
> > > What does that `lsof` line mean?  What are the '*' characters?  Do they
> > > mean the socket is bound to port 0 at both ends?  Or does that line
> > > mean the socket is not complete, perhaps because accept() has not been done?
> >
> > I'm not sure.  If `lsof' uses the `netstat' definitions, it means:
> >
> >      IDLE  Idle, opened but not bound.
> 
> That makes no sense to me.  I don't know how you can make a TCP socket
> have two IP addresses but not be bound.

Me neither.  /usr/include/inet/tcp.h has the same definition for IDLE.
That would imply that dccm is initiating TCP connections, but I don't
think that's the case.

> Perhaps it is a socket in the
> TCP state Close-Wait, or shut down by the other host and waiting for
> a local close() system call.  A glut of such sockets could be caused
> by a missing close() in some error path somewhere.

I don't think so.  Here are this morning's error messages.  First, eight
of these:

Jan 31 00:58:46 electra dccm[20546]: [ID 109917 mail.error] DCC, mi_rd_cmd: read returned -1: Connection reset by peer

Then, a bit later, lots of these:

Jan 31 01:12:25 electra dccm[20546]: [ID 125918 mail.error] DCC: accept() returned invalid socket (Too many open files), try again

The last one was:

Jan 31 07:30:39 electra dccm[20546]: [ID 925838 mail.error] dcc_mkstemp(/var/dcc/log/031/07/tmp.2tnXSD): Too many open files

after which I restarted dccm.  The `lsof' output before the restart
showed 3358 IDLE TCP sockets.
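
For anyone who wants to watch the descriptor count climb rather than
wait for "Too many open files", here is a minimal Python sketch.  It is
not taken from dccm.  It assumes the system exposes a per-process
/proc/self/fd or /dev/fd directory, and the function names
open_fd_count() and fd_usage() are made up for illustration.  It only
reports on the process that runs it, so it shows the idea rather than a
way to inspect a running dccm.

```python
import os
import resource


def open_fd_count():
    """Approximate number of descriptors open in the current process.

    Assumes a per-process fd directory exists.  The directory listing
    briefly holds one descriptor of its own, so treat the result as an
    estimate rather than an exact figure.
    """
    for path in ("/proc/self/fd", "/dev/fd"):
        try:
            return len(os.listdir(path))
        except OSError:
            continue
    raise RuntimeError("no per-process fd directory found")


def fd_usage():
    """Return (descriptors in use, soft RLIMIT_NOFILE) for this process."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return open_fd_count(), soft


if __name__ == "__main__":
    used, limit = fd_usage()
    print(f"{used} of {limit} descriptors in use")
```

A count that keeps rising toward the soft limit while mail volume stays
flat would point at descriptors that are opened but never closed.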

> > corresponds to that hour of very low e-mail activity.  It may have
> > been the result of an I/O overload that began earlier in the evening.
> 
> What was that about? 
> Did whatever it was include any syslog complaints from dccm?

I thought it was the early morning rebuild of an LDAP database, which
saturated a disk for about two hours.  However, I moved the rebuild
to TMPFS, where it completed in about 15 minutes.  Now I suspect
the nightly backup.  I'll have to check with our data management person
to see when that runs.  `cron-dccd' runs at 03:45, so that can't be it.

-- 
-Gary Mills-    -Unix Support-    -U of M Academic Computing and Networking-


