Re: [Condor-users] Lots of TIME_WAIT sockets killing server
- Date: Thu, 3 Jun 2010 09:59:41 +0200
- From: "J.A. Gutierrez" <spd@xxxxxxxxxxxxxxxxxxxx>
- Subject: Re: [Condor-users] Lots of TIME_WAIT sockets killing server
> > if condor is active for a couple of days, the condor master host
> > gets its connection table filled with thousands of "TIME_WAIT"
> > sockets, so no new connections can be opened and the server
> > (which also acts as central NFS/NIS+ server) gets killed.
> >
More information:
I started condor master 24 hours ago.
Now, `condor_status` shows 3 Linux clients and 1 Sparc client.
`condor_q` shows 20 jobs queued, and two of them are running
(with 6 and 18 CPU hours each).
---------------------------------------------------------------------------
18.0 ***** 6/2 11:15 0+06:03:07 I 0 219.7 convert.sh
18.1 ***** 6/2 11:15 0+19:01:04 R 0 219.7 convert.sh
18.2 ***** 6/2 11:15 0+00:00:00 I 0 0.0 convert.sh
...
---------------------------------------------------------------------------
The log file for job 18.1 has 278 lines saying:
---------------------------------------------------------------------------
010 (018.001.000) 06/03 06:36:40 Job was suspended.
---------------------------------------------------------------------------
(the host that started the job was shut down)
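For anyone wanting to reproduce that count, here is a minimal sketch. It assumes the job log keeps one event header per line in the form shown above, with "010" marking a suspension as in that line. The name `count_suspensions` is made up for illustration and is not part of condor.

```python
import re

# Event header as it appears in the log excerpt above:
#   010 (018.001.000) 06/03 06:36:40 Job was suspended.
EVENT_RE = re.compile(r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\) ")

def count_suspensions(log_text, cluster, proc):
    """Count '010' event headers for job cluster.proc (hypothetical helper)."""
    count = 0
    for line in log_text.splitlines():
        m = EVENT_RE.match(line)
        if not m:
            continue  # not an event header line
        code, c, p = m.group(1), int(m.group(2)), int(m.group(3))
        if code == "010" and c == cluster and p == proc:
            count += 1
    return count
```

Calling `count_suspensions(open(path).read(), 18, 1)` on the log of job 18.1 should then give the number of suspension events, where `path` is wherever that job's log file lives.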
On the master,
netstat -an | egrep "17.14.*TIME_WAIT" | tail -1
---------------------------------------------------------------------------
***.***.***.***.32772 10.3.17.14.607 5888 0 24616 0 TIME_WAIT
---------------------------------------------------------------------------
and `netstat -an | egrep "17.14.*TIME_WAIT" | wc -l` gives "211",
a count that keeps growing every few minutes.
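To watch all clients at once instead of one egrep per host, the same count can be grouped by remote address. This is only a sketch: it assumes the "address.port" column layout of the netstat line above (local endpoint first, remote endpoint second, state last), and `time_wait_by_host` is an illustrative name.

```python
from collections import Counter

def time_wait_by_host(netstat_lines):
    """Count TIME_WAIT sockets per remote host.

    Assumes columns as in the netstat line above: local 'addr.port',
    remote 'addr.port', ..., state as the last field.
    """
    counts = Counter()
    for line in netstat_lines:
        fields = line.split()
        if len(fields) < 3 or fields[-1] != "TIME_WAIT":
            continue  # header, blank line, or socket in another state
        # The port is whatever follows the last dot of the remote endpoint.
        remote_host, _, _port = fields[1].rpartition(".")
        counts[remote_host] += 1
    return counts
```

It could be fed with `subprocess.run(["netstat", "-an"], capture_output=True, text=True).stdout.splitlines()`. Unlike the egrep pattern, where the unescaped dots match any character, this compares the remote address exactly.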
The host 10.3.17.14 is up and has joined the condor pool, but is not
executing any job.
At this point, if I try to log in to 10.3.17.14 as a user (automounting
$HOME via NFS from the master host), I can't, since automount can't
mount $HOME (manually mounting directories from the master still works).
The automount process on the client host is unresponsive. I can't
even kill it with -TERM; I have to use -KILL.
After restarting automount on the client, I can log in as a user,
but in the meantime the "TIME_WAIT" sockets on the server have grown
to 412.
But shortly after that, the "TIME_WAIT" sockets for 10.3.17.14 are
gone (replaced by 217 similar sockets to 10.3.17.12)!
The port on the server is always "32772", which is assigned to
rpc.nisd (the NIS+ service daemon).
So I must conclude it's a problem with the Linux NIS+ client/automount
which is triggered only by condor, but I can't imagine how.
A very dirty workaround would be to monitor TIME_WAIT sockets on
the master and restart automountd on the hosts with lots of
TIME_WAIT sockets, but I'd like to find a better solution.
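The monitoring half of that workaround might look roughly like the sketch below. It only reports which hosts exceed a limit; how automountd would actually be restarted on them is left out, and the function name, the threshold value and the column layout (taken from the netstat line earlier in this message) are all assumptions.

```python
from collections import Counter

def hosts_over_threshold(netstat_lines, local_port, threshold):
    """Return remote hosts holding more than `threshold` TIME_WAIT
    sockets against `local_port` on this machine, sorted by address.

    Assumes 'addr.port' endpoints with the state as the last field.
    """
    counts = Counter()
    for line in netstat_lines:
        fields = line.split()
        if len(fields) < 3 or fields[-1] != "TIME_WAIT":
            continue
        # Keep only sockets whose local side is the port of interest.
        _addr, _, lport = fields[0].rpartition(".")
        if lport != str(local_port):
            continue
        remote_host, _, _rport = fields[1].rpartition(".")
        counts[remote_host] += 1
    return sorted(h for h, n in counts.items() if n > threshold)
```

Run periodically on the master with `local_port=32772` (the rpc.nisd port seen here) and some threshold such as 200, the returned list would be the hosts whose automount needs attention.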
--
PGP and other useless info at \
http://webdiis.unizar.es/~spd/ \
finger://daphne.cps.unizar.es/spd \ Timeo Danaos et dona ferentes
ftp://ivo.cps.unizar.es/pub/ \ (Virgilio)