[Condor-users] claimed slots are idle
- Date: Tue, 26 Jan 2010 10:15:14 -0800
- From: dalonso <dalonso@xxxxxxxxxxxxxxxx>
- Subject: [Condor-users] claimed slots are idle
We are seeing an erratic problem on our cluster and wondered if this
rings any bells with any of you.
Summary
(0) The condor queue submits and runs jobs satisfactorily for hours or
days, i.e. jobs get queued and run to completion, then other jobs take
the vacated slots.
(1) Then after some time (hours or days) we start noticing that
claimed slots aren't running jobs, i.e.
condor_status -claimed shows a load of 0.000 on a bunch of slots
(example below) and no jobs are running on those nodes.
These slots are never released, never show up as unclaimed, and
never have running jobs.
Initially there is a mixture of working ("claimed and busy") nodes
and futile ("claimed and idle") nodes, but the situation escalates to
the point that (almost?) all of the slots are claimed and idle,
and the load average on the entire cluster is near zero.
I need to confirm the following two points:
- No shadow tasks run on Master for the claimed idle slots
- No shadow or user tasks run on the nodes associated with the claimed
slots
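A minimal sketch of how these two checks could be scripted, assuming a process listing with one command name per line (for example from `ps -eo comm=`) is collected on the master and on each affected node. condor_shadow and condor_starter are the Condor processes that normally accompany a running job; the function name and the sample listing are hypothetical.

```python
# Process names that accompany a running job:
# condor_shadow on the submit side, condor_starter on the execute side.
JOB_PROCESSES = ("condor_shadow", "condor_starter")


def count_job_processes(ps_text):
    """Count shadow/starter processes in text holding one command name per line."""
    counts = {name: 0 for name in JOB_PROCESSES}
    for line in ps_text.splitlines():
        name = line.strip()
        if name in counts:
            counts[name] += 1
    return counts


# Hypothetical listing from a node whose slots are claimed but idle.
sample_ps = """\
condor_master
condor_startd
sshd
"""

print(count_job_processes(sample_ps))
```

If both counts are zero on a machine whose slots are claimed, that would match the two conditions listed above.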
Other Info.
* the nodes don't crash
* [root@vic ~]# condor_version
$CondorVersion: 7.4.0 Oct 31 2009 BuildID: 193173 $
$CondorPlatform: X86_64-LINUX_RHEL5 $
* Restarting the condor daemon on a claimed idle node does not fix
the problem.
To fix the problem we have to stop condor on the nodes, stop it on the
master, clean the spool directory, start it on the master, then start it
on the nodes.
* All our routing, ping, name resolution, and portscan tests from
working and non-working
clients and the master look normal.
* The work dirs are on NFS.
* no abnormal loads on the NFS servers
* file and directory access on the work dirs is not compromised (ls
and find run fast).
* Example from condor_status -claimed:
slot2@vic100. LINUX X86_64 0.000 some_user@vic vic.cluster
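A rough sketch of how the claimed-but-idle slots could be picked out of saved condor_status -claimed output. It assumes the column order suggested by the example line above (Name, OpSys, Arch, LoadAv, RemoteUser, ClientMachine); the function name, the header line, and the slot1 line in the sample are made up for illustration.

```python
def find_idle_claimed(status_text, threshold=0.0):
    """Return (slot, load, remote_user) for claimed slots at or below threshold."""
    idle = []
    for line in status_text.splitlines():
        fields = line.split()
        # Assumed layout: six whitespace-separated columns, name like slotN@host.
        if len(fields) != 6 or "@" not in fields[0]:
            continue
        try:
            load = float(fields[3])  # assumed LoadAv column
        except ValueError:
            continue  # not a slot line
        if load <= threshold:
            idle.append((fields[0], load, fields[4]))
    return idle


# The slot2 line is the example from this post; the header and the slot1
# line are hypothetical, added only to show a busy slot being skipped.
sample = """\
Name          OpSys  Arch   LoadAv RemoteUser    ClientMachine
slot1@vic100. LINUX  X86_64 1.020  some_user@vic vic.cluster
slot2@vic100. LINUX  X86_64 0.000  some_user@vic vic.cluster
"""

for slot, load, user in find_idle_claimed(sample):
    print(slot, load, user)
```

Running this periodically and logging the result would show when the first idle claims appear and how quickly they spread.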
Darwin O.V. Alonso
dalonso@xxxxxxxxxxxxxxxx
Dept. Biochem. J558(HSB)
University of Washington
1705 NE Pacific St
Seattle WA 98195-7350