Date: Mon, 7 Feb 2005 23:09:21 -0500
From: "Ian Chesal" <ICHESAL@xxxxxxxxxx>
Subject: RE: [Condor-users] Explaining the Claimed + Idle state
> My user's schedd went to >90 jobs and, sure enough, it began to 
> report more jobs running than there really were (by quite a 
> bit). So I've made the recommended registry change to the 
> machine and rebooted it. It's in a good state now: condor_q 
> -name and condor_status -sub are showing the same queued and 
> running count for the user. The number of jobs running is 
> only 48 right now. I'll keep an eye on it for the next couple 
> of hours.

The registry change has not stopped condor_write errors from appearing
in the machine's SchedLog:

2/7 21:19:47 Activity on stashed negotiator socket
2/7 21:19:47 Negotiating for owner: eahmed@xxxxxxxxxx
2/7 21:19:47 Checking consistency running and runnable jobs
2/7 21:19:47 Tables are consistent
2/7 21:24:54 Out of servers - 175 jobs matched, 1050 jobs idle, 1050
jobs rejected
2/7 21:25:43 condor_read(): timeout reading buffer.
2/7 21:27:02 condor_read(): timeout reading buffer.
2/7 21:27:07 Sent ad to central manager for eahmed@xxxxxxxxxx
2/7 21:27:07 Sent ad to 1 collectors for eahmed@xxxxxxxxxx
2/7 21:27:07 condor_write(): Socket closed when trying to write buffer
2/7 21:27:07 Buf::write(): condor_write() failed
2/7 21:27:07 SECMAN: Error sending response classad!
2/7 21:27:07 Sent RELEASE_CLAIM to startd on <>

These errors are the first sign of impending trouble (as reported to the
condor-admin list in report #11869): the schedd's idea of how many jobs
are running will now slowly grow and grow until it's reporting far more
jobs running than is physically possible. I'm writing this sentence at
about 9:00 pm EST.

And after waiting 90 minutes (it is now about 10:45 pm EST) I find his
ttc-eahmed3 machine attempting to run 90+ jobs simultaneously. The
condor_status -sub output is now:

Name                 Machine      Running IdleJobs HeldJobs

eahmed@xxxxxxxxxx    TTC-EAHMED         0        0        0
eahmed@xxxxxxxxxx    TTC-EAHMED        93     2904        0

                           RunningJobs           IdleJobs

   eahmed@xxxxxxxxxx                93               2904

And condor_q on ttc-eahmed3 is returning:

2978 jobs; 2884 idle, 94 running, 0 held

Note the widening gap between the number of idle jobs reported by
condor_status and by condor_q on the schedd machine. As I've observed in
the past, this gap will get wider: condor_q on the schedd machine will
report fewer and fewer idle jobs and a slightly increasing number of
running jobs, while condor_status will keep reporting the same number of
idle jobs and fewer and fewer running jobs. Presumably the connection
between the master and the schedd is now difficult to maintain, as the
SchedLog file for ttc-eahmed3 is littered with Buf::write():
condor_write() failed errors.
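For anyone wanting to watch that gap over time, here's a minimal sketch (not a Condor tool; it just parses the summary-line format shown below, with the counts hard-coded from this snapshot rather than pulled live from `condor_q -name` and `condor_status -sub`):

```python
import re

# Counts hard-coded from the snapshot in this mail; in practice they
# would be scraped from `condor_q -name <schedd>` and
# `condor_status -submitters` output.
condor_q_summary = "2978 jobs; 2884 idle, 94 running, 0 held"
status_running, status_idle = 93, 2904  # condor_status -sub totals

m = re.match(r"(\d+) jobs; (\d+) idle, (\d+) running, (\d+) held",
             condor_q_summary)
total, q_idle, q_running, q_held = map(int, m.groups())

# Negative idle gap means condor_q sees fewer idle jobs than
# condor_status does -- the divergence described above.
idle_gap = q_idle - status_idle
running_gap = q_running - status_running
print(f"idle gap: {idle_gap}, running gap: {running_gap}")
```

At this snapshot the idle gap is already -20 with a running gap of +1, and per the pattern above both should keep drifting apart.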

And I have many machines in the system that have ttc-eahmed3 as their
ClientMachine but are sitting in the Claimed+Idle state, and have been
for many seconds (the ones over 1200 seconds are the ones that concern
me; they'll stay that way until I manually vacate them):

Mon Feb  7 22:47:26 2005

EnteredCurrentActivity	ClientMachine	Name	SecondsInCurrentActivity
vm1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx       22
vm2@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx       2004
vm2@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx       2185
vm1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  102
vm1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx   2096
vm2@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx      959
vm1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx    875
1107833266  vm2@xxxxxxxxxxxxxxxxxxxxxxxxx
1107834430  vm1@xxxxxxxxxxxxxxxxxxxxxxxxx
1107832381  vm2@xxxxxxxxxxxxxxxxxxxxxxxxx
1107832393  vm1@xxxxxxxxxxxxxxxxxxxxxxxxx
1107832641  vm1@xxxxxxxxxxxxxxxxxxxxxxxxx
1107832145  vm2@xxxxxxxxxxxxxxxxxxxxxxxxx
1107834427  vm1@xxxxxxxxxxxxxxxxxxxxxxxxx
1107834417  vm2@xxxxxxxxxxxxxxxxxxxxxxxxx
1107832312  vm1@xxxxxxxxxxxxxxxxxxxxxxxxx
1107833893  vm2@xxxxxxxxxxxxxxxxxxxxxxxxx
1107832189  vm2@xxxxxxxxxxxxxxxxxxxxxxxxx
vm1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx 1058
vm2@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx 1159
vm2@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx        2164
vm1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx 193
vm1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx        46
vm2@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx 30
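As a sanity check on the rows that only show a raw EnteredCurrentActivity timestamp, the seconds-in-activity figure can be recovered from the query time with a few lines of Python (the slot names here are placeholders, since the archive redacts the real hostnames; the timestamps are taken verbatim from the table):

```python
from datetime import datetime, timezone, timedelta

# Query time from the table header: "Mon Feb  7 22:47:26 2005" (EST)
EST = timezone(timedelta(hours=-5))
query_epoch = int(datetime(2005, 2, 7, 22, 47, 26, tzinfo=EST).timestamp())

STUCK_AFTER = 1200  # seconds in Claimed+Idle before I'd manually vacate

entered = {  # slot -> EnteredCurrentActivity (Unix time, from the table)
    "vm2@example-a": 1107833266,
    "vm1@example-b": 1107834430,
    "vm2@example-c": 1107832381,
}

seconds_idle = {name: query_epoch - ts for name, ts in entered.items()}
stuck = [name for name, secs in seconds_idle.items() if secs > STUCK_AFTER]

for name, secs in seconds_idle.items():
    flag = "STUCK" if secs > STUCK_AFTER else "ok"
    print(f"{name}: {secs}s in Claimed+Idle [{flag}]")
```

By this arithmetic, 1107832381 works out to 2065 seconds in Claimed+Idle, well past the 1200-second mark.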

So it would appear that, despite the registry change, I'm hitting a hard
upper limit on the number of concurrently running vanilla jobs that a
single schedd instance can support in 6.7.3. Does that analysis seem
apt?

- Ian
