RE: [Condor-users] Explaining the Claimed + Idle state


Date: Mon, 7 Feb 2005 23:09:21 -0500
From: "Ian Chesal" <ICHESAL@xxxxxxxxxx>
Subject: RE: [Condor-users] Explaining the Claimed + Idle state
> My user's schedd went to >90 jobs and, sure enough, it began to 
> report more jobs running than there really were (by quite a 
> bit). So I've made the recommended registry change to the 
> machine and rebooted it. It's in a good state now: condor_q 
> -name and condor_status -sub are showing the same queued and 
> running counts for the user. The number of jobs running is 
> only 48 right now. I'll keep an eye on it for the next couple 
> of hours.
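
(For anyone following along, the cross-check above is nothing fancier
than comparing the per-submitter totals from the two tools, roughly
along these lines -- exact arguments approximate:

    condor_status -submitters
    condor_q -name ttc-eahmed3 eahmed

and making sure the running and idle counts agree between the two.)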

The change in the registry setting has not stopped the appearance of
condor_write errors in the SchedLog for the machine:

2/7 21:19:47 Activity on stashed negotiator socket
2/7 21:19:47 Negotiating for owner: eahmed@xxxxxxxxxx
2/7 21:19:47 Checking consistency running and runnable jobs
2/7 21:19:47 Tables are consistent
2/7 21:24:54 Out of servers - 175 jobs matched, 1050 jobs idle, 1050 jobs rejected
2/7 21:25:43 condor_read(): timeout reading buffer.
2/7 21:27:02 condor_read(): timeout reading buffer.
2/7 21:27:07 Sent ad to central manager for eahmed@xxxxxxxxxx
2/7 21:27:07 Sent ad to 1 collectors for eahmed@xxxxxxxxxx
2/7 21:27:07 condor_write(): Socket closed when trying to write buffer
2/7 21:27:07 Buf::write(): condor_write() failed
2/7 21:27:07 SECMAN: Error sending response classad!
2/7 21:27:07 Sent RELEASE_CLAIM to startd on <137.57.142.60:1039>

These errors are the first sign of impending trouble (as reported to the
condor-admin list in report #11869) -- the schedd's count of running jobs
will now slowly grow and grow until it's reporting far more jobs running
than is physically possible. I'm writing this sentence at about 9:00 pm
EST.
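
An easy way to catch that first failure is simply to watch the SchedLog
for the write errors -- something like the following (the log directory
comes from condor_config_val; on the Windows submit nodes a findstr
against the SchedLog does the same job):

    grep -c "condor_write(): Socket closed" `condor_config_val LOG`/SchedLog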

And after waiting for 90 minutes (it is now about 10:45 pm EST) I find
his ttc-eahmed3 machine attempting to run 90+ jobs simultaneously. The
condor_status -sub output is now:

Name                 Machine      Running IdleJobs HeldJobs

eahmed@xxxxxxxxxx    TTC-EAHMED         0        0        0
eahmed@xxxxxxxxxx    TTC-EAHMED        93     2904        0

                           RunningJobs           IdleJobs           HeldJobs

   eahmed@xxxxxxxxxx                93               2904                  0

And condor_q on ttc-eahmed3 is returning:

2978 jobs; 2884 idle, 94 running, 0 held


Note the widening gap between the number of idle jobs reported by
condor_status and by condor_q on the schedd machine. As I've observed in
the past, this gap will keep growing: the schedd machine will report
fewer and fewer idle jobs with a slightly increasing number of running
jobs, while condor_status will keep reporting the same number of idle
jobs and fewer and fewer running jobs. Presumably the connection between
the master and the schedd is now difficult to maintain, since the
SchedLog file for ttc-eahmed3 is littered with Buf::write():
condor_write() failed errors.
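
If anyone wants to put numbers on that drift, a crude snapshot loop like
the one below (a bourne-shell sketch, run from any machine with the
Condor tools in its PATH; schedd-drift.log is just an arbitrary output
file) would be enough to log both views side by side:

    while true; do
        date
        condor_status -submitters
        condor_q -name ttc-eahmed3 eahmed | grep "jobs;"
        sleep 300
    done >> schedd-drift.log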

And I have many machines in the system that list ttc-eahmed3 as their
ClientMachine but are sitting in the Claimed+Idle state, and have been
there for a long time (the ones over 1200 seconds are the ones that
concern me; they'll stay that way until I manually vacate them):

Mon Feb  7 22:47:26 2005

EnteredCurrentActivity  ClientMachine           Name                                            SecondsInCurrentActivity
1107834424              TTC-EAHMED3.altera.com  vm1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx      22
1107832442              TTC-EAHMED3.altera.com  vm2@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx      2004
1107832261              TTC-EAHMED3.altera.com  vm2@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx      2185
1107834344              TTC-EAHMED3.altera.com  vm1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx         102
1107832350              TTC-EAHMED3.altera.com  vm1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx          2096
1107833487              TTC-EAHMED3.altera.com  vm2@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx             959
1107833571              TTC-EAHMED3.altera.com  vm1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx           875
1107833266              TTC-EAHMED3.altera.com  vm2@xxxxxxxxxxxxxxxxxxxxxxxxx                  1180
1107834430              TTC-EAHMED3.altera.com  vm1@xxxxxxxxxxxxxxxxxxxxxxxxx                  16
1107832381              TTC-EAHMED3.altera.com  vm2@xxxxxxxxxxxxxxxxxxxxxxxxx                  2065
1107832393              TTC-EAHMED3.altera.com  vm1@xxxxxxxxxxxxxxxxxxxxxxxxx                  2053
1107832641              TTC-EAHMED3.altera.com  vm1@xxxxxxxxxxxxxxxxxxxxxxxxx                  1805
1107832145              TTC-EAHMED3.altera.com  vm2@xxxxxxxxxxxxxxxxxxxxxxxxx                  2301
1107834427              TTC-EAHMED3.altera.com  vm1@xxxxxxxxxxxxxxxxxxxxxxxxx                  19
1107834417              TTC-EAHMED3.altera.com  vm2@xxxxxxxxxxxxxxxxxxxxxxxxx                  29
1107832312              TTC-EAHMED3.altera.com  vm1@xxxxxxxxxxxxxxxxxxxxxxxxx                  2134
1107833893              TTC-EAHMED3.altera.com  vm2@xxxxxxxxxxxxxxxxxxxxxxxxx                  553
1107832189              TTC-EAHMED3.altera.com  vm2@xxxxxxxxxxxxxxxxxxxxxxxxx                  2257
1107833388              TTC-EAHMED3.altera.com  vm1@xxxxxxxxxxxxxxxxxxxxxxx                    1058
1107833287              TTC-EAHMED3.altera.com  vm2@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx        1159
1107832282              TTC-EAHMED3.altera.com  vm2@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx       2164
1107834253              TTC-EAHMED3.altera.com  vm1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx        193
1107834400              TTC-EAHMED3.altera.com  vm1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx       46
1107834416              TTC-EAHMED3.altera.com  vm2@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx        30
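
(A listing like the one above can be pulled straight from condor_status
with a constraint query along these lines -- the -format strings are
approximate, and SecondsInCurrentActivity is just the current time minus
EnteredCurrentActivity, computed afterwards:

    condor_status -constraint 'State == "Claimed" && Activity == "Idle" && ClientMachine == "TTC-EAHMED3.altera.com"' \
        -format "%d\t" EnteredCurrentActivity \
        -format "%s\t" ClientMachine \
        -format "%s\n" Name

Clearing the stuck ones is then a matter of running condor_vacate
against each of the hostnames in that list.)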

So it would appear that, despite the registry change, I'm hitting a hard
upper limit on the number of concurrently running vanilla jobs that a
single schedd instance can support in 6.7.3. Does that analysis seem
apt?

- Ian


