I’ve seen this before with single (not cluster) jobs. If I just submit one then it can stay idle for ages, but if …

Could anyone from U-W let us know what the “out of servers” message means? I’ve seen this several times.

regards,
-ian

PS: I know people have asked for this before (and I realise there are good reasons why it’s difficult), but could the -analyze results provide more info about the scheduling than “for unknown reasons”?
From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Rick Lan

We had seen similar log messages. Our setup has preemption disabled (the setup recommended by section 3.6.10.5 of the manual). However, turning up the debug output showed that, I believe, the negotiator was not dividing up the leftover "resource pie". The Condor guys told us to use

    NEGOTIATOR_CONSIDER_PREEMPTION = True

and it did help in our pool.

Hope this helps,
Rick
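For anyone who wants to try the same thing, here is a rough sketch of what the central-manager configuration could look like. The PREEMPTION_REQUIREMENTS line and the D_FULLDEBUG level are only assumptions about what "preemption disabled" and "more debug info" mean in a particular pool; check your own config and the manual before copying anything:

    # Negotiator-side configuration sketch (central manager); adjust for your pool.
    # One common way of disabling priority preemption (assumption):
    PREEMPTION_REQUIREMENTS = False
    # The setting that helped in Rick's pool:
    NEGOTIATOR_CONSIDER_PREEMPTION = True
    # Extra negotiator logging, to see how the "resource pie" is split (assumed level):
    NEGOTIATOR_DEBUG = D_FULLDEBUG

After editing the config, a condor_reconfig on the central manager should pick the new settings up.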
From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Devaraj Das

I am trying to submit 79 jobs through a single submit file with a “Queue 79”. The jobs remain idle for a long time (approximately 30 minutes the last time I saw this problem) before getting scheduled, but once one of them starts executing, the others quickly follow. Although there are more than 79 idle nodes available, these 79 jobs don’t go into execution for a long time. Any idea why?
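For concreteness, the submit file looks roughly like the sketch below; the class, jar, and output file names are placeholders rather than the real ones, and universe = java reflects the fact (mentioned further down) that these are Java universe jobs:

    # Hypothetical submit description; only the structure matters here.
    universe    = java
    executable  = MyJob.class          # placeholder class file containing main()
    jar_files   = myjob.jar            # placeholder dependency jar
    arguments   = MyJob $(Process)     # first argument is the main class name
    log         = myjob.log
    output      = myjob.$(Process).out
    error       = myjob.$(Process).err
    Queue 79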
For example, if I do a condor_q -better-analyze for one of the jobs I see:

    258.078:  Run analysis summary.  Of 84 machines,
          0 are rejected by your job's requirements
          0 reject your job because of their own requirements
          1 match but are serving users with a better priority in the pool
         83 match but reject the job for unknown reasons
          0 match but will not currently preempt their existing job
          0 are available to run your job

Here is a snippet of the SchedLog from the submit node:
on "<66.196.90.7:32774>", (shadow pid = 26550) 9/9 18:51:31 (pid:9138) Sent ad to central manager
for ddas@xxxxxxxx 9/9 18:51:31 (pid:9138) Sent ad to 1 collectors for
ddas@xxxxxxxx 9/9 18:51:36 (pid:9138) DaemonCore: Command received
via UDP from host <66.196.90.120:55382> 9/9 18:51:36 (pid:9138) DaemonCore: received command
421 (RESCHEDULE), calling handler (reschedule_negotiator) 9/9 18:51:36 (pid:9138) Sent ad to central manager
for ddas@xxxxxxxx 9/9 18:51:36 (pid:9138) Sent ad to 1 collectors for
ddas@xxxxxxxx 9/9 18:51:36 (pid:9138) Called
reschedule_negotiator() 9/9 18:51:42 (pid:9138) Activity on stashed
negotiator socket 9/9 18:51:42 (pid:9138) Negotiating for owner:
ddas@xxxxxxxx 9/9 18:51:42 (pid:9138) Checking consistency running
and runnable jobs 9/9 18:51:42 (pid:9138) Tables are consistent 9/9 18:51:42 (pid:9138) Out of servers - 0 jobs
matched, 79 jobs idle, 1 jobs rejected 9/9 18:51:42 (pid:9138) Activity on stashed
negotiator socket 9/9 18:51:42 (pid:9138) Negotiating for owner:
ddas@xxxxxxxx 9/9 18:51:42 (pid:9138) Checking consistency running
and runnable jobs 9/9 18:51:42 (pid:9138) Tables are consistent 9/9 18:51:42 (pid:9138) Out of servers - 0 jobs
matched, 79 jobs idle, 1 jobs rejected By the way, these jobs belong to the Java universe
By the way, these jobs belong to the Java universe (and all nodes have Java), and I was able to run this many jobs successfully earlier (pretty quickly, without this long startup pause); it is only recently that I have been seeing this problem. I haven’t restarted the cluster yet. Will really appreciate any help in this regard…

Thanks,
Devaraj