I am trying to submit 79 jobs through a single submit file
with a “Queue 79” statement. The jobs remain idle for a long time (approximately
30 minutes the last time I saw this problem) before getting scheduled, but once
one of them starts executing, the others quickly follow. Although more
than 79 idle nodes are available, these 79 jobs do not start executing for a
long time. Any idea why? For example, if I do a condor_q -better-analyze
for one of the jobs I see:

258.078:  Run analysis summary.  Of 84 machines,
      0 are rejected by your job's requirements
      0 reject your job because of their own requirements
      1 match but are serving users with a better priority in the pool
     83 match but reject the job for unknown reasons
      0 match but will not currently preempt their existing job
      0 are available to run your job

Here is a snippet of the SchedLog from the submit node:

9/9 18:51:26 (pid:9138) Started shadow for job 257.0 on "<66.196.90.7:32774>", (shadow pid = 26550)
9/9 18:51:31 (pid:9138) Sent ad to central manager for ddas@xxxxxxxx
9/9 18:51:31 (pid:9138) Sent ad to 1 collectors for ddas@xxxxxxxx
9/9 18:51:36 (pid:9138) DaemonCore: Command received via UDP from host <66.196.90.120:55382>
9/9 18:51:36 (pid:9138) DaemonCore: received command 421 (RESCHEDULE), calling handler (reschedule_negotiator)
9/9 18:51:36 (pid:9138) Sent ad to central manager for ddas@xxxxxxxx
9/9 18:51:36 (pid:9138) Sent ad to 1 collectors for ddas@xxxxxxxx
9/9 18:51:36 (pid:9138) Called reschedule_negotiator()
9/9 18:51:42 (pid:9138) Activity on stashed negotiator socket
9/9 18:51:42 (pid:9138) Negotiating for owner: ddas@xxxxxxxx
9/9 18:51:42 (pid:9138) Checking consistency running and runnable jobs
9/9 18:51:42 (pid:9138) Tables are consistent
9/9 18:51:42 (pid:9138) Out of servers - 0 jobs matched, 79 jobs idle, 1 jobs rejected
9/9 18:51:42 (pid:9138) Activity on stashed negotiator socket
9/9 18:51:42 (pid:9138) Negotiating for owner: ddas@xxxxxxxx
9/9 18:51:42 (pid:9138) Checking consistency running and runnable jobs
9/9 18:51:42 (pid:9138) Tables are consistent
9/9 18:51:42 (pid:9138) Out of servers - 0 jobs matched, 79 jobs idle, 1 jobs rejected

By the way, these jobs belong to the Java universe (and all
nodes have Java), and I was able to run this many jobs successfully earlier (pretty
quickly, without this long startup pause); I have only recently started seeing this
problem. I have not restarted the cluster yet. I would really appreciate any help
in this regard.

Thanks,
Devaraj