We have seen similar log messages. Our setup has preemption
disabled (the setup recommended by section 3.6.10.5). However, turning on
more debug output shows that, I believe, the negotiator is not dividing up
the leftover "resource pie". The Condor guys told us to set

    NEGOTIATOR_CONSIDER_PREEMPTION = True

and it did help in our pool.
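For reference, a minimal sketch of how this might look in the central manager's local configuration file. Only NEGOTIATOR_CONSIDER_PREEMPTION comes from the advice we were given; the NEGOTIATOR_DEBUG line is my assumption about one way to get the extra debug output, and your file layout may differ.

```
# Sketch of a local Condor config on the central manager (not our exact file).

# Let the negotiator consider preemption when matching, as suggested to us.
NEGOTIATOR_CONSIDER_PREEMPTION = True

# Assumption: more verbose negotiator logging, to see how the pie is divided.
NEGOTIATOR_DEBUG = D_FULLDEBUG
```

After editing, the negotiator has to re-read its configuration (for example with condor_reconfig on the central manager) before the change takes effect.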
Hope this helps,
Rick

From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Devaraj Das
Sent: Saturday, September 09, 2006 12:25 PM
To: condor-users@xxxxxxxxxxx
Subject: [Condor-users] Weird problem (with condor-6.8.0)

I am trying to submit 79 jobs through a single submit file with a "Queue 79".
The jobs remain idle for a long time (approx. 30 minutes the last time I saw
this problem) before getting scheduled, but once one of them starts executing,
the others quickly follow. Although there are more than 79 idle nodes
available, these 79 jobs don’t go into execution for a long time. Any idea
why? For example, if I do a condor_q -better-analyze for one of the jobs I see:

258.078:  Run analysis summary.  Of 84 machines,
      0 are rejected by your job's requirements
      0 reject your job because of their own requirements
      1 match but are serving users with a better priority in the pool
     83 match but reject the job for unknown reasons
      0 match but will not currently preempt their existing job
      0 are available to run your job

Here is a snippet of the SchedLog
from the submit node:

9/9 18:51:26 (pid:9138) Started shadow for job 257.0 on "<66.196.90.7:32774>", (shadow pid = 26550)
9/9 18:51:31 (pid:9138) Sent ad to central manager for ddas@xxxxxxxx
9/9 18:51:31 (pid:9138) Sent ad to 1 collectors for ddas@xxxxxxxx
9/9 18:51:36 (pid:9138) DaemonCore: Command received via UDP from host <66.196.90.120:55382>
9/9 18:51:36 (pid:9138) DaemonCore: received command 421 (RESCHEDULE), calling handler (reschedule_negotiator)
9/9 18:51:36 (pid:9138) Sent ad to central manager for ddas@xxxxxxxx
9/9 18:51:36 (pid:9138) Sent ad to 1 collectors for ddas@xxxxxxxx
9/9 18:51:36 (pid:9138) Called reschedule_negotiator()
9/9 18:51:42 (pid:9138) Activity on stashed negotiator socket
9/9 18:51:42 (pid:9138) Negotiating for owner: ddas@xxxxxxxx
9/9 18:51:42 (pid:9138) Checking consistency running and runnable jobs
9/9 18:51:42 (pid:9138) Tables are consistent
9/9 18:51:42 (pid:9138) Out of servers - 0 jobs matched, 79 jobs idle, 1 jobs rejected
9/9 18:51:42 (pid:9138) Activity on stashed negotiator socket
9/9 18:51:42 (pid:9138) Negotiating for owner: ddas@xxxxxxxx
9/9 18:51:42 (pid:9138) Checking consistency running and runnable jobs
9/9 18:51:42 (pid:9138) Tables are consistent
9/9 18:51:42 (pid:9138) Out of servers - 0 jobs matched, 79 jobs idle, 1 jobs rejected

By the way, these jobs belong to the
Java universe (and all nodes have Java), and I was able to run this many jobs
successfully earlier (pretty quickly, without this long startup pause). Only
recently have I started seeing this problem. I haven't restarted the cluster
yet. I would really appreciate any help in this regard.

Thanks,
Devaraj.