Although what I've found is that I do want my vanilla jobs to suspend when the load average is high, especially when the load comes from other users. But I don't want them to be killed, ever.
I changed SUSPEND_VANILLA to True, and I'm hoping that will ensure vanilla jobs always get suspended, not killed, when condor decides preempting or vacating may be necessary.
On Thu, Dec 9, 2010 at 1:32 PM, Matthew Farrellee
<matt@xxxxxxxxxx> wrote:
On 12/09/2010 01:03 PM, Erik Aronesty wrote:
I'm very new to condor, and although I seem to have gotten it working
(one submit node, 6 compute nodes, 36 slots), and am running jobs, I
have a couple questions:
1. Where can I look to find out precisely why jobs are vacating and
restarting?
2. For now, I'm using dedicated machines... and thus I don't want
vanilla jobs to "vacate/kill/die" since it just means they get
restarted... usually 90% of the way through them. (I haven't tried,
yet, compiling with condor libs and running standard universe jobs...
but I'd like the config to be done nicely for them.) If a job without
checkpointing is preempted, or if the cpu gets busy, I'd like it to
SUSPEND, never vacate.
Here are the relevant configs I can think of. I think perhaps
the KILL_VANILLA and VACATE_VANILLA won't do what I expect, and condor
may use "more drastic measures" anyway (although I'm not sure what "more
drastic" means).
SUSPEND = $(CPUBusy)
WANT_SUSPEND = True
MAXVACATETIME = 20 * $(MINUTE)
VACATE = $(ActivityTimer) > $(MaxSuspendTime)
VACATE_VANILLA = False
WANT_VACATE = True
KILL = $(UWCS_KILL)
KILL_VANILLA = False
PREEMPT = $(UWCS_PREEMPT)
PREEMPT_VANILLA = False
Yet I still get stuff like this when looking at the queue:
LastVacateTime = 1291916587
and this when grepping the logs...
Changing state and activity: Claimed/Idle -> Preempting/Vacating
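For reference, LastVacateTime is stored as seconds since the Unix epoch, so converting it to a readable time makes it easier to line up with the log entries. A minimal Python sketch (the function name is made up for illustration, not part of Condor):

```python
from datetime import datetime, timezone


def classad_time_to_utc(seconds):
    """Convert an epoch-seconds attribute value to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


if __name__ == "__main__":
    # Value taken from the LastVacateTime attribute quoted above.
    print(classad_time_to_utc(1291916587).isoformat())  # 2010-12-09T17:43:07+00:00
```

Note that log timestamps are normally in the machine's local time, so adjust for the timezone offset when comparing.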
The StartLog is going to be your best source of information, though you may need to set STARTD_DEBUG = D_FULLDEBUG.
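As a rough sketch of how to pull those transitions out of the StartLog, the Python below matches only the "Changing state and activity" message quoted earlier in this thread. It assumes nothing about the timestamp or slot prefix on each line, and the log path is site-specific. The function names are illustrative, not part of Condor.

```python
import re
import sys
from collections import Counter

# Matches only the message text quoted in this thread; whatever prefix the
# log puts in front of it (timestamp, slot name) is ignored.
TRANSITION = re.compile(
    r"Changing state and activity: (\w+)/(\w+) -> (\w+)/(\w+)"
)


def find_transitions(lines):
    """Yield (old_state, old_activity, new_state, new_activity, raw_line)."""
    for line in lines:
        match = TRANSITION.search(line)
        if match:
            yield (*match.groups(), line.rstrip("\n"))


def count_transitions(lines):
    """Count how often each state/activity transition appears."""
    counts = Counter()
    for old_s, old_a, new_s, new_a, _ in find_transitions(lines):
        counts[f"{old_s}/{old_a} -> {new_s}/{new_a}"] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python transitions.py /path/to/StartLog
    with open(sys.argv[1], errors="replace") as fh:
        for transition, n in count_transitions(fh).most_common():
            print(n, transition)
```

Once you know which transitions happen most, the lines just before each one in the log are the place to look for the expression that triggered it.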
Since your resources are dedicated and you are starting out, you might be interested in starting with a very simple policy that allows everything to run to completion. You can build from it later.
START = TRUE
WANT_SUSPEND = FALSE
WANT_VACATE = FALSE
SUSPEND = FALSE
PREEMPT = FALSE
KILL = FALSE
Above is config that should be set on any nodes running a condor_startd.
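If you want a quick sanity check that a node's local config file really contains that run-to-completion policy, something like the Python below could work. It is a deliberately simplified sketch: it reads plain NAME = VALUE lines only, does no macro expansion, and ignores line continuations and included files, so it is no substitute for asking Condor itself what value it resolved. The function names and the EXPECTED table are made up for illustration.

```python
# Settings from the simple startd policy above, compared case-insensitively.
EXPECTED = {
    "START": "TRUE",
    "WANT_SUSPEND": "FALSE",
    "WANT_VACATE": "FALSE",
    "SUSPEND": "FALSE",
    "PREEMPT": "FALSE",
    "KILL": "FALSE",
}


def parse_config(text):
    """Parse NAME = VALUE lines; a later assignment overrides an earlier one."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        settings[name.strip().upper()] = value.strip()
    return settings


def policy_mismatches(text, expected=EXPECTED):
    """Return (name, wanted, found) for every setting that differs or is missing."""
    found = parse_config(text)
    return [
        (name, want, found.get(name))
        for name, want in expected.items()
        if (found.get(name) or "").upper() != want
    ]


if __name__ == "__main__":
    sample = """
    START = TRUE
    WANT_SUSPEND = FALSE
    WANT_VACATE = FALSE
    SUSPEND = FALSE
    PREEMPT = FALSE
    KILL = FALSE
    """
    print(policy_mismatches(sample))  # prints []
```

Any setting still pointing at something like $(CPUBusy) or $(UWCS_PREEMPT) shows up in the returned list.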
Now, there's one other thing to consider: preemption. You can disable it with,
PREEMPTION_REQUIREMENTS = FALSE
RANK = 0
PREEMPTION_REQUIREMENTS is configuration for the machine running your condor_negotiator; RANK is a condor_startd setting, so it goes on the execute nodes.
You can read about all this in the manual,
http://www.cs.wisc.edu/condor/manual/v7.5/3_5Policy_Configuration.html
Best,
matt