Re: [HTCondor-users] Controlling Jobs - Disabling/preventing a job from being suspended
- Date: Wed, 25 Sep 2013 12:32:16 -0500
- From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] Controlling Jobs - Disabling/preventing a job from being suspended
On 9/24/2013 7:36 PM, Andrey Kuznetsov wrote:
> Hi,
> The nodes are set up to suspend jobs based on various conditions.
> I'm wondering whether it is possible to pass an argument or setting in the
> condor submit file that disables SUSPEND mode.
> I have a job that eats up a lot of RAM and cannot be checkpointed, so I
> want to make sure that it finishes running no matter what.
> I also do not want to modify the node's suspend settings, unless it's to add
> a conditional statement that I can then manipulate from the condor submit
> file (I don't know if that's possible).
I assume you are running HTCondor v8.0.x or above... (you didn't say)
The WANT_SUSPEND knob in the config file controls whether suspension will
happen on a node. The MAXJOBRETIREMENTTIME knob specifies a time X in
seconds that essentially says "do not preempt or kill this job for any
reason until it has run for at least X seconds". Both of these knobs
are ClassAd expressions that can refer to any attribute in the job ad
(including custom attributes).
So for the above, you could put in condor_config.local (on all machines):
# If job attribute DoNotSuspendJob is explicitly set to True,
# then do not allow suspend mode. If job does not say one way or
# the other, allow suspend mode.
WANT_SUSPEND = DoNotSuspendJob =!= True
# If job attribute DoNotKillJob is explicitly set to True, then
# never interrupt the job unless it has run for more than
# 604800 seconds (i.e. one week, just to catch runaway jobs).
MAXJOBRETIREMENTTIME = 604800 * ( DoNotKillJob =?= True )
After doing the above, don't forget to do "condor_reconfig -all" from
your central manager.
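A note on the two comparison operators used in the config above, since they are easy to mix up: =?= ("is identical to") and =!= ("is not identical to") always evaluate to True or False, even when the job ad does not define the attribute, whereas plain == and != evaluate to UNDEFINED in that case. A sketch of how the two comparisons come out for each kind of job, written as config comments:
# DoNotSuspendJob in job ad     DoNotSuspendJob =!= True  (WANT_SUSPEND)
#   True                          False -> job is not suspended
#   False                         True  -> job may be suspended
#   (not set)                     True  -> job may be suspended
#
# DoNotKillJob in job ad        DoNotKillJob =?= True
#   True                          True
#   False                         False
#   (not set)                     False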
Now you could submit jobs w/ a submit file like so (note the + sign to
insert a custom attribute):
# Disable suspend mode, but job will still be preempted when
# the PREEMPT expression becomes True
executable = foo
+DoNotSuspendJob = True
queue
Or like this:
# Disable suspend mode AND do not kill/preempt for any reason
# once job is started, unless job has already run for a week
executable = bar
+DoNotSuspendJob = True
+DoNotKillJob = True
queue
Take these examples w/ a grain of salt, they are off the top of my head,
but should point you in the right direction. You can get more
information from the Manual, look up WANT_SUSPEND and/or
MAXJOBRETIREMENTTIME in the index.
BTW, maybe instead of DoNotKillJob being True/False, you could just have
an integer attribute and allow the job submit file to specify how long
it can run w/o any interruption instead of hard-coding to a week...
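A minimal sketch of that idea, assuming a made-up job attribute name MaxUninterruptedSecs (any custom name works as long as the config and the submit file agree) and keeping the one-week figure from the example above as a safety cap. Like the earlier examples, this is untested and only meant to show the shape of the expression. In condor_config.local:
# If the job supplies an integer MaxUninterruptedSecs, honor it up to
# one week (604800 seconds); otherwise give no retirement time at all.
# MaxUninterruptedSecs is a hypothetical attribute name.
MAXJOBRETIREMENTTIME = ifThenElse( isInteger(MaxUninterruptedSecs), ifThenElse( MaxUninterruptedSecs > 604800, 604800, MaxUninterruptedSecs ), 0 )
And in the submit file:
# Disable suspend mode and ask for 2 days (172800 seconds)
# without being killed/preempted
executable = bar
+DoNotSuspendJob = True
+MaxUninterruptedSecs = 172800
queue
The cap is a design choice rather than a requirement: without it, a job could request an arbitrarily long uninterrupted run, which defeats the purpose of catching runaway jobs.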
regards,
Todd