Subject: Re: [HTCondor-users] Killing job instead of putting on hold when Memory is exhausted
From: Tomas Kouba <tomas.kouba@xxxxxxx> Date: 05/23/2016 04:40 AM
> Hello,
>
> I have configured HTCondor to run jobs with limited amount of memory
> via cgroups.
> Now I am testing what happens with jobs that allocate too much:
> - put on Hold
> - HoldReason = "Error from slot1@<node>: Job has gone over memory limit"
>
> Is it possible to tell HTCondor to kill the job instead of putting
> jobs on hold?
> (actually I would prefer killing jobs instead of holding under all
> circumstances, not only memory
> exhaustion).
The hold action is tied to the cgroup OOM killer, so
it is not something users can configure, but you can implement your desired policy by setting
a "periodic_remove" expression. For example, check whether JobStatus is 5 (held)
and whether HoldReasonCode (page 955 of the 8.4.6 manual) is 34, which indicates
that a memory limit was hit. When both conditions are true, periodic_remove evaluates
to true, and the job leaves the queue at the next evaluation interval after being held
due to memory exhaustion.
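
As a minimal sketch, the two conditions described above could be combined in the submit description file like this (only the attribute names and values already mentioned are used; adapt to your own setup):

```
# Remove the job once it is held (JobStatus 5) because a
# memory limit was hit (HoldReasonCode 34).
periodic_remove = (JobStatus == 5) && (HoldReasonCode == 34)
```

Jobs held for any other reason stay in the queue, since the expression only becomes true for code 34.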
My own system_periodic_remove expression allows held
jobs to stay in the queue for up to five days keyed from the CompletionDate attribute,
so that the users can see them and adjust their submissions accordingly but
without requiring the users to manually clean up after themselves (since we all
know how that usually goes).
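
The expression itself is not shown in this message. Purely as an illustration of a five-day grace period, and not the expression described above, a pool-wide setting in the schedd configuration might look like the following. It assumes the job's EnteredCurrentStatus attribute as the time reference instead of CompletionDate:

```
# Hypothetical sketch: remove jobs that have been held for more than
# five days. EnteredCurrentStatus is an assumption made here; the
# expression described in the text is keyed from CompletionDate.
SYSTEM_PERIODIC_REMOVE = (JobStatus == 5) && \
    ((time() - EnteredCurrentStatus) > (5 * 24 * 60 * 60))
```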
One thing you might consider, however, is using the
34 code to resize the job's memory request by some factor and allowing it to restart.
You can do this with a periodic_release expression coupled with a request_memory
expression that sets the memory request to either the baseline value for
the job or some percentage increase over the memory used in the last run, where
the memory was exhausted. This lets the job claim more memory at each run
until it is able to finish successfully. To limit the number of attempts,
incorporate NumJobStarts in the expression.
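
A rough sketch of that retry policy for a submit description file follows. The 2048 MB baseline, the 1.5 growth factor and the limit of four starts are made-up values for illustration, and the fragment has not been tested against a particular HTCondor version:

```
# Baseline of 2048 MB on the first run (MemoryUsage is not yet defined);
# on later runs, request 150% of the memory used in the previous run.
request_memory = ifThenElse(isUndefined(MemoryUsage), 2048, int(1.5 * MemoryUsage))

# Release the job only if it was held for exceeding its memory limit
# (HoldReasonCode 34) and it has started fewer than four times.
periodic_release = (HoldReasonCode == 34) && (NumJobStarts < 4)
```

Once NumJobStarts reaches the limit, the job stays held, so it can be combined with a periodic_remove or system_periodic_remove policy to clear it out of the queue eventually.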