If so, would something like the following (based on examples from the
wiki page), in an environment with cgroups enabled, place a job on hold
when the job process tree allocates more resident memory than was asked
for in the request_memory submit file attribute?

# Allow jobs to not be limited by request_memory otherwise
# this policy can never be triggered
CGROUP_MEMORY_LIMIT_POLICY=none
# hold jobs that are more than 10% over requested memory
MEMORY_EXCEEDED = ((MemoryUsage*1.1 > request_memory) =!= TRUE)
PREEMPT = $(PREEMPT)) || $(MEMORY_EXCEEDED)
WANT_SUSPEND = False
WANT_HOLD = $(MEMORY_EXCEEDED)
WANT_HOLD_REASON = ifThenElse( $(MEMORY_EXCEEDED), \
    "Your job used more resident memory than it requested.", \
    undefined )

Without actually testing the above, off the top of my head the idea
looks like it should work. Note that the above has a syntax error in
the PREEMPT expression due to an unmatched parenthesis; you probably wanted

PREEMPT = ($(PREEMPT)) || $(MEMORY_EXCEEDED)
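
For reference, here is a sketch of the whole fragment with that one fix
applied. Everything else is copied from the question and is just as
untested. The extra parentheses around $(MEMORY_EXCEEDED) are my own
addition, on the assumption that config macros are substituted as plain
text, so wrapping each one keeps operator precedence from surprising you.

```
# Allow jobs to not be limited by request_memory, otherwise
# this policy can never be triggered
CGROUP_MEMORY_LIMIT_POLICY = none

# hold jobs that are more than 10% over requested memory
# (expression copied unchanged from the question, untested)
MEMORY_EXCEEDED = ((MemoryUsage*1.1 > request_memory) =!= TRUE)

# Parentheses balanced; each macro is wrapped because $(...) is
# substituted as text before the expression is evaluated
PREEMPT = ($(PREEMPT)) || ($(MEMORY_EXCEEDED))
WANT_SUSPEND = False
WANT_HOLD = ($(MEMORY_EXCEEDED))
WANT_HOLD_REASON = ifThenElse( $(MEMORY_EXCEEDED), \
    "Your job used more resident memory than it requested.", \
    undefined )
```

When you test, it is worth double-checking the MEMORY_EXCEEDED line: in
ClassAd expressions, "=!= TRUE" is true whenever the comparison is false
or undefined, which may be the opposite of what the comment above it
describes.
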
Also note that jobs will not be preempted until they exhaust their
MaxJobRetirementTime, which is the time HTCondor promises to let the job
run without being preempted for any reason. So if you want to immediately
hold jobs that exceed memory usage even if the jobs have specified a
MaxJobRetirementTime, and you are using HTCondor v8.2 or above, you will
want to use the template at
https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToLimitCpuUsage
and just replace $(CPU_EXCEEDED) with $(MEMORY_EXCEEDED).
Nice work Roderick, thanks for sharing!