On 2/24/2021 6:31 AM, David Cohen wrote:
Hi,
I was under the (apparently wrong) impression that setting
CGROUP_MEMORY_LIMIT_POLICY = HARD would suffice to kill jobs running
over the requested memory. I now understand that I have to back it up
with a SYSTEM_PERIODIC_HOLD.
Hi David,
How did you arrive at the conclusion that you need to do anything
more than set CGROUP_MEMORY_LIMIT_POLICY = HARD to have jobs placed
on hold if they exceed the memory allocated to the slot? As Christoph
stated earlier, that should be sufficient, assuming you are running
the HTCondor services with root privileges (i.e. as a system service)
and you have BASE_CGROUP defined in your config (it is defined by
default; did you change it?).
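For reference, here is a minimal sketch of the execute-node configuration described above, assuming HTCondor is started as root. BASE_CGROUP appears only to make the dependency explicit. Its default is, to my knowledge, htcondor, so if your config does not override it, the memory policy line is the only one that matters.

```
# Execute-node sketch; assumes the HTCondor daemons run as root
# (e.g. started as a system service).
# BASE_CGROUP is normally already set by default; listed here only
# to show that the cgroup-based limits depend on it.
BASE_CGROUP = htcondor
# Enforce the slot's memory size as a hard cgroup limit.
CGROUP_MEMORY_LIMIT_POLICY = hard
```

To see what the running configuration actually resolves to, `condor_config_val -verbose BASE_CGROUP` reports the value and the file that set it.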
While I'm at it, can I also use that method to remove jobs that are
using more cores than requested (CPU usage > CPUs requested)?
Assuming HTCondor is launched as root, it will automatically restrict
the CPU usage of jobs (using Linux cgroups) so that it does not exceed
the number of cores in the slot whenever there is contention for those
cores. That is, on an eight-core machine with only a single one-core
slot running and the machine otherwise idle, the job in that slot
could consume all eight CPUs concurrently. If, however, all eight
slots were running jobs, each slot configured for one CPU, the CPU
time would be divided equally among the jobs, regardless of the
number of processes or threads in each job.
Because of this, few administrators see the need to stop jobs that use
more cores than requested: the only scenario in which this can happen
is one where no other user is impacted and the cores would otherwise
sit idle. If for some reason you still want to do this, you may find
the HOWTO at this page useful:
https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToLimitCpuUsage
(specifically Option 3 on this page).
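If you do go down that road, one possible shape for such a policy is a SYSTEM_PERIODIC_HOLD expression on the schedd. This is a hypothetical sketch and not necessarily what Option 3 on the wiki page does. It assumes the job ClassAd publishes a CpusUsage attribute (roughly, the average number of cores the job has kept busy). Verify that attribute against the job ClassAd reference for your HTCondor version. If CpusUsage is undefined for a job, the expression does not evaluate to true, so the job is left alone.

```
# Schedd (submit node) sketch; hypothetical, not taken from the wiki page.
# Assumes the job ClassAd provides CpusUsage; check your version's docs.
# JobStatus == 2 means the job is currently running.
SYSTEM_PERIODIC_HOLD = (JobStatus == 2) && (CpusUsage > RequestCpus)
# Message recorded in the held job's HoldReason attribute.
SYSTEM_PERIODIC_HOLD_REASON = "Job used more CPU cores than it requested"
```

After editing, run `condor_reconfig` on the submit node so that the schedd picks up the new expression.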
Hope the above helps,
Todd