Re: [HTCondor-users] nodes without cgrouped jobs?
- Date: Fri, 29 Jun 2018 18:42:02 -0700
- From: Stuart Anderson <anderson@xxxxxxxxxxxxxxxx>
- Subject: Re: [HTCondor-users] nodes without cgrouped jobs?
> On 06/29/2018 03:16 AM, Thomas Hartmann wrote:
>
> Hi Greg,
>
> many thanks - turning on cgroup delegation seems to do the trick!! :)
>
> With the delegate option on for all controllers in the Condor unit's
> service section [1], the job slices survived all new/restarts of other
> units!
I can confirm that Greg's suggestion to set Delegate=true solved a similar cgroup problem for LIGO. As Greg discovered, under certain circumstances (e.g., after a systemctl daemon-reload or a restart of a non-Condor service) systemd was moving Condor user processes in dynamic slots out of their Condor-created cgroups, which bypassed the resource rules put in place for each dynamic slot.
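For anyone who wants to check whether a job process has been moved out of its cgroup, here is a rough sketch that parses /proc/<pid>/cgroup. The marker "htcondor" (the cgroup name Condor is configured to use) and the list of controllers to check are my assumptions, and the function names are made up for illustration; adjust them to your own configuration.

```python
from pathlib import Path


def parse_cgroup_file(text):
    """Map each controller name to its cgroup path.

    Each line of /proc/<pid>/cgroup has the form
    "hierarchy-ID:controller-list:cgroup-path".
    """
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        _, controllers, path = line.split(":", 2)
        # A controller list such as "cpu,cpuacct" shares one path.
        for controller in controllers.split(","):
            entries[controller] = path
    return entries


def controllers_outside(text, marker="htcondor",
                        wanted=("memory", "cpu", "cpuacct")):
    """Return the wanted controllers whose path lacks `marker`.

    `marker` and `wanted` are assumptions, not values taken from Condor.
    """
    entries = parse_cgroup_file(text)
    return sorted(c for c in wanted
                  if c in entries and marker not in entries[c])


def check_pid(pid, marker="htcondor"):
    """Run the check against a live process (Linux only)."""
    text = Path(f"/proc/{pid}/cgroup").read_text()
    return controllers_outside(text, marker)
```

Calling check_pid() on the PID of a job process in a dynamic slot should return an empty list while the process is still inside its Condor-created cgroups; any controller names it returns are the ones where the process has ended up elsewhere.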
A cgroup expert should confirm, but according to https://www.freedesktop.org/wiki/Software/systemd/ControlGroupInterface/ it looks like Condor should be setting Delegate=yes by default:
"Services must set Delegate=yes for the units they intend to manage subcgroups of. If they create and manipulate cgroups outside of units that have Delegate=yes set, they violate the access contract for control groups."
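In case it helps others, one way to set this without editing the packaged unit file is a systemd drop-in. This is only a sketch: I am assuming the unit is named condor.service, and the drop-in file name is arbitrary.

```ini
# /etc/systemd/system/condor.service.d/delegate.conf
# (assumed unit name condor.service; the file name is hypothetical)
[Service]
# Tell systemd that this service manages its own sub-cgroups.
Delegate=yes
```

After adding the file, run systemctl daemon-reload and restart the Condor service so the setting takes effect.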
Kudos to Greg for tracking this down!
P.S. So far this has only been confirmed by testing on a single machine, but I fully expect it to solve a cluster-wide problem once we restart all of the startds with Delegate=true.
--
Stuart Anderson anderson@xxxxxxxxxxxxxxxx
http://www.ligo.caltech.edu/~anderson