Hi Greg,

our general idea was to reserve 1-2 core-weight equivalents plus a bit of memory on the execution points for system processes, which should not have a real performance impact with SMT enabled.
But after another discussion, we are now gravitating more towards moving the reweighting/reserving of resource shares into the condor group. The specific issue is that we occasionally observe execution point startds being overloaded with calculating slot weights. When it spends all its time calculating slot weights, such a node can go missing from its collector. Although it would not solve the underlying issue (special user jobs), one option might be to reserve core weights and memory for the whole condor group, but reweight job requirements in a transformation so that the summed core weights and memory of the job child cgroups are not at 100% but leave 1-2% for the startd (and other processes).
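
As a rough illustration of the reweighting arithmetic only (this is not HTCondor configuration or API; `JobRequest`, `reweight`, and the 2% default are hypothetical names and values chosen for the sketch), the requests could be scaled so that their sum stays below the node total:

```python
import math
from dataclasses import dataclass


@dataclass
class JobRequest:
    """Hypothetical container for what a job asks for."""
    cpus: float       # requested cores (core weight)
    memory_mb: int    # requested memory in MB


def reweight(jobs, total_cpus, total_memory_mb, reserve_fraction=0.02):
    """Scale job requests down so that their sum leaves
    `reserve_fraction` of the node for the startd and other processes.

    Requests are only ever scaled down; an under-subscribed node
    is returned unchanged.
    """
    if not 0.0 <= reserve_fraction < 1.0:
        raise ValueError("reserve_fraction must be in [0, 1)")

    # Budget that the job cgroups may use in total.
    cpu_budget = total_cpus * (1.0 - reserve_fraction)
    mem_budget = total_memory_mb * (1.0 - reserve_fraction)

    cpu_sum = sum(j.cpus for j in jobs)
    mem_sum = sum(j.memory_mb for j in jobs)

    # Scale factors, capped at 1.0 so nothing is inflated.
    cpu_scale = min(1.0, cpu_budget / cpu_sum) if cpu_sum > 0 else 1.0
    mem_scale = min(1.0, mem_budget / mem_sum) if mem_sum > 0 else 1.0

    return [
        JobRequest(
            cpus=j.cpus * cpu_scale,
            # Round memory down so the sum never exceeds the budget.
            memory_mb=math.floor(j.memory_mb * mem_scale),
        )
        for j in jobs
    ]
```

With two jobs that together request a full 64-core, 256000 MB node, each request would be shrunk by roughly 2%, leaving that remainder for the startd. How such a scaling would actually be expressed in a job transform is not covered by this sketch.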

Cheers,
  Thomas

On 16/02/2024 20.34, Greg Thain via HTCondor-users wrote:
> On 2/16/24 05:12, Thomas Hartmann wrote:
>> Hi all,
>>
>> I would like to enforce also under cgroups v2 memory limits around 95% of the total memory. However, I am not sure, how Condors OOM watchdog would react to it?
>
> Hi Thomas:
>
> I'm not quite sure what your requirements are here -- are you OK if any one job goes over the per-slot memory limit, but only care if all the jobs, in total, go over some limit?
>
> -greg