
Re: [HTCondor-users] HTCondor jobs & cgroups



Hi Greg,

Just a note to say that we have done some testing on our small batch cluster. We installed 23.7.2 a few days ago and that has now been upgraded to 23.8.1.

Initial indications are that things are looking better: we’ve found no cases of users’ jobs escaping from cgroup confinement, and machines that were previously experiencing kernel panics when things went wrong are now seeing errant processes getting oom’ed. More testing by others will probably be needed, but the early signs seem to be positive.
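For anyone wanting to confirm the same behaviour on their own nodes, here is a minimal sketch, assuming cgroup v2, where each cgroup directory has a memory.events file of "name count" lines that includes an oom_kill counter. The function names are made up for illustration, and the path to the job's cgroup directory is left to the reader since it depends on the local setup.

```python
def parse_memory_events(text):
    """Parse the contents of a cgroup v2 memory.events file into a dict."""
    counters = {}
    for line in text.splitlines():
        fields = line.split()
        # Each valid line is "<event-name> <non-negative integer>".
        if len(fields) == 2 and fields[1].isdigit():
            counters[fields[0]] = int(fields[1])
    return counters


def oom_kills(text):
    """Number of processes in the cgroup killed by the OOM killer (0 if absent)."""
    return parse_memory_events(text).get("oom_kill", 0)


# Illustrative contents only, not output captured from a real node.
sample = "low 0\nhigh 0\nmax 12\noom 1\noom_kill 1\n"
print(oom_kills(sample))
```

A non-zero oom_kill count on a job's cgroup would be consistent with the errant process being killed inside its confinement rather than taking the machine down.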

 

Thanks,

 

Dan Whitehouse

Imperial College HEP.

 

From: Greg Thain <gthain@xxxxxxxxxxx>
Date: Friday, 31 May 2024 at 18:50
To: Whitehouse, Dan <d.whitehouse@xxxxxxxxxxxxxx>, HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] HTCondor jobs & cgroups


 

On 5/31/24 10:29, Whitehouse, Dan wrote:

Hi Greg,

Thanks very much for your prompt reply.

Could you advise whether this fix is likely to be backported to 23.0.12?

We are a little reluctant to deploy feature releases, but we aren’t going to rule that out and I may well end up taking your advice. However, if the fix is likely to be ported to the next LTS we may prefer to wait.

 

Hi Dan:

I appreciate your caution with respect to upgrades.  I'm not against backporting this, but I think I'd like to see it get more real-world use in the field before backporting to stable, so I can't say when it might go in.

 

-greg

 

 

Thanks,

 

Dan

 

On 5/31/24 07:31, Whitehouse, Dan wrote:

Hi, we are running HTCondor ($CondorVersion: 23.0.10 2024-05-09 BuildID: 731952 PackageID: 23.0.10-1 $) on Rocky Linux el9.
We have noticed that some jobs appear to be escaping their cgroup. We expect the jobs to run within the “htcondor” cgroup tree, but instead we see some jobs running under the condor.service cgroup tree.

 

Hi Dan:

We've recently fixed some race conditions with cgroup v2 systems.  This code is in 23.7 -- would it be possible for you to try with this version?  Also note that in 23.8, in order to properly comply with the systemd "one writer" rule, the cgroup v2 htcondor tree will be rooted under the condor.service tree created by systemd for the daemons.
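A minimal sketch of how one might check which tree a process is sitting in, assuming cgroup v2, where /proc/<pid>/cgroup holds a single entry of the form 0::<path>. The tree names htcondor and condor.service come from this thread; the helper names and the example paths (including the slot name) are illustrative assumptions, not taken from HTCondor itself. The job-tree test runs first because, as noted for 23.8, the htcondor tree may itself sit beneath condor.service.

```python
from pathlib import PurePosixPath


def cgroup_v2_path(proc_cgroup_text):
    """Return the cgroup v2 path from the contents of /proc/<pid>/cgroup, or None."""
    for line in proc_cgroup_text.splitlines():
        fields = line.split(":", 2)
        if len(fields) != 3:
            continue
        hierarchy, controllers, path = fields
        # The v2 entry has hierarchy ID 0 and an empty controller list.
        if hierarchy == "0" and controllers == "":
            return path
    return None


def classify(path, job_tree="htcondor", daemon_unit="condor.service"):
    """Classify a cgroup path as 'job-tree', 'daemon-tree' or 'other'."""
    parts = PurePosixPath(path).parts
    # Check the job tree first: it may be nested under the daemon unit.
    if job_tree in parts:
        return "job-tree"
    if daemon_unit in parts:
        return "daemon-tree"
    return "other"


# Hypothetical paths for illustration only.
print(classify("/htcondor/slot1_1"))
print(classify("/system.slice/condor.service"))
print(classify("/system.slice/condor.service/htcondor/slot1_1"))
```

To check a live process, pass the contents of /proc/<pid>/cgroup to cgroup_v2_path and the result to classify. A job process that comes back as "daemon-tree" would match the escape described above.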

-greg