
[HTCondor-users] cgroups and OOM killers



Hello HTCondor experts,

We're seeing some interesting behaviour with user jobs on our local HTCondor cluster, running version 9.8.
Basically, if a job goes far enough over its memory limit that its cgroup
can no longer allocate the accounted memory needed for basic functioning
of the system as a whole (e.g. to hold its cmdline), then that one cgroup
affects the whole machine and can bring it down. That is a worse outcome
than condor not being able to fully recover the status/failure reason for
a single container. And since oom_kill_disable is set to 1 on the
condor-managed cgroups, the kernel will not intervene, so the entire
system grinds to a halt. We would much rather lose state for a single
job, let the kernel do its thing, and have the system survive.
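
For reference, the flag is easy to check on an execute node; with the
cgroup v1 layout we have (paths may differ with local configuration),
every condor-managed job cgroup reports oom_kill_disable 1:

    grep oom_kill_disable /sys/fs/cgroup/memory/htcondor/condor*/memory.oom_control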
For now, the only workaround is to keep re-running

    for i in /sys/fs/cgroup/memory/htcondor/condor*/memory.oom_control ; do echo 0 > $i ; done

in a loop, to ensure the sysadmin-intended settings are applied to the
condor-managed cgroups.
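
In case it's useful to anyone else, below is roughly what we run
periodically as root (from cron, or equally a simple while/sleep loop;
the wrapper script and the interval are just our own choice, nothing
condor provides):

    #!/bin/bash
    # Re-enable the kernel OOM killer in all condor-managed cgroups.
    # HTCondor re-creates the cgroups (and the flag) for every new job,
    # so this has to be re-applied periodically.
    for f in /sys/fs/cgroup/memory/htcondor/condor*/memory.oom_control ; do
        [ -e "$f" ] && echo 0 > "$f"
    done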
Is there a configuration knob to keep oom_kill_disable at 0? Shouldn't
this be an option, or was there another reason for oom_kill_disable
being set to 1?
Thanks,

Mary