
Re: [HTCondor-users] Cgroups v2 and memory limits for WLCG sites



Hi Andreas,

Jobs got killed under the "none" CGROUP_MEMORY_LIMIT_POLICY because our own policy killed jobs that used more than four times the requested memory.

With the current setting, jobs are no longer killed by Condor. Instead, the Out-Of-Memory killer kills them when they reach their memory+swap limit. Yes, some of the jobs use swap, mostly LHCb jobs, but only at a low level. With DISABLE_SWAP_FOR_JOB, jobs should no longer use swap at all.
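
As a rough sketch, the execute-node configuration would then look something like this. The first two lines are the settings quoted below; writing DISABLE_SWAP_FOR_JOB as "true" is an assumption about how the knob would be enabled, not a setup confirmed in this thread:

        # Custom cgroup policy: hard limit at twice the requested memory
        CGROUP_MEMORY_LIMIT_POLICY = custom
        CGROUP_HARD_MEMORY_LIMIT_EXPR = 2 * Target.RequestMemory
        # Assumption: keep jobs out of swap entirely
        DISABLE_SWAP_FOR_JOB = true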

The update to HTCondor 23.10 solves the monitoring problem in HTCondor of the page cache being counted in a job's memory usage. Whether you need to update depends on how you want to enforce your memory limits.

Cheers,

Matthias


On 8/1/24 15:58, Andreas Haupt wrote:
Hi Matthias,

On Mon, 2024-07-29 at 10:02 +0200, Matthias Schnepf wrote:
Hi,
We moved to memory limits by cgroup v2. The memory usage of jobs with cgroups v2 is higher than with cgroup v1 monitoring due to the counting of the page cache. We would also appreciate it if Condor did not count the page cache in the memory usage of the jobs.

Initially, we had problems with the "none" CGROUP_MEMORY_LIMIT_POLICY in Condor since the page cache was also accounted for in the memory usage. The CEs killed jobs that used less than the requested amount of memory, but with the page cache included in the memory usage, they "used" more than four times the requested amount of memory.
We now set custom cgroup settings:
        CGROUP_MEMORY_LIMIT_POLICY = custom
        CGROUP_HARD_MEMORY_LIMIT_EXPR = 2 * Target.RequestMemory

With that, the page cache cleanup gets triggered, and normally behaving jobs do not get killed. Jobs that need more than two times the requested memory still get killed.


as we suffer from the same problem, I played a bit with those settings. Are jobs really killed at KIT in the end?

It looks like jobs exceeding their requested memory limit start paging (swapping) just like crazy. That's also not desired, if you ask me.

Is the only solution to upgrade to the feature release 23.10 on all execution nodes?

Cheers,
Andreas
-- 
| Andreas Haupt            | E-Mail: andreas.haupt@xxxxxxx
| DESY, Zeuthen            | WWW:    http://www.zeuthen.desy.de/~ahaupt
| Platanenallee 6          | Phone: +49/33762/7-7359
| D-15738 Zeuthen          |






