Hello HTCondor Team,
I have a question about unexpected memory-limit enforcement on some jobs running under HTCondor 24.0.3.
HTCondor version and execute-node platform:
`$CondorVersion: 24.0.3 2025-01-03 BuildID: 777902 PackageID: 24.0.3-1 GitSHA: ef02b46e $`
`$CondorPlatform: x86_64_AlmaLinux9 $`
The jobs fail with the following message:
"Job has gone over cgroup memory limit of 9222 megabytes. Last measured usage: 1872 megabytes. Consider resubmitting with a higher request_memory."
The main concern is the mismatch between the enforced cgroup memory limit and the last measured usage reported by HTCondor.
I found what appears to be a related historical issue that got fixed:
Relevant HTCondor configuration on our side includes:
* `CGROUP_MEMORY_LIMIT_POLICY = hard`
* `CGROUP_POLLING_INTERVAL = 1`
To investigate whether these were genuine OOM events, we checked kernel messages on the execute nodes. We did not find corresponding OOM killer messages in the kernel logs. We also verified separately that real OOM events do appear in kernel logs on these nodes, so the lack of such messages here makes the failure mode unclear.
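A check of this kind can be sketched as below. This is an illustration, not the exact command we ran: the function name `find_oom_lines` is hypothetical, and the patterns are the message fragments we assume the kernel prints for OOM kills on these nodes, so the wording may vary by kernel version.

```python
import re

# Message fragments assumed to mark an OOM kill in kernel logs.
OOM_PATTERN = re.compile(
    r"invoked oom-killer|oom-kill:|out of memory: killed process",
    re.IGNORECASE,
)


def find_oom_lines(lines):
    """Return the kernel log lines that look like OOM killer activity."""
    return [line.rstrip("\n") for line in lines if OOM_PATTERN.search(line)]
```

The input can be any iterable of text lines, for example the saved output of `dmesg` or `journalctl -k` for the time window around the job hold.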
At the moment, we cannot determine why HTCondor reports that the job exceeded the cgroup memory limit when the last measured usage is substantially lower, and we do not see node-level logs indicating an OOM event.
I tried using `HOOK_JOB_EXIT` to capture a snapshot of the cgroup information in /sys/fs/cgroup/.../[dedicated-cgroup folder for the job], but that directory has already been removed by the time `HOOK_JOB_EXIT` runs.
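For context, the snapshot we are trying to capture looks roughly like the sketch below. It assumes cgroup v2 and reads the standard memory controller files (`memory.current`, `memory.peak`, `memory.max`, `memory.events`). The function name `snapshot_cgroup_memory` is hypothetical, and the job's cgroup directory has to be passed in by the caller.

```python
from pathlib import Path

# Single-value cgroup v2 memory files. memory.peak may be absent on older kernels.
SINGLE_VALUE_FILES = ("memory.current", "memory.peak", "memory.max")


def snapshot_cgroup_memory(cgroup_dir):
    """Read memory counters from one cgroup v2 directory.

    Values are kept as strings because memory.max can be the literal "max".
    A file that cannot be read (missing, or cgroup already removed) maps to None.
    """
    base = Path(cgroup_dir)
    snapshot = {}
    for name in SINGLE_VALUE_FILES:
        try:
            snapshot[name] = (base / name).read_text().strip()
        except OSError:
            snapshot[name] = None

    # memory.events is a flat-keyed file: one "key value" pair per line.
    try:
        text = (base / "memory.events").read_text()
    except OSError:
        snapshot["memory.events"] = None
    else:
        events = {}
        for line in text.splitlines():
            parts = line.split()
            if len(parts) == 2:
                events[parts[0]] = int(parts[1])
        snapshot["memory.events"] = events
    return snapshot
```

Because the directory is gone at job exit, something like this would have to run while the job is still alive, for example polled from a wrapper around the job. The `oom` and `oom_kill` counters in `memory.events` are the values we would most like to compare against the hold message.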
Could you help clarify the following?
1. What memory metric is used to decide that a job has exceeded the cgroup memory limit in this case?
2. Can that metric differ significantly from the "Last measured usage" reported in the hold message?
3. Are there additional starter, startd, or cgroup-related logs or debug settings you would recommend collecting to diagnose this further?
4. Does this sound like a known issue in 24.0.3, or possibly related to the older issue referenced above?
Thank you for any guidance.