Hello HTCondor Team,
I have a question about unexpected
memory-limit enforcement on some jobs running
under HTCondor 24.0.3.
HTCondor version and execute-node
platform:
`$CondorVersion: 24.0.3 2025-01-03 BuildID: 777902 PackageID: 24.0.3-1 GitSHA: ef02b46e $`
`$CondorPlatform: x86_64_AlmaLinux9 $`
The jobs fail with the following
message:
"Job has gone over cgroup memory
limit of 9222 megabytes. Last measured usage: 1872
megabytes. Consider resubmitting with a higher
request_memory."
The main concern is the mismatch
between the enforced cgroup memory limit and the
last measured usage reported by HTCondor.
I found what appears to be a related
historical issue that got fixed:
Relevant HTCondor configuration on
our side includes:
* `CGROUP_MEMORY_LIMIT_POLICY = hard`
* `CGROUP_POLLING_INTERVAL = 1`
To investigate whether these were
genuine OOM events, we checked kernel messages on
the execute nodes. We did not find corresponding
OOM killer messages in the kernel logs. We also
verified separately that real OOM events do appear
in kernel logs on these nodes, so the lack of such
messages here makes the failure mode unclear.
At the moment, we cannot determine
why HTCondor reports that the job exceeded the
cgroup memory limit when the last measured usage
is substantially lower, and we do not see
node-level logs indicating an OOM event.
I tried using HOOK_JOB_EXIT to get a
snapshot of the cgroup information in
/sys/fs/cgroup/.../[dedicated-cgroup folder for
the job], but this folder is destroyed by the time
HOOK_JOB_EXIT runs.
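
For reference, this is roughly the snapshot I would like to capture while the job's cgroup still exists. It is only a minimal sketch, assuming cgroup v2 and that the caller already knows the job's cgroup directory; `snapshot_cgroup` and `parse_events` are hypothetical helper names of my own, not HTCondor functions, and `memory.peak` may not exist on every kernel.

```python
from pathlib import Path

# cgroup v2 memory interface files of interest.
# memory.peak is not available on all kernels, so a missing file is tolerated.
FILES = (
    "memory.current",
    "memory.peak",
    "memory.max",
    "memory.swap.current",
    "memory.events",
)


def snapshot_cgroup(cgroup_dir):
    """Return {filename: contents} for the memory files readable in cgroup_dir."""
    snap = {}
    for name in FILES:
        path = Path(cgroup_dir) / name
        try:
            snap[name] = path.read_text().strip()
        except OSError:
            # File is missing, or the cgroup has already been removed.
            continue
    return snap


def parse_events(text):
    """Parse memory.events content ('key value' per line) into a dict of ints."""
    events = {}
    for line in text.splitlines():
        key, _, value = line.partition(" ")
        if value:
            events[key] = int(value)
    return events
```

Polling this periodically from a wrapper around the job, and comparing the `max` and `oom_kill` counters from `memory.events` against `memory.current`, should show whether the kernel itself recorded the limit being reached.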
Could you help clarify the
following?
1. What memory metric is used to
decide that a job has exceeded the cgroup memory
limit in this case?
2. Can that metric differ
significantly from the "Last measured usage"
reported in the hold message?
3. Are there additional starter,
startd, or cgroup-related logs or debug settings
you would recommend collecting to diagnose this
further?
4. Does this sound like a known
issue in 24.0.3, or possibly related to the older
issue referenced above?
Thank you for any guidance.