
Re: [HTCondor-users] Condor Memory Error



Hi,

Just my two cents: if you don't want memory limits to be enforced, you can set

CGROUP_MEMORY_LIMIT_POLICY = none

in the execution points' configuration.

Best
christoph 


--
Christoph Beyer
DESY Hamburg
IT-Department

Notkestr. 85
Building 02b, Room 009
22607 Hamburg

phone:+49-(0)40-8998-2317
mail: christoph.beyer@xxxxxxx


From: "Umut Türk" <umut1656@xxxxxxxxx>
To: "htcondor-users" <htcondor-users@xxxxxxxxxxx>
Sent: Wednesday, 13 May 2026 20:40:07
Subject: Re: [HTCondor-users] Condor Memory Error


Hello HTCondor Team,

I have a question about unexpected memory-limit enforcement on some jobs running under HTCondor 24.0.3.

HTCondor version and execute-node platform:
`$CondorVersion: 24.0.3 2025-01-03 BuildID: 777902 PackageID: 24.0.3-1 GitSHA: ef02b46e $`
`$CondorPlatform: x86_64_AlmaLinux9 $`

The jobs fail with the following message:

"Job has gone over cgroup memory limit of 9222 megabytes. Last measured usage: 1872 megabytes. Consider resubmitting with a higher request_memory."

The main concern is the mismatch between the enforced cgroup memory limit and the last measured usage reported by HTCondor.
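
To quantify that mismatch across many held jobs, the two numbers can be extracted from the hold message. The following is a minimal Python sketch and not part of HTCondor; the function name `parse_hold_reason` and the regular expression are assumptions based only on the message text quoted above, and other HTCondor versions may word the message differently.

```python
import re

# Pattern assumed from the hold message quoted above; it is not taken
# from HTCondor's source and may need adjusting for other versions.
HOLD_RE = re.compile(
    r"cgroup memory limit of (?P<limit>\d+) megabytes\.\s+"
    r"Last measured usage: (?P<usage>\d+) megabytes"
)


def parse_hold_reason(reason):
    """Return (limit_mb, usage_mb, usage/limit ratio), or None if no match."""
    match = HOLD_RE.search(reason)
    if match is None:
        return None
    limit_mb = int(match.group("limit"))
    usage_mb = int(match.group("usage"))
    return limit_mb, usage_mb, usage_mb / limit_mb


if __name__ == "__main__":
    msg = ("Job has gone over cgroup memory limit of 9222 megabytes. "
           "Last measured usage: 1872 megabytes. "
           "Consider resubmitting with a higher request_memory.")
    print(parse_hold_reason(msg))
```

For the message above, the last measured usage works out to roughly 20% of the enforced limit.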

I found what appears to be a related historical issue that got fixed:

Relevant HTCondor configuration on our side includes:

* `CGROUP_MEMORY_LIMIT_POLICY = hard`
* `CGROUP_POLLING_INTERVAL = 1`

To investigate whether these were genuine OOM events, we checked kernel messages on the execute nodes. We did not find corresponding OOM killer messages in the kernel logs. We also verified separately that real OOM events do appear in kernel logs on these nodes, so the lack of such messages here makes the failure mode unclear.
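
The kernel-log check described above can be scripted so that it is repeatable across execute nodes. This is a minimal Python sketch; the function name `find_oom_lines` is hypothetical, and the list of substrings is an assumption about what the kernel typically logs around OOM kills, so it may need extending for a particular kernel version.

```python
import re
import sys

# Substrings the Linux kernel commonly logs around OOM kills. This list
# is an assumption and is not guaranteed to be complete.
OOM_RE = re.compile(r"invoked oom-killer|out of memory|oom-kill:",
                    re.IGNORECASE)


def find_oom_lines(lines):
    """Return the log lines that look like OOM-killer activity."""
    return [line.rstrip("\n") for line in lines if OOM_RE.search(line)]


if __name__ == "__main__":
    # Reads kernel log text from standard input and prints the matches.
    for hit in find_oom_lines(sys.stdin):
        print(hit)
```

It can be fed kernel messages on standard input, for example the output of `journalctl -k` or `dmesg`.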

At the moment, we cannot determine why HTCondor reports that the job exceeded the cgroup memory limit when the last measured usage is substantially lower, and we do not see node-level logs indicating an OOM event.

I tried using HOOK_JOB_EXIT to get a snapshot of the cgroup information in /sys/fs/cgroup/.../[dedicated-cgroup folder for the job], but this folder is destroyed by the time HOOK_JOB_EXIT runs.
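
One way around the vanished directory is to sample the cgroup files while the job is still running, for example from a separate process on the execute node. The sketch below assumes cgroup v2 and reads the kernel's `memory.current`, `memory.peak`, `memory.max` and `memory.events` files; the function names and the idea of polling from outside HTCondor are assumptions, and the job's cgroup path has to be supplied by the caller.

```python
from pathlib import Path

# cgroup v2 memory interface files; any that are missing on a given
# kernel (or already removed) are simply skipped.
FILES = ("memory.current", "memory.peak", "memory.max", "memory.events")


def parse_flat_keyed(text):
    """Parse 'key value' lines (as in memory.events) into a dict of ints."""
    result = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            result[parts[0]] = int(parts[1])
    return result


def snapshot_cgroup(cgroup_dir):
    """Read the memory files of one cgroup directory into a dict."""
    snap = {}
    for name in FILES:
        try:
            text = (Path(cgroup_dir) / name).read_text().strip()
        except OSError:
            continue  # file absent, or the cgroup has been destroyed
        if name == "memory.events":
            snap[name] = parse_flat_keyed(text)
        else:
            # memory.max holds the literal string "max" when unlimited
            snap[name] = int(text) if text.isdigit() else text
    return snap
```

In `memory.events`, the `max` counter counts how often usage was about to exceed `memory.max`, while `oom_kill` counts processes in the cgroup killed by an OOM killer. Comparing successive snapshots can therefore show whether the limit was reached without a corresponding kill.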

Could you help clarify the following?

1. What memory metric is used to decide that a job has exceeded the cgroup memory limit in this case?
2. Can that metric differ significantly from the "Last measured usage" reported in the hold message?
3. Are there additional starter, startd, or cgroup-related logs or debug settings you would recommend collecting to diagnose this further?
4. Does this sound like a known issue in 24.0.3, or possibly related to the older issue referenced above?

Thank you for any guidance.

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe

The archives can be found at: https://www-auth.cs.wisc.edu/lists/htcondor-users/