
Re: [HTCondor-users] Condor Memory Error



Hi,
I have observed this issue recently and asked about it:

https://www-auth.cs.wisc.edu/lists/htcondor-users/2026-February/msg00067.shtml

I then met with the developers, and their summary (by Tom Smith) was this:


It's not completely unusual to see the message "Job has gone over cgroup memory limit of X. Last measured usage is Y", where the value of Y is much less than X.

What can happen is that condor samples the job's memory usage only every 5 seconds, so between two samples the job can allocate enough memory to exceed the limit, and the cgroup kills it the moment it goes a byte over.

If a job consumes memory very slowly, the reported usage will be pretty close to the limit, but if the memory usage explodes, the two values can be quite different (many GB apart).

In this case the "fix" is to request more memory for the job, or perhaps to investigate why the job's executable is using more memory than you think it should (and so quickly).
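
As a rough illustration of the sampling gap described above (my own sketch, not HTCondor code): assume usage grows linearly and is polled at a fixed interval counted from job start. The function name, the starting usage and the growth rates are made up for the example; only the 9222 MB limit and the 5 second interval come from this thread.

```python
import math


def last_sample_before_limit(limit_mb, start_mb, growth_mb_per_s, poll_interval_s):
    """Return the last polled usage (MB) before the job crosses the limit.

    Simplified model: usage(t) = start_mb + growth_mb_per_s * t, and the
    poller reads usage at t = 0, poll_interval_s, 2 * poll_interval_s, ...
    """
    if growth_mb_per_s <= 0 or poll_interval_s <= 0:
        raise ValueError("growth rate and polling interval must be positive")
    if start_mb >= limit_mb:
        raise ValueError("job already starts at or above the limit")
    # Time at which the job reaches the limit and gets killed.
    t_kill = (limit_mb - start_mb) / growth_mb_per_s
    # Time of the last poll that happened at or before the kill.
    t_last_poll = math.floor(t_kill / poll_interval_s) * poll_interval_s
    return start_mb + growth_mb_per_s * t_last_poll


# Slow leak: the last sample ends up close to the limit.
slow = last_sample_before_limit(9222, 1000, growth_mb_per_s=10, poll_interval_s=5)
# Sudden spike: the job is killed before the second sample is ever taken.
fast = last_sample_before_limit(9222, 1000, growth_mb_per_s=2000, poll_interval_s=5)
print("slow growth, last sample (MB):", slow)
print("fast growth, last sample (MB):", fast)
```

In the slow case the last sample is within a few tens of MB of the limit; in the fast case it is still the starting value, which is the kind of large X versus Y gap seen in the hold message.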

The second case we talked about is where the worker node running the jobs ran out of memory and the system itself started to OOM-kill things. Condor jobs are pretty attractive targets for the OOM killer, so they are usually the first to go.

In this case you need to work out how much memory your system uses as overhead (memory that should not be offered to jobs).

At our site we ran into this issue when running memory-heavy jobs. Even though the jobs fit within the memory allocated to them, condor ended up handing out ALL of the memory it detected on the node. We set aside 10GB that condor is not allowed to advertise for jobs, to leave room for the system and for things like the GPFS filesystem, which also consumes memory. We set this on our Execution Points (workers); the value is in MiB:

RESERVED_MEMORY = 10000

This number might be different for you; it really depends on how much overhead you need. Ideally you want it as small as possible, so that more memory is available for jobs, but if it is too small it won't prevent the problem.

When we set this up, we started with a larger best guess (20GB) and saw that, with the node full of jobs, we usually still had a bit more than 10GB free, so we shrank the number by 10GB and left it there.
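
The arithmetic behind that adjustment can be sketched as follows. This is only an illustration: the function name and the 10500 / 500 MiB figures are assumed stand-ins for "a bit more than 10GB free", not values we measured.

```python
def tuned_reserved_mib(initial_reserved_mib, free_when_full_mib, headroom_mib=0):
    """Estimate a smaller RESERVED_MEMORY value from an over-generous first guess.

    initial_reserved_mib: the first (too large) reservation that was configured.
    free_when_full_mib:   memory still free on the node with all slots busy.
    headroom_mib:         extra margin to keep on top of the observed overhead.
    """
    # Part of the reservation the system really used while the node was full.
    observed_overhead_mib = initial_reserved_mib - free_when_full_mib
    # Never return a negative reservation.
    return max(observed_overhead_mib + headroom_mib, 0)


# Hypothetical numbers: 20000 MiB reserved, about 10500 MiB still free when full.
print("RESERVED_MEMORY =", tuned_reserved_mib(20000, 10500, headroom_mib=500))
```

With these assumed inputs the result is 10000, the value shown above. It is worth re-checking the free memory after the change, since the overhead can vary with the job mix.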



Michal


On 13/05/2026 22:07, Beyer, Christoph wrote:
Hi,

just my 2 cents: if you don't want memory limits to be enforced, you can set

CGROUP_MEMORY_LIMIT_POLICY = none

in the execution points' config ...

Best
christoph


--
Christoph Beyer
DESY Hamburg
IT-Department

Notkestr. 85
Building 02b, Room 009
22607 Hamburg

phone:+49-(0)40-8998-2317
mail: christoph.beyer@xxxxxxx


From: "Umut Türk" <umut1656@xxxxxxxxx>
To: "htcondor-users" <htcondor-users@xxxxxxxxxxx>
Sent: Wednesday, 13 May 2026 20:40:07
Subject: Re: [HTCondor-users] Condor Memory Error


Hello HTCondor Team,

I have a question about unexpected memory-limit enforcement on some jobs running under HTCondor 24.0.3.

HTCondor version and execute-node platform:
`$CondorVersion: 24.0.3 2025-01-03 BuildID: 777902 PackageID: 24.0.3-1 GitSHA: ef02b46e $`
`$CondorPlatform: x86_64_AlmaLinux9 $`

The jobs fail with the following message:

“Job has gone over cgroup memory limit of 9222 megabytes. Last measured usage: 1872 megabytes. Consider resubmitting with a higher request_memory.”

The main concern is the mismatch between the enforced cgroup memory limit and the last measured usage reported by HTCondor.

I found what appears to be a related historical issue that got fixed:

Relevant HTCondor configuration on our side includes:

* `CGROUP_MEMORY_LIMIT_POLICY = hard`
* `CGROUP_POLLING_INTERVAL = 1`

To investigate whether these were genuine OOM events, we checked kernel messages on the execute nodes. We did not find corresponding OOM killer messages in the kernel logs. We also verified separately that real OOM events do appear in kernel logs on these nodes, so the lack of such messages here makes the failure mode unclear.

At the moment, we cannot determine why HTCondor reports that the job exceeded the cgroup memory limit when the last measured usage is substantially lower, and we do not see node-level logs indicating an OOM event.

I tried using HOOK_JOB_EXIT to get a snapshot of the cgroup information in /sys/fs/cgroup/.../[dedicated-cgroup folder for the job], but this folder is destroyed by the time HOOK_JOB_EXIT runs.
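
For reference, a minimal sketch of the kind of snapshot meant here, assuming cgroup v2, where each cgroup directory has a `memory.events` file of `key value` counters (including `oom_kill`). The function name is hypothetical and the sample text is invented; on a real node the text would have to be read from the job's cgroup directory while the job is still running, since the directory is gone afterwards.

```python
def parse_memory_events(text):
    """Parse the contents of a cgroup v2 memory.events file into a dict.

    Each line has the form "<counter> <integer>", e.g. "oom_kill 1".
    Lines that do not match that shape are ignored.
    """
    events = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            events[parts[0]] = int(parts[1])
    return events


# Invented sample content, in the format the kernel uses for memory.events.
sample = "low 0\nhigh 0\nmax 37\noom 1\noom_kill 1\n"
counters = parse_memory_events(sample)
if counters.get("oom_kill", 0) > 0:
    print("cgroup OOM killer fired", counters["oom_kill"], "time(s)")
```

A non-zero `oom_kill` counter in the job's own cgroup would point to the cgroup limit rather than a node-wide OOM event.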

Could you help clarify the following?

1. What memory metric is used to decide that a job has exceeded the cgroup memory limit in this case?
2. Can that metric differ significantly from the “Last measured usage” reported in the hold message?
3. Are there additional starter, startd, or cgroup-related logs or debug settings you would recommend collecting to diagnose this further?
4. Does this sound like a known issue in 24.0.3, or possibly related to the older issue referenced above?

Thank you for any guidance.

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe

The archives can be found at: https://www-auth.cs.wisc.edu/lists/htcondor-users/
