
Re: [HTCondor-users] Condor Memory Error



Thank you for responding to my messages so far. I believe it would be helpful to share a bit more about our configuration, as it might provide better insight into the error we are facing.

At first, we also thought this was a condor sampling issue. It is possible for condor to report low memory usage at t=0 when the job runs out of memory at t=4. We reduced the sampling interval to 1 second and continued to observe the same error. I agree, it is possible for the job to request more memory within that second, but we did not see a reduction in the number of jobs failing this way.

As for the fix, we recommend submitting jobs with slightly more memory to alleviate this error. Since we can't reproduce the error on demand, we cannot say conclusively whether it actually helps.

We also wanted to make sure there was enough system memory while jobs are running. On average, we have 1-2TB of memory on our worker machines, and reserve 12GB of memory for the system. While these OOM kills are happening, we see 100+ GB of free memory on our systems.

What makes it even more confusing is that while Condor kills our jobs, we are not seeing OOM logs for these kills in dmesg (we made sure that our OOM test cases get logged to dmesg, so there is no problem on that end).

We also set CGROUP_IGNORE_CACHE_MEMORY to false in case it was a high caching issue, but we did not see any difference.

If anyone has any ideas on why this error is still happening, I would appreciate hearing their thoughts.

-Umut



On Thu, May 14, 2026, 03:17 svatosm@xxxxxx <svatosm@xxxxxx> wrote:
Hi,
I have observed this issue recently and asked about it:

https://www-auth.cs.wisc.edu/lists/htcondor-users/2026-February/msg00067.shtml

I went to a meeting with devs and their summary (by Tom Smith) is this:


It's not completely unusual to see the message that "Job has gone over cgroup memory limit of X. Last measured usage is Y", where the value of Y is much less than X

What can happen is condor is tracking the job's memory usage every 5 seconds, so in that time it is possible to use enough memory that it exceeds the limit and the cgroup kills it the moment it goes a byte over.

If jobs consume memory very slowly the displayed usage value will be pretty close to the limit, but if the memory explodes it can be quite different (many GB)
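The sampling gap described above can be illustrated with a toy simulation. This is only a sketch, not HTCondor code: the function name `simulate_poll`, the two memory curves, and the growth rates are made up for illustration, while the 9222 MB limit and 1872 MB figure are taken from the hold message in this thread.

```python
def simulate_poll(usage_mb_at, limit_mb, poll_interval_ms, step_ms=100, max_ms=600_000):
    """Return (kill_time_s, last_polled_usage_mb) for a hypothetical job.

    usage_mb_at(t_s) gives the job's memory in MB at time t_s (seconds).
    The "kernel" checks the limit every step_ms; the "poller" only
    records usage every poll_interval_ms.
    """
    last_polled_mb = None
    for t_ms in range(0, max_ms + 1, step_ms):
        usage_mb = usage_mb_at(t_ms / 1000.0)
        if usage_mb > limit_mb:
            # The limit is enforced immediately, whatever the poller last saw.
            return t_ms / 1000.0, last_polled_mb
        if t_ms % poll_interval_ms == 0:
            last_polled_mb = usage_mb
    return None, last_polled_mb


def burst(t_s):
    # Flat at 1872 MB, then grows by an assumed 10000 MB/s after t = 12 s.
    return 1872 if t_s <= 12.0 else 1872 + 10000 * (t_s - 12.0)


def slow(t_s):
    # Grows steadily by an assumed 100 MB/s from zero.
    return 100.0 * t_s


print(simulate_poll(burst, 9222, 5000))  # (12.8, 1872)
print(simulate_poll(burst, 9222, 1000))  # (12.8, 1872)
print(simulate_poll(slow, 9222, 5000))   # last polled value is 9000.0
```

In this toy model the slowly growing job is killed with a last sample close to the limit, while the bursting job reports 1872 MB even with 1-second polling, because the whole burst fits between two samples.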

In this case the "fix" is to allocate more memory for the job, or perhaps investigate why the job's executable is using more memory than you think it should be using (and very quickly)

The second case we talked about is the case where the worker node running the jobs ran out of memory, and the system itself started to OOM kill things. Condor jobs are pretty attractive for the OOM killer so they are usually the first to go.

In this case you would need to see how much memory your system is using as overhead (memory that should not be considered for jobs to use).

At our site we ran into this issue when running memory heavy jobs. Even though the jobs fit within the memory allocated, condor ended up allocating ALL of the memory it detected. We set aside 10GB that condor is not allowed to advertise for jobs, to allow some space for our system and things like GPFS filesystem which consumes memory. We set this on our Execution Points (workers), the value is in MiB:

RESERVED_MEMORY = 10000

This number might be different for you, it really depends on how much you need as overhead. Ideally you want this number as small as possible, so more memory is available for jobs, but making it too small won't help at all.

When we set this up, we picked a larger best guess (20GB) and saw that when full, we usually had a bit more than 10GB free, so we shrunk the number by 10GB and left it there
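The tuning procedure above comes down to simple arithmetic, sketched here under the assumption that the memory advertised for jobs is the detected memory minus RESERVED_MEMORY (both in MiB). The function names and the `margin_mib` parameter are hypothetical helpers, not part of HTCondor.

```python
def advertised_memory_mib(detected_mib, reserved_mib):
    """Memory left for jobs, assuming advertised = detected - reserved."""
    return max(detected_mib - reserved_mib, 0)


def shrink_reserved_mib(current_reserved_mib, free_when_full_mib, margin_mib=0):
    """Suggest a smaller reserve after observing a fully loaded node.

    free_when_full_mib is the memory still free while every slot is busy;
    that much of the current reserve was not needed as overhead.
    margin_mib is an optional extra cushion (hypothetical parameter).
    """
    return max(current_reserved_mib - free_when_full_mib + margin_mib, 0)


# Figures from the message above: a 20 GB first guess, about 10 GB free when full.
print(shrink_reserved_mib(20_000, 10_000))          # 10000
# A hypothetical worker with 1,000,000 MiB detected and a 10000 MiB reserve.
print(advertised_memory_mib(1_000_000, 10_000))     # 990000
```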



Michal


On 13/05/2026 22:07, Beyer, Christoph wrote:
Hi,

Just my 2 cents: if you don't want memory limits to be enforced, you can set

CGROUP_MEMORY_LIMIT_POLICY = none

In the execution points config ...

Best
christoph


--
Christoph Beyer
DESY Hamburg
IT-Department

Notkestr. 85
Building 02b, Room 009
22607 Hamburg

phone:+49-(0)40-8998-2317
mail: christoph.beyer@xxxxxxx


From: "Umut Türk" <umut1656@xxxxxxxxx>
To: "htcondor-users" <htcondor-users@xxxxxxxxxxx>
Sent: Wednesday, 13 May 2026 20:40:07
Subject: Re: [HTCondor-users] Condor Memory Error


Hello HTCondor Team,

I have a question about unexpected memory-limit enforcement on some jobs running under HTCondor 24.0.3.

HTCondor version and execute-node platform:
`$CondorVersion: 24.0.3 2025-01-03 BuildID: 777902 PackageID: 24.0.3-1 GitSHA: ef02b46e $`
`$CondorPlatform: x86_64_AlmaLinux9 $`

The jobs fail with the following message:

"Job has gone over cgroup memory limit of 9222 megabytes. Last measured usage: 1872 megabytes. Consider resubmitting with a higher request_memory."

The main concern is the mismatch between the enforced cgroup memory limit and the last measured usage reported by HTCondor.
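To quantify that mismatch across many held jobs, the hold message can be parsed mechanically. The following is a minimal sketch that assumes the message wording quoted in this thread; the helper name `parse_hold_reason` and the returned field names are made up for illustration, and how the hold reason text is collected is left out.

```python
import re

# Matches the wording quoted in this thread ("Last measured usage: N megabytes"),
# and also the "Last measured usage is N" variant, in case the wording differs.
HOLD_RE = re.compile(
    r"cgroup memory limit of (?P<limit>\d+) megabytes\.\s+"
    r"Last measured usage(?::| is)\s+(?P<usage>\d+) megabytes"
)


def parse_hold_reason(reason):
    """Return limit, last usage, and the gap between them, or None if no match."""
    match = HOLD_RE.search(reason)
    if match is None:
        return None
    limit_mb = int(match.group("limit"))
    usage_mb = int(match.group("usage"))
    return {
        "limit_mb": limit_mb,
        "usage_mb": usage_mb,
        "gap_mb": limit_mb - usage_mb,
        "usage_fraction": usage_mb / limit_mb,
    }


reason = (
    "Job has gone over cgroup memory limit of 9222 megabytes. "
    "Last measured usage: 1872 megabytes. "
    "Consider resubmitting with a higher request_memory."
)
print(parse_hold_reason(reason)["gap_mb"])  # 7350
```

A large `gap_mb` on most failed jobs would point toward very fast allocation or a metric mismatch rather than slow growth toward the limit.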

I found what appears to be a related historical issue that got fixed:

Relevant HTCondor configuration on our side includes:

* `CGROUP_MEMORY_LIMIT_POLICY = hard`
* `CGROUP_POLLING_INTERVAL = 1`

To investigate whether these were genuine OOM events, we checked kernel messages on the execute nodes. We did not find corresponding OOM killer messages in the kernel logs. We also verified separately that real OOM events do appear in kernel logs on these nodes, so the lack of such messages here makes the failure mode unclear.

At the moment, we cannot determine why HTCondor reports that the job exceeded the cgroup memory limit when the last measured usage is substantially lower, and we do not see node-level logs indicating an OOM event.

I tried using HOOK_JOB_EXIT to get a snapshot of the cgroup information in /sys/fs/cgroup/.../[dedicated-cgroup folder for the job], but this folder is destroyed by the time HOOK_JOB_EXIT runs.
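Because the cgroup directory is gone by the time the exit hook runs, one alternative is to sample it while the job is still running. Below is a minimal sketch that assumes cgroup v2, where `memory.events` holds `key value` lines with counters such as `max` and `oom_kill`. The function names are hypothetical, the job's cgroup path must be supplied by the caller, and `memory.peak` may not exist on older kernels, so missing files are tolerated.

```python
from pathlib import Path


def parse_memory_events(text):
    """Parse cgroup v2 memory.events content ("key value" per line) into a dict."""
    events = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            events[parts[0]] = int(parts[1])
    return events


def snapshot_cgroup(cgroup_dir):
    """Read the memory files of one cgroup; missing files yield None or {}."""
    base = Path(cgroup_dir)
    snap = {}
    for name in ("memory.current", "memory.max", "memory.peak"):
        try:
            snap[name] = (base / name).read_text().strip()
        except OSError:
            snap[name] = None
    try:
        snap["memory.events"] = parse_memory_events((base / "memory.events").read_text())
    except OSError:
        snap["memory.events"] = {}
    return snap


# Made-up sample content in the memory.events format, for illustration only.
sample = "low 0\nhigh 0\nmax 17\noom 1\noom_kill 1\n"
print(parse_memory_events(sample)["oom_kill"])  # 1
```

Calling `snapshot_cgroup` on the job's cgroup directory every second or so from a wrapper or side process, and keeping the last successful snapshot, would show whether `max` or `oom_kill` increased shortly before the job was put on hold.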

Could you help clarify the following?

1. What memory metric is used to decide that a job has exceeded the cgroup memory limit in this case?
2. Can that metric differ significantly from the "Last measured usage" reported in the hold message?
3. Are there additional starter, startd, or cgroup-related logs or debug settings you would recommend collecting to diagnose this further?
4. Does this sound like a known issue in 24.0.3, or possibly related to the older issue referenced above?

Thank you for any guidance.

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe

The archives can be found at: https://www-auth.cs.wisc.edu/lists/htcondor-users/
