
Re: [HTCondor-users] More thoughts on memory limits



Hi Carles

This may be related to an issue we're seeing here with capturing resource usage. See e.g. the following:

JOB_ID    Username  Class  CMD                 Finished     CPUS  CPuse  MEMREQ   RAM       MEM       ST  CPU_TIME  LWALL_TIME  WorkerNode
946757.0  jtho      long   data-theorie-jthoe  11/21 07:19  1     0.999  32.0 GB  732.4 MB  732.4 MB  C   13:03:43  13:10:42    wn-lot-045
946741.0  jtho      long   data-theorie-jthoe  11/21 06:55  1     1.000  32.0 GB  732.4 MB  732.4 MB  C   13:03:53  13:04:36    wn-pijl-007
946581.0  jtho      long   data-theorie-jthoe  11/21 05:59  1     1.000  32.0 GB  732.4 MB  732.4 MB  C   15:59:24  15:59:40    wn-lot-002
946889.0  jtho      long   data-theorie-jthoe  11/21 05:59  1     1.000  32.0 GB  9.8 MB    9.8 MB    C   0         10:38:29    wn-lot-060
946732.0  jtho      long   data-theorie-jthoe  11/21 05:45  1     0.999  32.0 GB  732.4 MB  732.4 MB  C   12:20:45  12:21:21    wn-pijl-004
946842.0  jtho      long   data-theorie-jthoe  11/21 05:23  1     0.997  32.0 GB  1.2 GB    1.4 GB    C   10:38:52  10:41:09    wn-pijl-001
946440.0  jtho      long   data-theorie-jthoe  11/21 05:04  1     0.999  32.0 GB  1.2 GB    1.4 GB    C   17:29:34  17:30:26    wn-pijl-006

In one of these rows (job 946889.0) the CPU_TIME is zero and the memory usage is far lower than for the others (9.8 MB versus 732.4 MB or more). I've seen this with my own test jobs too: what the test jobs themselves report internally is normal usage, so HTCondor is somehow not always getting the right usage numbers.
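For anyone who wants to scan their own history for the same pattern, here is a minimal Python sketch. It assumes you have already extracted the CPU_TIME and LWALL_TIME strings in the format shown above; the helper names and the one-hour threshold are made up for illustration and are not part of any HTCondor tool.

```python
def to_seconds(hms: str) -> int:
    """Convert 'HH:MM:SS' (or a bare '0') to a number of seconds."""
    seconds = 0
    for part in hms.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def looks_unrecorded(cpu_time: str, wall_time: str, min_wall_s: int = 3600) -> bool:
    """True when a job ran for a long wall time but shows zero CPU time.

    min_wall_s is an arbitrary illustrative threshold (one hour).
    """
    return to_seconds(cpu_time) == 0 and to_seconds(wall_time) >= min_wall_s


# (job id, CPU_TIME, LWALL_TIME) copied from three rows of the listing above
jobs = [
    ("946757.0", "13:03:43", "13:10:42"),
    ("946889.0", "0", "10:38:29"),
    ("946842.0", "10:38:52", "10:41:09"),
]

suspect = [job_id for job_id, cpu, wall in jobs if looks_unrecorded(cpu, wall)]
print(suspect)
```

Only the job with zero CPU time and more than ten hours of wall time is flagged; the other rows pass.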

JT


On 21 Nov 2024, at 10:11, Carles Acosta <cacosta@xxxxxx> wrote:

Dear all,

We are running version 23.10.1 on all our EPs. We took the opportunity to reintroduce a memory limit:

CGROUP_IGNORE_CACHE_MEMORY = True
MEMORY_EXCEEDED = (MemoryUsage isnt undefined && MemoryUsage > Memory*3)
use POLICY : WANT_HOLD_IF(MEMORY_EXCEEDED, 102, peak memory usage exceeded requested memory by 3 times)

The limit is generous, 3 times, because we first want to test how this evolves.
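To make the threshold concrete, the logic of the MEMORY_EXCEEDED expression can be mirrored in a few lines of Python. This is only a sketch of the comparison, not how HTCondor evaluates ClassAds: `None` stands in for an undefined attribute, values are in MB, and the 4096 MB slot memory is a hypothetical figure chosen for illustration.

```python
def memory_exceeded(memory_usage_mb, memory_mb, factor=3):
    """Mirror of: MemoryUsage isnt undefined && MemoryUsage > Memory*3."""
    # None plays the role of a ClassAd 'undefined' value: no hold in that case.
    if memory_usage_mb is None:
        return False
    return memory_usage_mb > memory_mb * factor


slot_memory_mb = 4096  # hypothetical slot memory, not taken from a real job

print(memory_exceeded(None, slot_memory_mb))       # usage not yet reported
print(memory_exceeded(4 * 1024, slot_memory_mb))   # 4 GB reported
print(memory_exceeded(14 * 1024, slot_memory_mb))  # 14 GB reported
```

With this hypothetical slot size the hold threshold is 12288 MB, so a reported 14 GB would trigger the hold while 4 GB would not.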

After 3 weeks, it is clear that we no longer have the huge overestimation of memory usage we saw in the past. However, the MEMORY_EXCEEDED expression seems to be generating some false positives. For instance, the same job was submitted twice: the first time it showed a memory usage of 14 GB, and the second time a regular memory usage of 4 GB. I understand that this is the cgroups memory.peak, right? For CentOS 7 or cgroups v1, was the equivalent maximum value (memory.max_usage_in_bytes) used, or the current value (memory.usage_in_bytes)?
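One way to check the numbers independently of HTCondor is to read the kernel counters for a job's cgroup directly and compare peak against current usage. The Python sketch below assumes you already know the cgroup directory of the job (that path is site-specific and not shown here); it says nothing about which counter HTCondor itself reads. The file names are the standard kernel ones, and memory.peak may be absent on older kernels. The demo uses a temporary directory with made-up values so it runs anywhere.

```python
import tempfile
from pathlib import Path

# Kernel-provided counters; which ones exist depends on the cgroup version.
COUNTERS = {
    "v2": {"peak": "memory.peak", "current": "memory.current"},
    "v1": {"peak": "memory.max_usage_in_bytes", "current": "memory.usage_in_bytes"},
}


def read_memory_counters(cgroup_dir):
    """Return {'version', 'peak', 'current'} in bytes; a missing file gives None."""
    base = Path(cgroup_dir)
    # memory.current only exists under cgroups v2.
    version = "v2" if (base / "memory.current").exists() else "v1"
    result = {"version": version}
    for key, name in COUNTERS[version].items():
        path = base / name
        result[key] = int(path.read_text().strip()) if path.exists() else None
    return result


# Demo against a fake cgroup v2 directory (values are invented: 4 GiB and 14 GiB).
with tempfile.TemporaryDirectory() as fake_cgroup:
    Path(fake_cgroup, "memory.current").write_text("4294967296\n")
    Path(fake_cgroup, "memory.peak").write_text("15032385536\n")
    demo = read_memory_counters(fake_cgroup)

print(demo)
```

On a real EP you would pass the job's cgroup directory instead of the temporary one; a large gap between peak and current would suggest a short-lived spike rather than sustained usage.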

Does any other site use a limit like this? What is your experience?

Best regards,

Carles

--
Carles Acosta i Silva
PIC (Port d'Informació Científica)
Campus UAB, Edifici D
E-08193 Bellaterra, Barcelona
Tel: +34 93 581 33 08
Fax: +34 93 581 41 10
Avís - Aviso - Legal Notice:  http://legal.ifae.es