
Re: [HTCondor-users] More thoughts on memory limits



Hi Jeff,

Thank you for your reply. For now, we are only worried about memory usage. We have not noticed any issue with CPU or other usage numbers while the jobs are running, but we are going to check those as well.

Cheers,

Carles


On Thu, 21 Nov 2024 at 10:58, Jeff Templon <templon@xxxxxxxxx> wrote:
Hi Carles

This may be related to an issue we're seeing here with capturing resource usage. See e.g. the following:

JOB_ID    Username  Class  CMD                 Finished     CPUS  CPuse  MEMREQ   RAM       MEM       ST  CPU_TIME  LWALL_TIME  WorkerNode
946757.0  jtho      long   data-theorie-jthoe  11/21 07:19  1     0.999  32.0 GB  732.4 MB  732.4 MB  C   13:03:43  13:10:42    wn-lot-045
946741.0  jtho      long   data-theorie-jthoe  11/21 06:55  1     1.000  32.0 GB  732.4 MB  732.4 MB  C   13:03:53  13:04:36    wn-pijl-007
946581.0  jtho      long   data-theorie-jthoe  11/21 05:59  1     1.000  32.0 GB  732.4 MB  732.4 MB  C   15:59:24  15:59:40    wn-lot-002
946889.0  jtho      long   data-theorie-jthoe  11/21 05:59  1     1.000  32.0 GB  9.8 MB    9.8 MB    C   0         10:38:29    wn-lot-060
946732.0  jtho      long   data-theorie-jthoe  11/21 05:45  1     0.999  32.0 GB  732.4 MB  732.4 MB  C   12:20:45  12:21:21    wn-pijl-004
946842.0  jtho      long   data-theorie-jthoe  11/21 05:23  1     0.997  32.0 GB  1.2 GB    1.4 GB    C   10:38:52  10:41:09    wn-pijl-001
946440.0  jtho      long   data-theorie-jthoe  11/21 05:04  1     0.999  32.0 GB  1.2 GB    1.4 GB    C   17:29:34  17:30:26    wn-pijl-006

You can see that for one of these lines (job 946889.0), the CPU_TIME is zero and the memory usage is significantly lower (9.8 MB against 732.4 MB or more for the others). I've seen this with my own test jobs, and looking at what the test jobs themselves (internally) report, they have the normal usage - HTCondor is somehow not always getting the right usage numbers.
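A rough way to spot such records automatically is to flag jobs that report zero CPU time despite a long wall time. The Python sketch below is only an illustration: the function names and the 10-minute threshold are assumptions, and the sample rows are copied from the listing above.

```python
from typing import Iterable, List, Tuple


def to_seconds(text: str) -> int:
    """Convert 'HH:MM:SS' (or a bare '0') to seconds.

    Assumption: no day prefix such as '1+02:03:04' appears in the input.
    """
    total = 0
    for part in text.strip().split(":"):
        total = total * 60 + int(part)
    return total


def suspicious_jobs(rows: Iterable[Tuple[str, str, str]],
                    min_wall_seconds: int = 600) -> List[str]:
    """Return job IDs with zero CPU time but at least min_wall_seconds of wall time.

    Each row is (job_id, cpu_time, wall_time). The threshold is an arbitrary
    assumption chosen to skip jobs that ended almost immediately.
    """
    flagged = []
    for job_id, cpu_time, wall_time in rows:
        if to_seconds(cpu_time) == 0 and to_seconds(wall_time) >= min_wall_seconds:
            flagged.append(job_id)
    return flagged


# (job_id, CPU_TIME, wall time) taken from the listing above
ROWS = [
    ("946757.0", "13:03:43", "13:10:42"),
    ("946741.0", "13:03:53", "13:04:36"),
    ("946581.0", "15:59:24", "15:59:40"),
    ("946889.0", "0", "10:38:29"),
    ("946732.0", "12:20:45", "12:21:21"),
    ("946842.0", "10:38:52", "10:41:09"),
    ("946440.0", "17:29:34", "17:30:26"),
]

if __name__ == "__main__":
    print(suspicious_jobs(ROWS))
```

A flagged job is only a candidate for a missed measurement; a job that really sleeps or blocks for hours would look the same.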

JT


On 21 Nov 2024, at 10:11, Carles Acosta <cacosta@xxxxxx> wrote:

Dear all,

We are running version 23.10.1 on all our EPs. We took the opportunity to add a memory limit again:

CGROUP_IGNORE_CACHE_MEMORY = True
MEMORY_EXCEEDED = (MemoryUsage isnt undefined && MemoryUsage > Memory*3)
use POLICY : WANT_HOLD_IF(MEMORY_EXCEEDED, 102, peak memory usage exceeded requested memory by 3 times)

The limit is generous, 3 times, because we first want to test how this evolves.
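To make the intent of MEMORY_EXCEEDED explicit, here is a minimal Python sketch of the same comparison. It is not HTCondor code: the function name is hypothetical, `None` stands in for the ClassAd value `undefined`, and the sample numbers are made up for illustration.

```python
from typing import Optional


def memory_exceeded(memory_usage_mb: Optional[float],
                    memory_mb: float,
                    factor: float = 3.0) -> bool:
    """Mirror of: MemoryUsage isnt undefined && MemoryUsage > Memory*3.

    memory_usage_mb: reported usage in MB, or None when not yet defined.
    memory_mb:       provisioned memory in MB.
    factor:          allowed multiple of the provisioned memory (3 in the policy above).
    """
    if memory_usage_mb is None:
        # No measurement yet, so the job is never held on this basis.
        return False
    return memory_usage_mb > memory_mb * factor


if __name__ == "__main__":
    # Hypothetical job provisioned with 4096 MB: the hold threshold is 12288 MB.
    print(memory_exceeded(None, 4096))
    print(memory_exceeded(4096, 4096))
    print(memory_exceeded(14336, 4096))
```

Because the comparison is strict, a job sitting exactly at 3 times its memory is not held.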

After 3 weeks, it is clear that we no longer have the huge overestimation of memory usage we saw in the past. However, the MEMORY_EXCEEDED expression seems to be generating some false positives. For instance, the same job was submitted twice: the first run shows a memory usage of 14 GB, and the second shows a regular memory usage of 4 GB. I understand that this value is the cgroups memory.peak, right? For CentOS 7 or cgroups v1, was the equivalent maximum value (memory.max_usage_in_bytes) used, or the current value (memory.usage_in_bytes)?
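For anyone who wants to compare the peak and current counters directly on a worker node, the Python sketch below reads whichever of the files named above exist in a given cgroup directory. It says nothing about which counter HTCondor itself reports; the function name is hypothetical, and the cgroup path of a job has to be supplied by the caller.

```python
from pathlib import Path
from typing import Dict, Optional

# Kernel file names: cgroup v2 first, then the cgroup v1 equivalent.
_CANDIDATES = {
    "peak": ("memory.peak", "memory.max_usage_in_bytes"),
    "current": ("memory.current", "memory.usage_in_bytes"),
}


def read_cgroup_memory(cgroup_dir: str) -> Dict[str, Optional[int]]:
    """Return peak and current memory in bytes for one cgroup directory.

    A value is None when neither the v2 nor the v1 file is present.
    """
    base = Path(cgroup_dir)
    result: Dict[str, Optional[int]] = {}
    for key, names in _CANDIDATES.items():
        result[key] = None
        for name in names:
            counter_file = base / name
            if counter_file.is_file():
                result[key] = int(counter_file.read_text().strip())
                break
    return result
```

Sampling this a few times during a job's lifetime shows whether a large peak is a short spike or sustained usage, which helps judge whether a hold was a false positive.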

Does any other site use a limit like this? What is your experience?

Best regards,

Carles

--
Carles Acosta i Silva
PIC (Port d'Informació Científica)
Campus UAB, Edifici D
E-08193 Bellaterra, Barcelona
Tel: +34 93 581 33 08
Fax: +34 93 581 41 10
Avís - Aviso - Legal Notice: http://legal.ifae.es
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe

The archives can be found at: https://www-auth.cs.wisc.edu/lists/htcondor-users/
