Hi Christoph!
When you say "memory used" here, how are you measuring it?
As I suggested in the parallel email, it is surprisingly difficult to nail down a definition of 'memory used', as there are decisions to make about what memory to include or exclude.
Brian
On Dec 2, 2024, at 4:45 AM, Beyer, Christoph <christoph.beyer@xxxxxxx> wrote:
Hi,
Is this confirmed?
It would explain the big discrepancy between actual memory usage and what condor records (see below).
What is the way to probe for peak memory usage, and how are users supposed to train their memory usage if the peak usage is no longer in the history of the job?
memory used = 5632 according to condor = 2686
memory used = 4608 according to condor = 171
memory used = 2048 according to condor = 293
memory used = 512 according to condor = 220
memory used = 8192 according to condor = 7325
memory used = 5632 according to condor = 1465
memory used = 4096 according to condor = 2198
memory used = 7168 according to condor = 7325
memory used = 4096 according to condor = 733
memory used = 4608 according to condor = 733
memory used = 7168 according to condor = 733
memory used = 7168 according to condor = 7325
memory used = 4608 according to condor = 977
memory used = 4608 according to condor = 3663
memory used = 9216 according to condor = 3907
memory used = 8704 according to condor = 9766
memory used = 7168 according to condor = 3418
memory used = 9216 according to condor = 9766
memory used = 8192 according to condor = 7325
memory used = 1024 according to condor = 342
memory used = 6144 according to condor = 7325
memory used = 7680 according to condor = 7325
memory used = 9216 according to condor = 2930
memory used = 2048 according to condor = 464
memory used = 9728 according to condor = 9766
memory used = 4608 according to condor = 3663
memory used = 6656 according to condor = 3418
memory used = 3584 according to condor = 1465
memory used = 8192 according to condor = 2442
memory used = 7168 according to condor = 4395
memory used = 5120 according to condor = 2198
memory used = 6144 according to condor = 7325
memory used = 7680 according to condor = 3174
memory used = 4608 according to condor = 1954
memory used = 7168 according to condor = 1954
memory used = 7680 according to condor = 3174
memory used = 2560 according to condor = 2198
memory used = 3584 according to condor = 1954
memory used = 4608 according to condor = 3663
memory used = 4608 according to condor = 3663
memory used = 4096 according to condor = 733
memory used = 3584 according to condor = 3174
memory used = 1536 according to condor = 1465
memory used = 1536 according to condor = 147
memory used = 7168 according to condor = 977
memory used = 3072 according to condor = 489
memory used = 3584 according to condor = 416
memory used = 3072 according to condor = 977
memory used = 7680 according to condor = 391
memory used = 2560 according to condor = 1709
memory used = 2560 according to condor = 1465
memory used = 9216 according to condor = 3663
memory used = 8704 according to condor = 7325
memory used = 7680 according to condor = 1465
memory used = 3584 according to condor = 98
memory used = 8704 according to condor = 2442
memory used = 4096 according to condor = 2930
memory used = 3584 according to condor = 2930
memory used = 5120 according to condor = 4395
memory used = 8192 according to condor = 1221
memory used = 5632 according to condor = 4883
memory used = 512 according to condor = 98
memory used = 512 according to condor = 440
memory used = 10240 according to condor = 977
memory used = 6144 according to condor = 7325
memory used = 6144 according to condor = 3418
memory used = 4608 according to condor = 977
memory used = 2560 according to condor = 1465
memory used = 2560 according to condor = 74
memory used = 5632 according to condor = 269
memory used = 8704 according to condor = 1954
memory used = 3584 according to condor = 977
memory used = 9728 according to condor = 4395
memory used = 4096 according to condor = 293
memory used = 6656 according to condor = 1221
memory used = 2560 according to condor = 1465
memory used = 5120 according to condor = 3174
memory used = 6144 according to condor = 4883
memory used = 5632 according to condor = 7325
memory used = 8192 according to condor = 4639
memory used = 8192 according to condor = 8
memory used = 6656 according to condor = 7325
memory used = 6144 according to condor = 2198
memory used = 6144 according to condor = 733
memory used = 6656 according to condor = 2
memory used = 3072 according to condor = 977
memory used = 8192 according to condor = 4395
memory used = 10240 according to condor = 7325
memory used = 7680 according to condor = 7325
memory used = 5120 according to condor = 3174
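A minimal sketch of how a listing like the one above could be summarized, assuming both numbers on each line are in the same unit (the message does not state it). The helper names and the factor of 2 are hypothetical choices for illustration, not part of any HTCondor tool.

```python
import re

# Matches lines of the form used in the listing above.
LINE_RE = re.compile(r"memory used = (\d+) according to condor = (\d+)")


def parse_pairs(text):
    """Return (measured, condor_reported) integer pairs found in text."""
    return [(int(a), int(b)) for a, b in LINE_RE.findall(text)]


def summarize(pairs, factor=2.0):
    """Count jobs whose measured value exceeds the condor value by more than `factor`."""
    under = [(m, c) for m, c in pairs if m > factor * c]
    return {"jobs": len(pairs), "underreported": len(under)}


# First few lines of the listing, copied verbatim.
sample = """\
memory used = 5632 according to condor = 2686
memory used = 4608 according to condor = 171
memory used = 8192 according to condor = 7325
memory used = 8704 according to condor = 9766
"""

print(summarize(parse_pairs(sample)))
```

Feeding the full listing through parse_pairs would show how many jobs fall outside whatever tolerance a site considers acceptable.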
--
Christoph Beyer
DESY Hamburg
IT-Department
Notkestr. 85
Building 02b, Room 009
22607 Hamburg
phone:+49-(0)40-8998-2317
mail: christoph.beyer@xxxxxxx
From: "Petr Vokac" <petr.vokac@xxxxxxx>
To: "HTCondor-Users Mail List" <htcondor-users@xxxxxxxxxxx>, "Carles Acosta" <cacosta@xxxxxx>
Sent: Sunday, December 1, 2024 12:24:34
Subject: Re: [HTCondor-users] More thoughts on memory limits
Since 23.10.1 HTCondor no longer uses memory.peak but memory.current instead:
https://htcondor.readthedocs.io/en/latest/version-history/feature-versions-23-x.html#version-23-10-1
If I understand this update correctly: in the past MemoryUsage reported the maximum memory used, and since 23.10.1 this ClassAd attribute contains the last known memory usage value. So it no longer makes much sense to look at this value for finished jobs.
Still, if your schedd is "lucky" when evaluating your memory policy expression, then memory.current can be at the "peak" memory of the currently running job (so I guess if a job consumes a huge amount of memory even for a fraction of a second, there is still a non-zero probability that this expression evaluates to true).
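To illustrate the point about sampling, here is a small sketch with a made-up usage trace: a value that is only read at polling intervals can miss a short spike entirely. The trace, the polling interval, and the function name are assumptions for illustration and do not reflect HTCondor's actual polling schedule.

```python
def sampled_vs_peak(usage_mb, poll_every):
    """Compare what periodic polling sees with the true peak.

    usage_mb   -- memory usage at each time step (hypothetical trace)
    poll_every -- number of time steps between two polls
    """
    samples = usage_mb[::poll_every]
    return {
        "last_sample": samples[-1],    # what a "last known value" would report
        "max_sample": max(samples),    # highest value any poll observed
        "true_peak": max(usage_mb),    # what a peak counter would report
    }


# A job that briefly spikes to 5000 MB between two polls.
trace = [100, 120, 5000, 130, 110, 90, 95]
result = sampled_vs_peak(trace, poll_every=3)
print(result)
```

If a poll happens to land on the spike, the policy expression sees it; otherwise neither the last sample nor the largest sample reflects the peak.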
btw: I find it questionable to re-use an existing cgroup slot from previous jobs with stuck processes for a new HTCondor job, and I hope the developers come up with a cleaner solution in the future...
Petr
On 11/21/24 10:11, Carles Acosta wrote:
Dear all,
We are running version 23.10.1 on all our EPs. We took the opportunity to add a memory limit again:
CGROUP_IGNORE_CACHE_MEMORY = True
MEMORY_EXCEEDED = (MemoryUsage isnt undefined && MemoryUsage > Memory*3)
use POLICY : WANT_HOLD_IF(MEMORY_EXCEEDED, 102, peak memory usage exceeded requested memory by 3 times)
The limit is generous, 3 times, because we first want to test how this evolves.
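As a rough illustration of what the MEMORY_EXCEEDED test above checks, here is a Python sketch of the same comparison. It only mirrors the logic (an undefined MemoryUsage never triggers the hold) and is not ClassAd semantics in full. The 4096 MB request in the example is a hypothetical value.

```python
def memory_exceeded(memory_usage_mb, memory_mb, factor=3):
    """Mirror of: MemoryUsage isnt undefined && MemoryUsage > Memory*3.

    memory_usage_mb -- reported usage in MB, or None when undefined
    memory_mb       -- the Memory value the usage is compared against, in MB
    """
    if memory_usage_mb is None:
        return False
    return memory_usage_mb > memory_mb * factor


# Hypothetical job with Memory = 4096 MB.
print(memory_exceeded(14 * 1024, 4096))  # usage of 14 GB
print(memory_exceeded(4 * 1024, 4096))   # usage of 4 GB
print(memory_exceeded(None, 4096))       # usage not yet reported
```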
After 3 weeks, it is clear that we do not have the huge overestimation of memory usage we saw in the past. However, it seems that the MEMORY_EXCEEDED expression is generating some false positives. For instance, the same job was submitted twice: the first time it showed a memory usage of 14 GB, and the second time a regular memory usage of 4 GB. I understand that this is the cgroups memory.peak, right? For CentOS 7 or cgroups v1, was the same max value considered (memory.max_usage_in_bytes) or the current one (memory.usage_in_bytes)?
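For comparing the counters named above by hand, a minimal sketch that reads the current and peak values from a cgroup directory. The file names are the standard cgroup v1 and v2 ones. The directory of a given job's cgroup is site-specific and has to be supplied by the caller, and memory.peak is only present on kernels that provide it. This is not how HTCondor itself gathers the values.

```python
from pathlib import Path

# (current, peak) counter file names per cgroup version.
CGROUP_FILES = {
    "v2": ("memory.current", "memory.peak"),
    "v1": ("memory.usage_in_bytes", "memory.max_usage_in_bytes"),
}


def read_memory_counters(cgroup_dir):
    """Return current and peak memory (bytes) for the given cgroup directory."""
    base = Path(cgroup_dir)
    for version, (current_name, peak_name) in CGROUP_FILES.items():
        current_path = base / current_name
        if not current_path.exists():
            continue
        peak_path = base / peak_name
        return {
            "version": version,
            "current_bytes": int(current_path.read_text()),
            # The peak file may be missing on older kernels.
            "peak_bytes": int(peak_path.read_text()) if peak_path.exists() else None,
        }
    raise FileNotFoundError(f"no cgroup memory counters found in {base}")
```

Calling read_memory_counters on a running job's cgroup directory at intervals shows how far the current value can drift from the peak.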
Does any other site use a limit like this? What is your experience?
Best regards,
Carles
--
Carles Acosta i Silva
PIC (Port d'Informació Científica)
Campus UAB, Edifici D
E-08193 Bellaterra, Barcelona
Tel: +34 93 581 33 08
Fax: +34 93 581 41 10
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
The archives can be found at: https://www-auth.cs.wisc.edu/lists/htcondor-users/