Hi Carles,

I personally only take RSS as a rough upper limit, since it also includes shared pages (which probably inflates the memory usage attributed to a process group, especially on larger nodes where many jobs may be using the same shared libraries). IMHO PSS is a more realistic metric, or the cgroup memory controller metrics.
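To illustrate the RSS versus PSS difference, here is a minimal Python sketch that reads both values for a single process. It assumes a Linux kernel that provides /proc/<pid>/smaps_rollup, and the helper names parse_smaps_rollup and read_rss_pss are made up for this example. It is not how HTCondor itself collects ResidentSetSize.

```python
from pathlib import Path


def parse_smaps_rollup(text):
    """Extract the Rss and Pss totals (in kB) from smaps_rollup-style text."""
    wanted = {"Rss", "Pss"}
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        # Only exact "Rss" / "Pss" keys match; Pss_Anon, Pss_File etc. are skipped.
        if sep and key in wanted:
            values[key] = int(rest.split()[0])
    return values


def read_rss_pss(pid="self"):
    """Read Rss and Pss for one process (assumes /proc/<pid>/smaps_rollup exists)."""
    return parse_smaps_rollup(Path(f"/proc/{pid}/smaps_rollup").read_text())


if __name__ == "__main__":
    # Guarded so the sketch does nothing on systems without smaps_rollup.
    if Path("/proc/self/smaps_rollup").exists():
        mem = read_rss_pss()
        print(f"Rss: {mem['Rss']} kB, Pss: {mem['Pss']} kB")
```

Pss charges each shared page to a process in proportion to the number of processes mapping it, so summing Pss over all processes of a job should not count shared libraries several times, whereas summing Rss can.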
Cheers,
Thomas

On 21/02/2024 12.10, Carles Acosta wrote:
Hi,

I just want to comment that I continue to observe the memory issue with HTCondor and EL9. I cannot rely on adding MEMORY_EXCEED conditions in HTCondor/AlmaLinux9.

Let me show you another example with two ATLAS jobs, both running a top-xaod execution.

Alma9 WN

* HTCondor version 23.0.3
* RES memory according to top: 3 GB
* ResidentSetSize according to condor:
  # condor_q 22018463.0 -af ResidentSetSize_RAW/1024 MemoryUsage RequestMemory RemoteHost
  8662 9766 3500 slot1_10@xxxxxxxxxxxx
* Cgroup memory:
  # cat /sys/fs/cgroup/htcondor/condor_home_execute_slot1_10\@td807.pic.es/memory.current
  3279331328

CentOS7 WN

* HTCondor version 9.0.17
* RES memory according to top: 3.2 GB
* ResidentSetSize according to condor:
  # condor_q 19600433.0 -af ResidentSetSize_RAW/1024 MemoryUsage RequestMemory RemoteHost
  3329 3418 3500 slot1_10@xxxxxxxxxxxxx
* Cgroup memory:
  # cat /sys/fs/cgroup/memory/system.slice/condor.service/condor_home_execute_slot1_10\@tds410.pic.es/memory.usage_in_bytes
  3496230912

So, again, on the Alma9 WN the ResidentSetSize and MemoryUsage exceed what top or the cgroup memory report shows. I do not know if we are the only ones seeing this problem...

Cheers,
Carles

On Thu, 16 Nov 2023 at 13:45, Carles Acosta <cacosta@xxxxxx> wrote:

Hi again,

Although the problem is most prominent for the LHCb experiment, I have been investigating whether it affects other projects as well, for example CMS. I checked all the CMS jobs for those whose ResidentSetSize is over 1.5 times the RequestMemory: 18 of 186 jobs. Of these 18 jobs, 17 are running on AlmaLinux 9 WNs.

One example:

[root@td810 ~]# cat /sys/fs/cgroup/htcondor/condor_home_execute_slot1_11@xxxxxxxxxxxx/memory.current
25772843008

[root@ce13 ~]# condor_q 21238903 -af ResidentSetSize ResidentSetSize_RAW
32500000 31951976

As we only put on hold the jobs that exceed 2 times the RequestMemory, the CMS jobs can still run. This is probably not a huge issue for CMS, but in general it seems that the ResidentSetSize values reported on AlmaLinux 9 WNs are higher.

Cheers,
Carles

On Fri, 10 Nov 2023 at 06:43, Carles Acosta <cacosta@xxxxxx> wrote:

Hi Greg,

The old job is finished. But for another job (exactly the same ResidentSetSize as before...):

[root@ce13 ~]# condor_q 21206191.0 -af ClusterId ProcId Owner RemoteHost ResidentSetSize/1024 ResidentSetSize_RAW/1024
21206191 0 lhpilot001 slot1_54@xxxxxxxxxxxx 12207 11011

[root@td813 ~]# cat /sys/fs/cgroup/htcondor/condor_home_execute_slot1_54\@td813.pic.es/memory.current
8253329408

For a CentOS7 job:

[root@ce13 ~]# condor_q 21205088 -af ClusterId ProcId Owner RemoteHost ResidentSetSize/1024 ResidentSetSize_RAW/1024
21205088 0 lhpilot001 slot1_33@xxxxxxxxxxxxx 7324 6608

[root@tds408 condor]# cat /sys/fs/cgroup/memory/system.slice/condor.service/condor_home_execute_slot1_33\@tds408.pic.es/memory.usage_in_bytes
7089156096

Cheers,
Carles

On Thu, 9 Nov 2023 at 19:08, Greg Thain via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:

[root@ce13 ~]# condor_q 21200475 21200476 -af ClusterId ProcId Owner RemoteHost ResidentSetSize/1024
21200475 0 lhpilot001 slot1_43@xxxxxxxxxxxx 12207
21200476 0 lhpilot001 slot1_51@xxxxxxxxxxxx 1708

Hi Carles:

I'm curious what value the cgroup is reporting for this job. Can you tell us the contents of the "memory.current" value for the cgroup that job is in?

And the errors about the missing "memory.peak" files are OK. I'll try to change the error message to indicate this better.

-greg

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/

--
Carles Acosta i Silva
PIC (Port d'Informació Científica)
Campus UAB, Edifici D
E-08193 Bellaterra, Barcelona
Tel: +34 93 581 33 08
Fax: +34 93 581 41 10
http://www.pic.es
Avís - Aviso - Legal Notice: http://legal.ifae.es
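The comparisons quoted in this thread mix units, so a small Python sketch of the arithmetic may help. It assumes that ResidentSetSize_RAW is in KiB (which is why the condor_q commands divide by 1024 to get MiB) and that memory.current and memory.usage_in_bytes are in bytes. The numbers are copied from the messages above; the names SAMPLES and ratio are made up for this example.

```python
BYTES_PER_MIB = 1024 * 1024

# (label, condor ResidentSetSize_RAW in MiB, cgroup memory in bytes),
# all values copied from the messages in this thread.
SAMPLES = [
    ("Alma9   ATLAS 22018463.0", 8662, 3279331328),
    ("CentOS7 ATLAS 19600433.0", 3329, 3496230912),
    ("Alma9   LHCb  21206191.0", 11011, 8253329408),
    ("CentOS7 LHCb  21205088.0", 6608, 7089156096),
    # The CMS value was printed in KiB, so convert it to MiB here.
    ("Alma9   CMS   21238903", 31951976 / 1024, 25772843008),
]


def ratio(rss_mib, cgroup_bytes):
    """Condor-reported RSS divided by the cgroup-reported usage (both in MiB)."""
    return rss_mib / (cgroup_bytes / BYTES_PER_MIB)


if __name__ == "__main__":
    for label, rss_mib, cgroup_bytes in SAMPLES:
        cgroup_mib = cgroup_bytes / BYTES_PER_MIB
        print(f"{label}: condor {rss_mib:9.1f} MiB, "
              f"cgroup {cgroup_mib:9.1f} MiB, "
              f"ratio {ratio(rss_mib, cgroup_bytes):.2f}")
```

A ratio close to 1 means condor and the cgroup agree; a ratio well above 1 is the discrepancy described for the AlmaLinux 9 worker nodes.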