On 3/2/23 at 10:42, Javier Barbero Gómez wrote:
We recently updated the HTCondor version in our computer cluster from 9.12.0 to 10.2.0. I took the opportunity to also enable cgroups memory and CPU limits, and to install and enable Docker on all machines.

We set CGROUP_MEMORY_LIMIT_POLICY=hard and tested that it worked correctly, using the stress-ng utility to fill up physical memory and use CPU resources. We have also run a fairly large batch of real jobs successfully and can confirm that the memory limits are being enforced correctly.

However, we have noticed that the jobs' MemoryUsage attribute is no longer being updated and always stays at 0.

...
006 (21821.012.000) 2023-02-02 14:41:15 Image size of job updated: 22
    0  -  MemoryUsage of job (MB)
    0  -  ResidentSetSize of job (KB)
...
005 (21821.012.000) 2023-02-02 14:41:15 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:13:02, Sys 0 00:01:29  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        Usr 0 00:13:02, Sys 0 00:01:29  -  Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
    0  -  Run Bytes Sent By Job
    0  -  Run Bytes Received By Job
    0  -  Total Bytes Sent By Job
    0  -  Total Bytes Received By Job
    Partitionable Resources :       Usage  Request  Allocated  Assigned
       Cpus                 :        1.12        2          2
       Disk (KB)            :          22       22     860697
       Gpus (Average)       : 20525292.92        1          1  "GPU-8675f80c"
       GpusMemory (MB)      :        4044
       Memory (MB)          :           0    20480      20480
    Job terminated of its own accord at 2023-02-02T13:41:15Z with exit-code 0.
...

We can confirm by reading old logs that memory usage was being reported correctly just before the update. This new problem happens on all machines in the cluster (there are 6, but the central manager lacks the use role:get_htcondor_execute). They all share the exact same configuration, except that 3 of them have GPUs enabled.

The only new configuration added with the update was the aforementioned Docker support and cgroups on the 5 worker nodes.
I tried disabling the new configuration, but the problem persists.

While debugging, I ran the following command to inspect the ClassAds of the test job:

    condor_q -run -long | less

I was unable to find the MemoryUsage attribute, although I'm not sure whether it should appear there. I've also looked at finished jobs using

    condor_history -constraint '(ClusterId == 21821) && (JobStatus == 4)' -af MemoryUsage

but it is always reported as 0. The ImageSize attribute is reported correctly while the stress-ng tests are running, although it is strangely small for the real jobs (just 22).

We are using Debian 11 on all machines, and all packages were updated at the same time as HTCondor (which was updated to 10.2.0, the latest release available in the Debian repo). The kernel version is 5.10.0-21-amd64.
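
For anyone who wants to check their own job event logs for the same symptom, here is a minimal sketch that scans a log for 006 (image size updated) events and pulls out the MemoryUsage and ResidentSetSize values. It assumes the event layout shown in the excerpt above (events separated by lines containing only "...", and usage lines of the form "N  -  MemoryUsage of job (MB)"). The function name parse_image_size_events and the SAMPLE text are illustrative only and not part of HTCondor.

```python
import re

# Usage lines that follow a 006 event header, as they appear in the excerpt.
MEMORY_RE = re.compile(r"^\s*(\d+)\s+-\s+MemoryUsage of job \(MB\)", re.MULTILINE)
RSS_RE = re.compile(r"^\s*(\d+)\s+-\s+ResidentSetSize of job \(KB\)", re.MULTILINE)
# Header of a 006 event: "006 (cluster.proc.subproc) ..."
HEADER_RE = re.compile(r"^006 \((\d+)\.(\d+)\.(\d+)\)")


def parse_image_size_events(log_text):
    """Return one dict per 006 event found in the given event-log text.

    Hypothetical helper: it only understands the layout assumed above.
    """
    events = []
    # Events are assumed to be separated by lines consisting only of "...".
    for block in re.split(r"^\.\.\.[ \t]*$", log_text, flags=re.MULTILINE):
        block = block.strip()
        header = HEADER_RE.match(block)
        if not header:
            continue  # not a 006 event (or an empty chunk between separators)
        mem = MEMORY_RE.search(block)
        rss = RSS_RE.search(block)
        events.append({
            "job": f"{int(header.group(1))}.{int(header.group(2))}",
            "memory_usage_mb": int(mem.group(1)) if mem else None,
            "resident_set_size_kb": int(rss.group(1)) if rss else None,
        })
    return events


# Sample text copied from the shape of the excerpt above (illustrative only).
SAMPLE = """\
...
006 (21821.012.000) 2023-02-02 14:41:15 Image size of job updated: 22
    0  -  MemoryUsage of job (MB)
    0  -  ResidentSetSize of job (KB)
...
005 (21821.012.000) 2023-02-02 14:41:15 Job terminated.
    (1) Normal termination (return value 0)
...
"""

if __name__ == "__main__":
    for event in parse_image_size_events(SAMPLE):
        if event["memory_usage_mb"] == 0:
            print(f"job {event['job']}: MemoryUsage reported as 0")
```

To run it against a real log, read the file given as "log" in the submit description and pass its contents to parse_image_size_events instead of SAMPLE.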