
[HTCondor-users] Zero MemoryUsage after update to 10.x



We recently updated the HTCondor version in our compute cluster from 9.12.0 to 10.2.0. I took the opportunity to enable cgroup memory and CPU limits, and to install and enable Docker on all machines.

We set CGROUP_MEMORY_LIMIT_POLICY=hard and tested that it worked correctly, using the stress-ng utility to fill up physical memory and use CPU resources. We have also run a fairly large batch of real jobs successfully and can confirm that the memory limits are being enforced correctly.

However, we have noticed that the jobs' MemoryUsage attribute is no longer being updated and always stays at 0.

...
006 (21821.012.000) 2023-02-02 14:41:15 Image size of job updated: 22
    0  -  MemoryUsage of job (MB)
    0  -  ResidentSetSize of job (KB)
...
005 (21821.012.000) 2023-02-02 14:41:15 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:13:02, Sys 0 00:01:29  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        Usr 0 00:13:02, Sys 0 00:01:29  -  Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
    0  -  Run Bytes Sent By Job
    0  -  Run Bytes Received By Job
    0  -  Total Bytes Sent By Job
    0  -  Total Bytes Received By Job
    Partitionable Resources :       Usage  Request Allocated Assigned
       Cpus                 :        1.12        2         2
       Disk (KB)            :          22       22    860697
       Gpus (Average)       : 20525292.92        1         1 "GPU-8675f80c"
       GpusMemory (MB)      :        4044
       Memory (MB)          :           0    20480     20480

    Job terminated of its own accord at 2023-02-02T13:41:15Z with exit-code 0.
...
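
For anyone who wants to check their own job event logs for the same symptom, here is a minimal Python sketch. It assumes the 006 event layout shown in the excerpt above (a header line, indented detail lines, and a "..." separator). The function name memory_usage_updates and the SAMPLE text are illustrative only and are not part of HTCondor.

```python
import re

# Header of an "image size updated" event (code 006), as in the excerpt above.
EVENT_RE = re.compile(
    r"^006 \((\d+)\.(\d+)\.\d+\) .*Image size of job updated: (\d+)"
)
# Indented detail line such as "0  -  MemoryUsage of job (MB)".
MEMORY_RE = re.compile(r"^\s*(\d+)\s+-\s+MemoryUsage of job \(MB\)")


def memory_usage_updates(log_text):
    """Return (cluster, proc, image_size, memory_mb) for each 006 event."""
    updates = []
    current = None  # header fields of the 006 event we are inside, if any
    for line in log_text.splitlines():
        header = EVENT_RE.match(line)
        if header:
            current = tuple(int(g) for g in header.groups())
            continue
        if line.startswith("..."):
            current = None  # "..." separates events in the log
            continue
        detail = MEMORY_RE.match(line)
        if detail and current is not None:
            updates.append(current + (int(detail.group(1)),))
            current = None
    return updates


# Sample text copied from the excerpt above (whitespace normalised).
SAMPLE = """\
006 (21821.012.000) 2023-02-02 14:41:15 Image size of job updated: 22
    0  -  MemoryUsage of job (MB)
    0  -  ResidentSetSize of job (KB)
...
"""

if __name__ == "__main__":
    # List only the updates that reported zero memory usage.
    for update in memory_usage_updates(SAMPLE):
        if update[3] == 0:
            print(update)
```

Running this over a whole user log and counting the zero entries before and after the upgrade date would show whether the change lines up with the update.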

By reading old logs, we can confirm that memory usage was being reported correctly just before the update. The problem now happens on every machine in the cluster (there are 6, but the central manager does not have use role:get_htcondor_execute). They all share exactly the same configuration, except that 3 of them have GPUs enabled.

The only new configuration added with the update was the aforementioned Docker support and cgroup limits on the 5 worker nodes. I tried disabling the new configuration, but the problem persists.

While debugging, I ran the following command to inspect the test job ClassAds: condor_q -run -long | less. I was unable to find the MemoryUsage attribute there, although I'm not sure whether it should appear. I also looked at finished jobs using condor_history -constraint '(ClusterId == 21821) && (JobStatus == 4)' -af MemoryUsage, but the value is always reported as 0. The ImageSize attribute is reported correctly while the stress-ng tests are running, although it is oddly small for the real jobs (just 22).
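
The condor_history check above can be scripted roughly as follows. This is only a sketch: it assumes condor_history is on the PATH and that -af prints one value per line, with a non-numeric token such as "undefined" when the attribute is missing. The helper names parse_af_values and history_memory_usage are made up for illustration.

```python
import subprocess


def parse_af_values(output):
    """Convert `-af MemoryUsage` output (one value per line) to ints.

    Non-numeric values (assumed here to be e.g. "undefined") become None.
    """
    values = []
    for line in output.splitlines():
        token = line.strip()
        if not token:
            continue  # skip blank lines
        values.append(int(token) if token.isdigit() else None)
    return values


def history_memory_usage(cluster_id):
    """Run the condor_history query quoted above for one cluster."""
    constraint = f"(ClusterId == {cluster_id}) && (JobStatus == 4)"
    result = subprocess.run(
        ["condor_history", "-constraint", constraint, "-af", "MemoryUsage"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_af_values(result.stdout)


if __name__ == "__main__":
    # Needs a working HTCondor installation; 21821 is the cluster from the post.
    usage = history_memory_usage(21821)
    print(f"{usage.count(0)} of {len(usage)} completed jobs report MemoryUsage 0")
```

Keeping the parsing separate from the subprocess call makes it possible to test the parsing on captured output without access to the pool.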

We are using Debian 11 on all machines, with kernel version 5.10.0-21-amd64. All packages were updated at the same time as HTCondor, which was updated to the latest release available in the Debian repo at the time (10.2.0).