Mailing List Archives Authenticated access	UW Madison Computer Sciences Department Computer Systems Lab


[HTCondor-users] Zero MemoryUsage after update to 10.x

Date: Fri, 03 Feb 2023 10:42:20 +0100
From: Javier Barbero Gómez <jbarbero@xxxxxx>
Subject: [HTCondor-users] Zero MemoryUsage after update to 10.x

We recently updated the HTCondor version in our computer cluster from 9.12.0 to 10.2.0. I took the opportunity to also enable cgroups memory and CPU limits, and to install and enable Docker on all machines.

We set CGROUP_MEMORY_LIMIT_POLICY=hard and tested that it worked correctly, using the stress-ng utility to fill up physical memory and use CPU resources. We have also run a fairly large batch of real jobs with success and can confirm that the memory limits are being enforced correctly.

However, we have noticed that the jobs' MemoryUsage attribute is no longer being updated and always stays at 0.

...
006 (21821.012.000) 2023-02-02 14:41:15 Image size of job updated: 22
    0  -  MemoryUsage of job (MB)
    0  -  ResidentSetSize of job (KB)
...
005 (21821.012.000) 2023-02-02 14:41:15 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:13:02, Sys 0 00:01:29  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        Usr 0 00:13:02, Sys 0 00:01:29  -  Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
    0  -  Run Bytes Sent By Job
    0  -  Run Bytes Received By Job
    0  -  Total Bytes Sent By Job
    0  -  Total Bytes Received By Job
    Partitionable Resources :       Usage  Request Allocated Assigned
       Cpus                 :        1.12        2         2
       Disk (KB)            :          22       22    860697
       Gpus (Average)       : 20525292.92        1         1 "GPU-8675f80c"
       GpusMemory (MB)      :        4044
       Memory (MB)          :           0    20480     20480

    Job terminated of its own accord at 2023-02-02T13:41:15Z with exit-code 0.

...
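To check whether every job in a log shows the same symptom, the excerpt above could be scanned automatically. The following is a minimal sketch in Python, assuming the user log follows the layout shown in the excerpt (a three-digit event code, the job ID in parentheses, a timestamp, then indented detail lines). The function name memory_usage_updates and the SAMPLE text are illustrative only and are not part of HTCondor.

```python
import re

# Event header, e.g.:
# 006 (21821.012.000) 2023-02-02 14:41:15 Image size of job updated: 22
EVENT_RE = re.compile(r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\) (\S+ \S+) (.*)$")

# Detail line, e.g.: "    0  -  MemoryUsage of job (MB)"
MEMORY_RE = re.compile(r"^\s*(\d+)\s+-\s+MemoryUsage of job \(MB\)")


def memory_usage_updates(log_text):
    """Return (job_id, timestamp, memory_mb) for each 006 image-size event."""
    updates = []
    current = None  # (job_id, timestamp) of the 006 event being read, if any
    for line in log_text.splitlines():
        header = EVENT_RE.match(line)
        if header:
            code, cluster, proc, _subproc, timestamp, _text = header.groups()
            if code == "006":
                current = (f"{int(cluster)}.{int(proc)}", timestamp)
            else:
                current = None  # some other event; ignore its detail lines
            continue
        detail = MEMORY_RE.match(line)
        if detail and current:
            updates.append((current[0], current[1], int(detail.group(1))))
    return updates


# Hypothetical sample, copied from the excerpt in this message.
SAMPLE = """\
006 (21821.012.000) 2023-02-02 14:41:15 Image size of job updated: 22
    0  -  MemoryUsage of job (MB)
    0  -  ResidentSetSize of job (KB)
...
005 (21821.012.000) 2023-02-02 14:41:15 Job terminated.
    (1) Normal termination (return value 0)
"""

if __name__ == "__main__":
    for job_id, timestamp, memory_mb in memory_usage_updates(SAMPLE):
        flag = "ZERO" if memory_mb == 0 else "ok"
        print(job_id, timestamp, memory_mb, flag)
```

Running the same function over a log written before the update should make it easy to see when the reported values dropped to 0.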

We can confirm by reading old logs that memory usage was being reported correctly just before the update. This new problem happens on all machines in the cluster (there are 6, but the central manager lacks the use role:get_htcondor_execute). They all share the exact same configuration, except that 3 of them have GPUs enabled.

The only new configuration added with the update was the aforementioned Docker support and cgroups on the 5 worker nodes. I tried disabling the new configuration, but the problem persists.

I ran the following command while debugging to inspect the test job ClassAds: condor_q -run -long | less. I was unable to find the MemoryUsage attribute, although I'm not sure whether it should appear there. I've also looked into finished jobs using condor_history -constraint '(ClusterId == 21821) && (JobStatus == 4)' -af MemoryUsage, but it is always reported as 0. The ImageSize attribute is reported correctly while running for the stress-ng tests, although it is oddly small for the real jobs (just 22).

We are using Debian 11 on all machines, and all packages were updated at the same time as HTCondor (which was updated to the latest release available in the Debian repo, 10.2.0). The kernel version is 5.10.0-21-amd64.
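The condor_history query above could be wrapped in a small script to compare several attributes at once. This is only a sketch, assuming condor_history is on the PATH and that ResidentSetSize is present in the job ClassAds alongside MemoryUsage and ImageSize. The helper names build_history_command and query_history are hypothetical.

```python
import subprocess


def build_history_command(cluster_id,
                          attributes=("MemoryUsage", "ResidentSetSize", "ImageSize")):
    """Build the condor_history argument list for completed jobs of one cluster."""
    # Same constraint as in the message: completed jobs (JobStatus == 4).
    constraint = f"(ClusterId == {int(cluster_id)}) && (JobStatus == 4)"
    return ["condor_history", "-constraint", constraint, "-af", *attributes]


def query_history(cluster_id):
    """Run condor_history and return one list of attribute values per job."""
    result = subprocess.run(
        build_history_command(cluster_id),
        capture_output=True,
        text=True,
        check=True,  # raise if condor_history exits with an error
    )
    return [line.split() for line in result.stdout.splitlines() if line.strip()]


if __name__ == "__main__":
    # Requires a working HTCondor installation on this machine.
    for values in query_history(21821):
        print(values)
```

Passing the arguments as a list avoids shell quoting problems with the parentheses and && in the constraint.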

Follow-Ups:
- Re: [HTCondor-users] Zero MemoryUsage after update to 10.x
  - From: Greg Thain
- Re: [HTCondor-users] Zero MemoryUsage after update to 10.x
  - From: Javier Barbero Gómez

