
[HTCondor-users] Job log not reporting Memory usage



Hi,
Back in HTCondor version 10.9.0 we noticed that some jobs neither report memory usage in the job summary nor update ResidentSetSize while running.
Version 10.9 was installed only for a short period, so we hoped the problem would go away after the migration to EL9 and HTCondor 23.0, but it didn't.
We have a rule in our scheduler that holds jobs that don't make proper use of their requested memory (less than 20% after one hour of running, later relaxed to 10% after two hours), so many jobs get held unjustifiably.
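To make the rule concrete, here is a minimal Python sketch of the check as described above. It is an illustration only, not the expression configured on the scheduler; the function name and its parameters are made up for the example, and the 24576 MB request is simply the value of one affected job.

```python
def is_wasting_memory(memory_usage_mb, request_memory_mb, runtime_s,
                      min_fraction=0.20, grace_s=3600):
    """Return True if the (hypothetical) hold rule would fire for a job.

    memory_usage_mb   -- memory usage reported for the job, in MB
    request_memory_mb -- memory the job requested, in MB
    runtime_s         -- how long the job has been running, in seconds
    min_fraction      -- minimum fraction of the request the job must use
    grace_s           -- running time before the rule is evaluated at all
    """
    if runtime_s < grace_s:
        return False  # still inside the grace period
    return memory_usage_mb < min_fraction * request_memory_mb


# Original rule: 20% after one hour. A job whose usage is never updated
# stays at 0 MB and is held no matter what it really uses.
print(is_wasting_memory(0, 24576, runtime_s=2 * 3600))

# Relaxed rule: 10% after two hours. A reported usage of 0 MB still fails.
print(is_wasting_memory(0, 24576, runtime_s=2 * 3600,
                        min_fraction=0.10, grace_s=2 * 3600))
```

Both calls return True, which is the point: relaxing the threshold does not help as long as the reported usage stays at zero.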
Below is a message I got from one of the users demonstrating the issue:

I think there is something wrong with how the cluster is tracking memory usage. I had a batch of jobs out of which many were held with the error "Job Is Wasting Memory using less than 20 percent of requested Memory". But I am also printing memory usage from within the job, and I see it was at least at certain points during the simulation using more than half of the memory I requested.

To be specific - job 1777592.240 - this is the condor log:

000 (1777592.240.000) 2024-07-19 08:16:28 Job submitted from host: <IPHERE:9618?addrs=IPHERE-9618+[IPHERE]-9618&alias=ui01&noUDP&sock=schedd_1012727_10c1>

...

001 (1777592.240.000) 2024-07-19 08:16:30 Job executing on host: <IPHERE:9618?addrs=IPHERE-9618&alias=wn043l&noUDP&sock=startd_2234_eaaa>

SlotName: slot1_12@wn043

CondorScratchDir = "/var/lib/condor/execute/dir_2889673"

Cpus = 1

Disk = 1684347

Memory = 24576

...

006 (1777592.240.000) 2024-07-19 08:16:39 Image size of job updated: 1

0  -  MemoryUsage of job (MB)

0  -  ResidentSetSize of job (KB)

...

012 (1777592.240.000) 2024-07-19 10:16:52 Job was held.

Job Is Wasting Memory using less than 20 percent of requested Memory

Code 26 Subcode 0

...
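For anyone who wants to check a whole batch of logs for the same symptom, the "Image size of job updated" (006) events can be pulled out with a short script. This is a minimal sketch that assumes the plain-text layout shown above; the function name and the regular expressions are illustrative, and the sample text contains only lines copied from this log.

```python
import re

# Event header, e.g. "006 (1777592.240.000) 2024-07-19 08:16:39 Image size ..."
EVENT_RE = re.compile(
    r"^(\d{3}) \((\d+\.\d+\.\d+)\) (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (.*)$"
)
# Detail line, e.g. "0  -  MemoryUsage of job (MB)"
USAGE_RE = re.compile(
    r"^\s*(\d+)\s+-\s+(MemoryUsage|ResidentSetSize) of job \((MB|KB)\)"
)


def image_size_updates(log_text):
    """Return one dict per 006 event with the usage values it reports."""
    updates = []
    current = None
    for line in log_text.splitlines():
        header = EVENT_RE.match(line)
        if header:
            current = None  # any new event ends the previous one
            if header.group(1) == "006":
                current = {"job": header.group(2), "time": header.group(3)}
                updates.append(current)
            continue
        usage = USAGE_RE.match(line)
        if usage and current is not None:
            current[usage.group(2)] = int(usage.group(1))
    return updates


SAMPLE_LOG = """\
006 (1777592.240.000) 2024-07-19 08:16:39 Image size of job updated: 1
    0  -  MemoryUsage of job (MB)
    0  -  ResidentSetSize of job (KB)
...
012 (1777592.240.000) 2024-07-19 10:16:52 Job was held.
...
"""

for update in image_size_updates(SAMPLE_LOG):
    print(update)
```

A job that runs for two hours yet only ever shows an update with both values at 0 is a candidate for the reporting problem rather than for genuinely low usage.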



Now in my code I am printing out the memory allocated during the heavy part of the calculation. The output of my code is below. Just to give a rough picture as to how the simulation works (in terms of memory usage) - the code is performing "sweeps", and every few sweeps the size of the matrices increases (the size of the matrices is related to the "maxlinkdim" value you see in the printout). During each sweep a "PH" object (which is the most memory consuming object) is created and its max size is printed out. As you can see it starts from 10s MiB but at the last sweep it already takes 15GiB. That is the last sweep completed and the next sweep was supposed to use a similar amount of memory. The memory I requested for this job as you can see in the condor log is 24GB, so I don't see why it should have killed it.

Running job for Nx=21, Ny=6, BC=YC, J2=0.125, B=0.75, Sz=0

|psi| = 1.787 MiB, |PH| = 22.651 MiB

After sweep 1 energy=-57.35777686955872 maxlinkdim=50 maxerr=2.67E-04 time=40.112

|psi| = 1.773 MiB, |PH| = 22.029 MiB

After sweep 2 energy=-61.33259770305437 maxlinkdim=50 maxerr=8.95E-04 time=3.824

|psi| = 1.767 MiB, |PH| = 21.833 MiB

After sweep 3 energy=-62.05173382479314 maxlinkdim=50 maxerr=8.81E-04 time=3.291

|psi| = 1.763 MiB, |PH| = 21.833 MiB

After sweep 4 energy=-62.210260132491484 maxlinkdim=50 maxerr=1.02E-03 time=3.777

|psi| = 4.939 MiB, |PH| = 75.107 MiB

After sweep 5 energy=-63.37855785077949 maxlinkdim=100 maxerr=2.43E-04 time=7.774

|psi| = 4.913 MiB, |PH| = 74.529 MiB

After sweep 6 energy=-63.54330972096651 maxlinkdim=100 maxerr=3.35E-04 time=8.961

|psi| = 4.898 MiB, |PH| = 74.278 MiB

After sweep 7 energy=-63.56556323251096 maxlinkdim=100 maxerr=2.07E-04 time=9.056

|psi| = 4.895 MiB, |PH| = 74.252 MiB

After sweep 8 energy=-63.57158268684281 maxlinkdim=100 maxerr=2.08E-04 time=9.007

|psi| = 16.701 MiB, |PH| = 280.079 MiB

After sweep 9 energy=-63.68292524215922 maxlinkdim=200 maxerr=2.23E-05 time=26.626

|psi| = 16.853 MiB, |PH| = 282.479 MiB

After sweep 10 energy=-63.688157744568464 maxlinkdim=200 maxerr=3.43E-05 time=32.239

|psi| = 16.874 MiB, |PH| = 282.805 MiB

After sweep 11 energy=-63.6899897816407 maxlinkdim=200 maxerr=3.40E-05 time=31.535

|psi| = 16.879 MiB, |PH| = 282.896 MiB

After sweep 12 energy=-63.69151969587513 maxlinkdim=200 maxerr=3.30E-05 time=31.935

|psi| = 60.559 MiB, |PH| = 1.039 GiB

After sweep 13 energy=-63.72891047008562 maxlinkdim=400 maxerr=7.03E-06 time=97.186

|psi| = 60.868 MiB, |PH| = 1.043 GiB

After sweep 14 energy=-63.734091125339 maxlinkdim=400 maxerr=1.23E-05 time=113.704

|psi| = 60.986 MiB, |PH| = 1.045 GiB

After sweep 15 energy=-63.73724146661207 maxlinkdim=400 maxerr=1.26E-05 time=112.594

|psi| = 61.066 MiB, |PH| = 1.047 GiB

After sweep 16 energy=-63.73922147989802 maxlinkdim=400 maxerr=1.28E-05 time=111.033

|psi| = 228.513 MiB, |PH| = 3.983 GiB

After sweep 17 energy=-63.75613465345605 maxlinkdim=800 maxerr=3.20E-06 time=885.088

|psi| = 230.347 MiB, |PH| = 4.011 GiB

After sweep 18 energy=-63.759084903839515 maxlinkdim=800 maxerr=5.92E-06 time=1121.049

|psi| = 231.161 MiB, |PH| = 4.024 GiB

After sweep 19 energy=-63.760502315902286 maxlinkdim=800 maxerr=6.52E-06 time=1099.283

|psi| = 231.616 MiB, |PH| = 4.031 GiB

After sweep 20 energy=-63.761208014845515 maxlinkdim=800 maxerr=6.97E-06 time=1010.624

|psi| = 875.139 MiB, |PH| = 15.370 GiB

After sweep 21 energy=-63.76911975644875 maxlinkdim=1600 maxerr=1.80E-06 time=2298.161
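
To put the printed sizes next to the request, the following Python sketch converts a "|psi| = ..., |PH| = ..." line to MiB and compares the PH size with the 24576 MB from the job log. It assumes the MB in the log can be treated as MiB, and it uses the application's own printout as a rough lower bound on what the process holds, not as a measurement of its resident set size. The names REQUEST_MEMORY_MIB and sizes_in_mib are made up for the example.

```python
import re

REQUEST_MEMORY_MIB = 24576  # "Memory = 24576" in the job log, treated as MiB (assumption)
UNIT_TO_MIB = {"KiB": 1 / 1024, "MiB": 1.0, "GiB": 1024.0}
SIZE_RE = re.compile(r"\|(\w+)\| = ([\d.]+) (KiB|MiB|GiB)")


def sizes_in_mib(line):
    """Parse '|psi| = 875.139 MiB, |PH| = 15.370 GiB' into {name: size in MiB}."""
    return {
        name: float(value) * UNIT_TO_MIB[unit]
        for name, value, unit in SIZE_RE.findall(line)
    }


# Last size line printed before the job was held.
last = sizes_in_mib("|psi| = 875.139 MiB, |PH| = 15.370 GiB")
ph_fraction = last["PH"] / REQUEST_MEMORY_MIB
hold_threshold_mib = 0.20 * REQUEST_MEMORY_MIB

print(f"PH size: {last['PH']:.2f} MiB")
print(f"fraction of request: {ph_fraction:.2f}")
print(f"20% threshold: {hold_threshold_mib:.1f} MiB")
```

By this estimate the PH object alone is about 64% of the request, well above the 20% threshold of roughly 4915 MiB, while the job log reports 0 MB.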


Many thanks in advance,
David