
Re: [HTCondor-users] Weirdness with cputime, wall time and cpususage



We are seeing this at Fermilab as well. 

Steve Timm

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Jeff Templon <templon@xxxxxxxxx>
Sent: Saturday, June 15, 2024 9:19 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Weirdness with cputime, wall time and cpususage
 


Hi,

And the question remains: why is RemoteUserCPU so often zero?

$ condor_history  -completedsince $(date -d "8 hour ago" +"%s") -limit 512 -print-format ~templon/done.cpf -wide:132 -limit 16 templon
JOB_ID    Username    CMD              Finished   CPUS     RAM      MEM      ST   CPU_TIME     WALL_TIME  WorkerNode
24909.528 templon     ferm.condor 785     6/15 16:14   1  26.2 MB    26.9 MB C              0      6:11:06 wn-pijl-003
24909.50  templon     ferm.condor 307     6/15 13:53   1  63.9 MB    73.2 MB C        3:51:55      3:51:44 wn-pijl-007
24909.115 templon     ferm.condor 372     6/15 13:51   1  28.0 MB    29.3 MB C              0      3:49:40 wn-pijl-002
24909.100 templon     ferm.condor 357     6/15 13:44   1  65.7 MB    73.2 MB C        3:43:19      3:43:09 wn-pijl-001
24909.151 templon     ferm.condor 408     6/15 13:39   1  73.7 MB    97.7 MB C        3:38:20      3:38:11 wn-pijl-004
24301.0   templon     mers.condor 193     6/15 13:05   1  10.2 MB    12.2 MB C              0     20:43:28 wn-pijl-006
24909.626 templon     ferm.condor 883     6/15 12:59   1  19.3 MB    22.0 MB C              0      2:03:38 wn-sate-068
24909.71  templon     ferm.condor 328     6/15 12:58   1  63.1 MB    73.2 MB C        2:56:45      2:56:37 wn-pijl-007
24909.65  templon     ferm.condor 322     6/15 12:56   1  66.3 MB    73.2 MB C        2:55:30      2:55:20 wn-pijl-001
24301.30  templon     mers.condor 223     6/15 12:39   1  21.6 MB    24.4 MB C              0     20:17:20 wn-sate-073
24909.275 templon     ferm.condor 532     6/15 12:21   1  60.2 MB    73.2 MB C        2:19:47      2:19:41 wn-pijl-005
24909.102 templon     ferm.condor 359     6/15 12:18   1  10.2 MB    12.2 MB C              0      2:16:50 wn-pijl-004
24909.635 templon     ferm.condor 892     6/15 12:13   1 344.0 KB    12.2 MB C              0            1 wn-sate-074
24909.320 templon     ferm.condor 577     6/15 12:13   1  67.2 MB    73.2 MB C        2:11:56      2:11:51 wn-pijl-004
24909.105 templon     ferm.condor 362     6/15 12:12   1  60.0 MB    73.2 MB C        2:11:16      2:11:10 wn-pijl-003
24909.233 templon     ferm.condor 490     6/15 12:04   1  67.2 MB    73.2 MB C        2:03:28      2:03:23 wn-pijl-003
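The raw ClassAd attributes behind the CPU_TIME and WALL_TIME columns can also be pulled directly with autoformat output; something like the query below (the constraint is only an example, adjust owner and limit to taste) lists the jobs where RemoteUserCPU came back as zero:

$ condor_history -constraint 'Owner == "templon" && RemoteUserCPU == 0' \
      -af ClusterId ProcId RemoteWallClockTime RemoteUserCPU RemoteSysCPU CpusUsage \
      -limit 16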

JT


On 15 Jun 2024, at 13:43, Jeff Templon <templon@xxxxxxxxx> wrote:

Hi Thomas,

On 13 Jun 2024, at 14:46, Thomas Hartmann <thomas.hartmann@xxxxxxx> wrote:

Hi Jeff,

regarding the wall time: do you have cores left unallocated, or something along those lines?

Yes, we have cores unallocated. However, my job is extremely CPU bound: it's a Go executable finding prime factors of numbers, so there is really nothing else using any CPU.

I looked further into it.  Here is some example information from a recent job:

This is what /bin/time says:

./factorize -M 193  74671.89 user 144.01 system 100% cpu 20:43:27 total

If I look at the HTCondor history for this job, I get a wall time of 20:43:28, which agrees almost exactly with what /bin/time prints. On the other hand, RemoteUserCPU is zero for the job, and CpusUsage is 1.003470910485951, leading to a calculated CPU time of 20:47:46, quite a bit more than what /bin/time reports. My suspicion is that CPU time is being calculated over the entire process tree:

  │       ├─condor_starter -f -local-name slot_type_1 -a slot1_27 taai-007.nikhef.nl
  │       │   └─starter
  │       │       ├─appinit
  │       │       │   ├─ferm.condor /user/templon/gofact_er_ng/ferm.condor 842
  │       │       │   │   └─time -f %C  %U user %S system %P cpu %E total ./factorize -F 842
  │       │       │   │       └─factorize -F 842
  │       │       │   │           └─2*[{factorize}]
  │       │       │   └─6*[{appinit}]

In other words, when the CPU utilisation of the job (for me, "the job" means the thing specified in Executable, which is the ferm.condor above) is close to 100%, the reported figure can no longer really be trusted. For a job with a CPU utilisation of, say, 80%, the difference doesn't matter so much.
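(For completeness, the 20:47:46 above is nothing more than the wall time converted to seconds and multiplied by CpusUsage; a quick check:

$ awk 'BEGIN { wall = 20*3600 + 43*60 + 28; printf "%d * 1.003470910485951 = %.2f\n", wall, wall*1.003470910485951 }'
74608 * 1.003470910485951 = 74866.96

and 74866.96 s is 20:47:46.96, i.e. exactly the figure HTCondor reports once truncated to whole seconds.)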

I’m used to Torque, which has a much more unitary approach to CPU and wall time accounting. I’m also realising, just now as I write this, that with Torque our jobs run on bare metal, so the executable is the executable. On our HTCondor system everything runs in a container - possibly the difference is whatever the container is doing during the job, i.e. dealing with file I/O crossing the container boundary.

JT


Since Condor uses CPU weights but not pinning by default, jobs get a relative share of the overall core-weighted CPU time (but not *a* core per se). So, if there are CPU cycles free, a process with a relative CPU-time share can use more cycle time than it is nominally assigned. My guess would therefore be that either not all cores are configured as slots, or some jobs idle for a while and other jobs can jump onto their spare cycles.
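(A hedged aside rather than a tested recipe: if you want to rule pinning in or out, my understanding is that the startd can be told to bind each job to its slot's cores with

ASSIGN_CPU_AFFINITY = True

in the worker-node configuration; with affinity set, a 1-core job should not be able to accumulate noticeably more than one CPU-second per wall-clock second.)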

Cheers,
 Thomas

On 13/06/2024 13.13, Jeff Templon wrote:
Hi,
Looking at the docs, I’d expect that (for single core jobs),
- cpu time will be less than wall time
- wall time * CpusUsage = cpu time
Some observations (using data from condor_history) after the job has completed, in some cases an hour or more ago:
- cpu time is missing (not always, but often)
- CpusUsage is more often than not slightly more than 1.00 (see attachment)
Is this a bug, or an unconventional definition of the various pieces, or some factor I’ve failed to account for?
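For concreteness, the comparison I am making per job looks like this (attribute names as in the history ClassAds; the awk step just multiplies wall time by CpusUsage and prints it next to the reported RemoteUserCPU):

$ condor_history templon -limit 20 -af ClusterId ProcId RemoteWallClockTime CpusUsage RemoteUserCPU | \
    awk '{ printf "%s.%s  wall*usage = %.0f s   RemoteUserCPU = %s s\n", $1, $2, $3*$4, $5 }'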


_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/