$ condor_history -completedsince $(date -d "8 hour ago" +"%s") -limit 512 -print-format ~templon/done.cpf -wide:132 -limit 16 templon
JOB_ID     Username  CMD              Finished    CPUS  RAM       MEM      ST  CPU_TIME  WALL_TIME  WorkerNode
24909.528  templon   ferm.condor 785  6/15 16:14  1     26.2 MB   26.9 MB  C   0         6:11:06    wn-pijl-003
24909.50   templon   ferm.condor 307  6/15 13:53  1     63.9 MB   73.2 MB  C   3:51:55   3:51:44    wn-pijl-007
24909.115  templon   ferm.condor 372  6/15 13:51  1     28.0 MB   29.3 MB  C   0         3:49:40    wn-pijl-002
24909.100  templon   ferm.condor 357  6/15 13:44  1     65.7 MB   73.2 MB  C   3:43:19   3:43:09    wn-pijl-001
24909.151  templon   ferm.condor 408  6/15 13:39  1     73.7 MB   97.7 MB  C   3:38:20   3:38:11    wn-pijl-004
24301.0    templon   mers.condor 193  6/15 13:05  1     10.2 MB   12.2 MB  C   0         20:43:28   wn-pijl-006
24909.626  templon   ferm.condor 883  6/15 12:59  1     19.3 MB   22.0 MB  C   0         2:03:38    wn-sate-068
24909.71   templon   ferm.condor 328  6/15 12:58  1     63.1 MB   73.2 MB  C   2:56:45   2:56:37    wn-pijl-007
24909.65   templon   ferm.condor 322  6/15 12:56  1     66.3 MB   73.2 MB  C   2:55:30   2:55:20    wn-pijl-001
24301.30   templon   mers.condor 223  6/15 12:39  1     21.6 MB   24.4 MB  C   0         20:17:20   wn-sate-073
24909.275  templon   ferm.condor 532  6/15 12:21  1     60.2 MB   73.2 MB  C   2:19:47   2:19:41    wn-pijl-005
24909.102  templon   ferm.condor 359  6/15 12:18  1     10.2 MB   12.2 MB  C   0         2:16:50    wn-pijl-004
24909.635  templon   ferm.condor 892  6/15 12:13  1     344.0 KB  12.2 MB  C   0         1          wn-sate-074
24909.320  templon   ferm.condor 577  6/15 12:13  1     67.2 MB   73.2 MB  C   2:11:56   2:11:51    wn-pijl-004
24909.105  templon   ferm.condor 362  6/15 12:12  1     60.0 MB   73.2 MB  C   2:11:16   2:11:10    wn-pijl-003
24909.233  templon   ferm.condor 490  6/15 12:04  1     67.2 MB   73.2 MB  C   2:03:28   2:03:23    wn-pijl-003
JT
On 15 Jun 2024, at 13:43, Jeff Templon <templon@xxxxxxxxx> wrote:
Hi Thomas,
On 13 Jun 2024, at 14:46, Thomas Hartmann <thomas.hartmann@xxxxxxx> wrote:
Hi Jeff,
regarding the wall time: do you have cores unallocated, or something like that?
Yes, we have cores unallocated. However, my job is extremely CPU bound: it's a golang executable finding prime factors of numbers, so there is really nothing else using any CPU.
I looked further into it. Here is some example information from a recent job:
This is what /bin/time says:
./factorize -M 193 74671.89 user 144.01 system 100% cpu 20:43:27 total
If I look at the HTCondor history for this job, I get a Walltime of 20:43:28, which agrees almost perfectly with what /bin/time prints. OTOH, RemoteUserCPU is zero for the job, and CpusUsage is 1.003470910485951, leading to a calculated CPU time of 20:47:46, quite a bit more than what /bin/time is printing. My suspicion is that CPU time is being calculated over the entire process tree:
│ ├─condor_starter -f -local-name slot_type_1 -a slot1_27 taai-007.nikhef.nl
│ │ └─starter
│ │   └─appinit
│ │     ├─ferm.condor /user/templon/gofact_er_ng/ferm.condor 842
│ │     │ └─time -f %C %U user %S system %P cpu %E total ./factorize -F 842
│ │     │   └─factorize -F 842
│ │     │     └─2*[{factorize}]
│ │     └─6*[{appinit}]
In other words, when CPU utilisation of the job (by "the job" I mean the thing specified in Executable, i.e. the ferm.condor above) is close to 100%, the figure can no longer really be trusted. For a job with a CPU utilisation of, say, 80%, the difference doesn't matter so much.
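As a sanity check, here is the arithmetic in a few lines of throwaway Python (numbers copied from the job above; the hms_to_s helper is just for this sketch):

def hms_to_s(hms):
    # convert "H:MM:SS" to seconds
    h, m, s = (int(x) for x in hms.split(":"))
    return h * 3600 + m * 60 + s

wall = hms_to_s("20:43:28")              # Walltime from condor_history
cpus_usage = 1.003470910485951           # CpusUsage from condor_history
t_user, t_sys = 74671.89, 144.01         # user/system seconds from /bin/time

condor_cpu = wall * cpus_usage           # the "calculated CPU time"
print(round(condor_cpu))                 # ~74867 s, i.e. 20:47:47 (the 20:47:46
                                         # above, modulo rounding)
print(round(t_user + t_sys))             # ~74816 s, i.e. 20:46:56
print(round(condor_cpu - t_user - t_sys))  # ~51 s of extra CPU, plausibly the
                                           # wrapper processes in the tree above

The ~51 seconds of CPU beyond what /bin/time reports would fit the suspicion that the accounting covers the whole process tree, not just the executable.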
I'm used to Torque, which has a much more unitary approach to CPU and wall time accounting. I'm also realising, just now as I write this, that with Torque our jobs are bare metal, so the executable is the executable. On our HTCondor system everything runs in a container; possibly the difference is in whatever the container is doing during the job, i.e. dealing with file I/O crossing the container boundary.
JT
Since Condor uses cpu weights but not pinning by default, jobs get a relative share of the overall core-weighted CPU time (but not *a* core per se). So, if there are CPU cycles free, a process with a relative CPU time share can get more cycle time than nominally assigned. My guess would therefore be that either not all of your cores are slots, or that some jobs idle around for a while and other jobs can jump onto their spare cycles.
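A minimal sketch of how one could inspect this on a worker node, assuming cgroup v2 and a hypothetical slot cgroup path (the actual path depends on the site's setup):

from pathlib import Path

cg = Path("/sys/fs/cgroup/htcondor/slot1_27")   # hypothetical, site-specific

def read(name):
    f = cg / name
    return f.read_text().strip() if f.is_file() else "<not present>"

print("cpu.weight  =", read("cpu.weight"))    # relative share (default 100), not a cap
print("cpu.max     =", read("cpu.max"))       # "max <period>" means no hard quota
print("cpuset.cpus =", read("cpuset.cpus"))   # empty/absent means no core pinning

With only a weight set and neither a quota nor pinning, a runnable process is free to soak up idle cycles beyond its nominal one-core share, which would show up as a CpusUsage slightly above 1.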
Cheers,
Thomas
On 13/06/2024 13.13, Jeff Templon wrote:
Hi,
Looking at the docs, I'd expect that (for single-core jobs):
- cpu time will be less than wall time
- wall time * CpusUsage = cpu time
Some observations (using data from condor_history) after the job has completed, in some cases an hour or more ago:
- cpu time is missing (not always, but often)
- CpusUsage is more often than not slightly more than 1.00 (see attachment)
Is this a bug, an unconventional definition of the various pieces, or some factor I've failed to account for?
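A minimal sketch of this kind of cross-check, assuming Python over condor_history's -af autoformat output (the attribute names are the standard job-ad ones; "undefined" is what condor prints for missing attributes):

import subprocess

out = subprocess.run(
    ["condor_history", "-limit", "20", "-af",
     "ClusterId", "ProcId", "RemoteWallClockTime",
     "CpusUsage", "RemoteUserCpu", "RemoteSysCpu"],
    capture_output=True, text=True, check=True,
).stdout

for line in out.splitlines():
    fields = line.split()
    if "undefined" in fields:            # an attribute is missing for this job
        print(line, "<- missing attribute")
        continue
    cluster, proc, wall, usage, ucpu, scpu = fields
    cpu = float(ucpu) + float(scpu)      # recorded CPU time
    est = float(wall) * float(usage)     # wall time * CpusUsage
    print(f"{cluster}.{proc}: wall*CpusUsage={est:.0f}s cpu={cpu:.0f}s")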
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/