$ condor_history -completedsince $(date -d "8 hour ago" +"%s") -limit 512 -print-format ~templon/done.cpf -wide:132 -limit 16 templon
JOB_ID     Username  CMD              Finished    CPUS  RAM       MEM      ST  CPU_TIME  WALL_TIME  WorkerNode
24909.528  templon   ferm.condor 785  6/15 16:14  1     26.2 MB   26.9 MB  C   0         6:11:06    wn-pijl-003
24909.50   templon   ferm.condor 307  6/15 13:53  1     63.9 MB   73.2 MB  C   3:51:55   3:51:44    wn-pijl-007
24909.115  templon   ferm.condor 372  6/15 13:51  1     28.0 MB   29.3 MB  C   0         3:49:40    wn-pijl-002
24909.100  templon   ferm.condor 357  6/15 13:44  1     65.7 MB   73.2 MB  C   3:43:19   3:43:09    wn-pijl-001
24909.151  templon   ferm.condor 408  6/15 13:39  1     73.7 MB   97.7 MB  C   3:38:20   3:38:11    wn-pijl-004
24301.0    templon   mers.condor 193  6/15 13:05  1     10.2 MB   12.2 MB  C   0         20:43:28   wn-pijl-006
24909.626  templon   ferm.condor 883  6/15 12:59  1     19.3 MB   22.0 MB  C   0         2:03:38    wn-sate-068
24909.71   templon   ferm.condor 328  6/15 12:58  1     63.1 MB   73.2 MB  C   2:56:45   2:56:37    wn-pijl-007
24909.65   templon   ferm.condor 322  6/15 12:56  1     66.3 MB   73.2 MB  C   2:55:30   2:55:20    wn-pijl-001
24301.30   templon   mers.condor 223  6/15 12:39  1     21.6 MB   24.4 MB  C   0         20:17:20   wn-sate-073
24909.275  templon   ferm.condor 532  6/15 12:21  1     60.2 MB   73.2 MB  C   2:19:47   2:19:41    wn-pijl-005
24909.102  templon   ferm.condor 359  6/15 12:18  1     10.2 MB   12.2 MB  C   0         2:16:50    wn-pijl-004
24909.635  templon   ferm.condor 892  6/15 12:13  1     344.0 KB  12.2 MB  C   0         1          wn-sate-074
24909.320  templon   ferm.condor 577  6/15 12:13  1     67.2 MB   73.2 MB  C   2:11:56   2:11:51    wn-pijl-004
24909.105  templon   ferm.condor 362  6/15 12:12  1     60.0 MB   73.2 MB  C   2:11:16   2:11:10    wn-pijl-003
24909.233  templon   ferm.condor 490  6/15 12:04  1     67.2 MB   73.2 MB  C   2:03:28   2:03:23    wn-pijl-003
JT
On 15 Jun 2024, at 13:43, Jeff Templon <templon@xxxxxxxxx> wrote:
Hi Thomas,
On 13 Jun 2024, at 14:46, Thomas Hartmann <thomas.hartmann@xxxxxxx> wrote:
Hi Jeff,
regarding the wall time: do you have cores unallocated, or something like that?
Yes, we have cores unallocated. However, my job is extremely CPU bound: it's a golang executable finding prime factors of numbers, so there is really nothing else using any CPU.
I looked further into it. Here is some example information from a recent job:
This is what /bin/time says:
./factorize -M 193 74671.89 user 144.01 system 100% cpu 20:43:27 total
If I look at the HTCondor history for this job, I get a Walltime of 20:43:28, which agrees almost perfectly with what /bin/time prints. OTOH, RemoteUserCPU is zero for the job, and CpusUsage is 1.003470910485951, leading to a calculated CPU time of 20:47:46, quite a bit more than what /bin/time is printing. My suspicion is that CPU time is being calculated over the entire process tree:
│ ├─condor_starter -f -local-name slot_type_1 -a slot1_27 taai-007.nikhef.nl
│ │ └─starter
│ │   └─appinit
│ │     ├─ferm.condor /user/templon/gofact_er_ng/ferm.condor 842
│ │     │ └─time -f %C %U user %S system %P cpu %E total ./factorize -F 842
│ │     │   └─factorize -F 842
│ │     │     └─2*[{factorize}]
│ │     └─6*[{appinit}]
In other words, when CPU utilisation of the job (by "the job" I mean the thing specified in Executable, i.e. the ferm.condor above) is close to 100%, the figure can no longer really be trusted. For a job with a CPU utilisation of, say, 80%, the difference doesn't matter so much.
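As a sanity check, here is the arithmetic in a few lines of throwaway Python (numbers copied from the job above; the hms_to_s helper is just for this sketch):

def hms_to_s(hms):
    # convert "H:MM:SS" to seconds
    h, m, s = (int(x) for x in hms.split(":"))
    return h * 3600 + m * 60 + s

wall = hms_to_s("20:43:28")              # Walltime from condor_history
cpus_usage = 1.003470910485951           # CpusUsage from condor_history
t_user, t_sys = 74671.89, 144.01         # user/system seconds from /bin/time

condor_cpu = wall * cpus_usage           # the "calculated CPU time"
print(round(condor_cpu))                 # ~74867 s, i.e. 20:47:47 (the 20:47:46
                                         # above, modulo rounding)
print(round(t_user + t_sys))             # ~74816 s, i.e. 20:46:56
print(round(condor_cpu - t_user - t_sys))  # ~51 s of extra CPU, plausibly the
                                           # wrapper processes in the tree above

The ~51 seconds of CPU beyond what /bin/time reports would fit the suspicion that the accounting covers the whole process tree, not just the executable.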
I'm used to Torque, which has a much more unitary approach to CPU and wall time accounting. I'm also realising, just now as I write this, that with Torque our jobs are bare metal, so the executable is the executable. On our HTCondor system everything runs in a container; possibly the difference is in whatever the container is doing during the job, i.e. dealing with file I/O crossing the container boundary.
JT
Since Condor uses cpu weights but not pinning by default, jobs get a relative share of the overall core-weighted CPU time (but not *a* core per se). So, if there are CPU cycles free, a process with a relative CPU time share can get more cycle time than nominally assigned. My guess would therefore be that either not all of your cores are slots, or that some jobs idle around for a while and other jobs can jump onto their spare cycles.
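A minimal sketch of how one could inspect this on a worker node, assuming cgroup v2 and a hypothetical slot cgroup path (the actual path depends on the site's setup):

from pathlib import Path

cg = Path("/sys/fs/cgroup/htcondor/slot1_27")   # hypothetical, site-specific

def read(name):
    f = cg / name
    return f.read_text().strip() if f.is_file() else "<not present>"

print("cpu.weight  =", read("cpu.weight"))    # relative share (default 100), not a cap
print("cpu.max     =", read("cpu.max"))       # "max <period>" means no hard quota
print("cpuset.cpus =", read("cpuset.cpus"))   # empty/absent means no core pinning

With only a weight set and neither a quota nor pinning, a runnable process is free to soak up idle cycles beyond its nominal one-core share, which would show up as a CpusUsage slightly above 1.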
Cheers,
Thomas
On 13/06/2024 13.13, Jeff Templon wrote:
Hi,
Looking at the docs, I'd expect that (for single-core jobs):
- cpu time will be less than wall time
- wall time * CpusUsage = cpu time
Some observations (using data from condor_history) after the job has completed, in some cases an hour or more ago:
- cpu time is missing (not always, but often)
- CpusUsage is more often than not slightly more than 1.00 (see attachment)
Is this a bug, an unconventional definition of the various pieces, or some factor I've failed to account for?
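A minimal sketch of this kind of cross-check, assuming Python over condor_history's -af autoformat output (the attribute names are the standard job-ad ones; "undefined" is what condor prints for missing attributes):

import subprocess

out = subprocess.run(
    ["condor_history", "-limit", "20", "-af",
     "ClusterId", "ProcId", "RemoteWallClockTime",
     "CpusUsage", "RemoteUserCpu", "RemoteSysCpu"],
    capture_output=True, text=True, check=True,
).stdout

for line in out.splitlines():
    fields = line.split()
    if "undefined" in fields:            # an attribute is missing for this job
        print(line, "<- missing attribute")
        continue
    cluster, proc, wall, usage, ucpu, scpu = fields
    cpu = float(ucpu) + float(scpu)      # recorded CPU time
    est = float(wall) * float(usage)     # wall time * CpusUsage
    print(f"{cluster}.{proc}: wall*CpusUsage={est:.0f}s cpu={cpu:.0f}s")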
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/