
Re: [HTCondor-users] Weirdness with cputime, wall time and cpususage



Hi Thomas,

On 13 Jun 2024, at 14:46, Thomas Hartmann <thomas.hartmann@xxxxxxx> wrote:

Hi Jeff,

regarding the wall time: do you have cores unallocated or so?

Yes, we have cores unallocated.  However, my job is extremely CPU bound: it's a golang executable that finds prime factors of numbers, so there is really nothing else using any CPU.

I looked further into it.  Here is some example information from a recent job:

This is what /bin/time says:

./factorize -M 193  74671.89 user 144.01 system 100% cpu 20:43:27 total

If I look at the HTCondor history for this job, I get a Walltime of 20:43:28, which agrees almost exactly with what /bin/time prints.  OTOH, RemoteUserCPU is zero for the job, and CpusUsage is 1.003470910485951, leading to a calculated CPU time of 20:47:46, quite a bit more than /bin/time is printing.  My suspicion is that CPU time is being calculated over the entire process tree:

  │       └─condor_starter -f -local-name slot_type_1 -a slot1_27 taai-007.nikhef.nl
  │           └─starter
  │               └─appinit
  │                   ├─ferm.condor /user/templon/gofact_er_ng/ferm.condor 842
  │                   │   └─time -f %C  %U user %S system %P cpu %E total ./factorize -F 842
  │                   │       └─factorize -F 842
  │                   │           └─2*[{factorize}]
  │                   └─6*[{appinit}]

In other words, when the CPU utilisation of the job (for me, "the job" means the thing specified in Executable, which is the ferm.condor above) is close to 100%, it can no longer really be trusted.  For a job with a CPU utilisation of, say, 80%, the difference doesn't matter so much.
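To make the comparison concrete, here is the arithmetic as a small Python sketch (the numbers are the ones quoted above; only the hh:mm:ss conversion is added for readability):

#!/usr/bin/env python3
# Compare the CPU time reported by /bin/time with the value implied by
# HTCondor's Walltime * CpusUsage for the job quoted above.

def hms(seconds: float) -> str:
    """Format a number of seconds as H:MM:SS."""
    s = int(round(seconds))
    return f"{s // 3600}:{s % 3600 // 60:02d}:{s % 60:02d}"

walltime   = 20 * 3600 + 43 * 60 + 28   # Walltime 20:43:28 from condor_history
cpus_usage = 1.003470910485951          # CpusUsage from condor_history
time_cpu   = 74671.89 + 144.01          # user + system seconds from /bin/time

condor_cpu = walltime * cpus_usage      # what Walltime * CpusUsage implies

print("/bin/time CPU:        ", hms(time_cpu))     # about 20:46:56
print("Walltime * CpusUsage: ", hms(condor_cpu))   # close to the 20:47:46 quoted above
print(f"difference:            {condor_cpu - time_cpu:.0f} s")

The fifty-odd seconds of extra accounted CPU time are what push CpusUsage just above 1.0, since the payload alone already keeps one core essentially fully busy for the whole walltime.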

I'm used to Torque, which has a much more unitary approach to CPU and wall time accounting.  I'm also realising, just now as I write this, that with Torque our jobs run on bare metal, so the executable is the executable.  On our HTCondor system everything runs in a container - possibly the difference lies in whatever the container is doing during the job, i.e. dealing with file I/O crossing the container boundary.
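If the extra accounted CPU time really comes from the wrapper processes and container plumbing rather than from factorize itself, that should be visible by comparing what the payload's own children accumulate (essentially what /bin/time reports) with what the job's cgroup is charged (presumably closer to what the starter reports).  A minimal sketch, assuming cgroup v2 and that the job's cgroup is what appears at /sys/fs/cgroup from inside the container - both of which are assumptions about the local setup:

#!/usr/bin/env python3
# Rough comparison of process-tree CPU time (what /bin/time sees) with the
# cgroup-level accounting that the starter presumably uses.
# Assumes cgroup v2 with the job's cgroup visible at /sys/fs/cgroup.
import resource

def cgroup_cpu_seconds(path="/sys/fs/cgroup/cpu.stat") -> float:
    """Total CPU time charged to the cgroup, in seconds."""
    with open(path) as f:
        stats = dict(line.split() for line in f if line.strip())
    return int(stats["usage_usec"]) / 1e6

def children_cpu_seconds() -> float:
    """User + system time of this process's waited-for children."""
    ru = resource.getrusage(resource.RUSAGE_CHILDREN)
    return ru.ru_utime + ru.ru_stime

# Run from the job wrapper right after the './factorize ...' child exits:
print(f"cgroup total : {cgroup_cpu_seconds():10.2f} s")
print(f"children only: {children_cpu_seconds():10.2f} s")

If the cgroup figure sits consistently above the children-only figure, the gap is coming from the wrapper/container side rather than from the payload.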

JT


Since Condor uses cpu weights but not pinning by default, jobs get a relative share of the overall core-weighted CPU time (but not *a* core per se). So, if there are CPU cycles free, a process with a relative CPU-time share can get more cycle time than nominally assigned. My guess would therefore be that either not all of your cores are configured as slots, or that some jobs idle for part of the time and other jobs can jump onto their cycles.
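One way to check this on a worker node would be to look at the slot's cgroup directly: if the job only has a proportional weight and no hard quota, nothing stops a nominally single-core job from borrowing idle cycles. A sketch, again assuming cgroup v2 (under v1 the corresponding files are cpu.shares and cpu.cfs_quota_us); the cgroup path has to be filled in for the local setup:

#!/usr/bin/env python3
# Report whether a slot cgroup has a hard CPU quota (cpu.max) or only a
# proportional weight (cpu.weight).  With a weight only, a single-core job
# can exceed one core's worth of cycles whenever other cores are idle,
# which shows up as CpusUsage > 1.
from pathlib import Path
import sys

def describe(cgroup: Path) -> None:
    weight = (cgroup / "cpu.weight").read_text().strip()
    quota = (cgroup / "cpu.max").read_text().strip()   # "max <period>" means no hard cap
    print(f"{cgroup}: cpu.weight={weight} cpu.max={quota!r}")
    if quota.startswith("max"):
        print("  -> proportional share only; bursting above one core is allowed")

if __name__ == "__main__":
    if len(sys.argv) != 2:
        sys.exit("usage: check_slot_quota.py /sys/fs/cgroup/<path-to-slot-cgroup>")
    describe(Path(sys.argv[1]))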

Cheers,
 Thomas

On 13/06/2024 13.13, Jeff Templon wrote:
Hi,
Looking at the docs, I'd expect that (for single core jobs),
- cpu time will be less than wall time
- wall time * CpusUsage = cpu time
Some observations (using data from condor_history) after the job has completed, in some cases an hour or more ago:
- cpu time is missing (not always, but often)
- CpusUsage is more often than not slightly more than 1.00 (see attachment)
Is this a bug, or an unconventional definition of the various pieces, or some factor I've failed to account for?
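For reference, the relevant attributes can be pulled from the history for many jobs at once with the Python bindings, which makes the wall/CPU/CpusUsage comparison easy to repeat; a minimal sketch, assuming the htcondor bindings are installed (the exact history() signature varies somewhat between releases, and RemoteUserCPU may simply be absent, as observed above):

#!/usr/bin/env python3
# Pull wall time, CPU time and CpusUsage for recently completed jobs from
# the schedd history and print the implied CPU time, using the HTCondor
# Python bindings.
import htcondor

attrs = ["ClusterId", "ProcId", "RemoteWallClockTime",
         "RemoteUserCPU", "RemoteSysCPU", "CpusUsage"]

schedd = htcondor.Schedd()
for ad in schedd.history("JobStatus == 4", projection=attrs, match=50):
    wall = ad.get("RemoteWallClockTime", 0)
    cpu = ad.get("RemoteUserCPU", 0) + ad.get("RemoteSysCPU", 0)
    usage = ad.get("CpusUsage")
    if wall and usage is not None:
        print(f"{ad['ClusterId']}.{ad['ProcId']}: wall={wall:.0f}s "
              f"cpu={cpu:.0f}s CpusUsage={usage:.3f} "
              f"wall*usage={wall * usage:.0f}s")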
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/