Hi Max,
I am not positive, but my recollection is the CpusUsage attribute
in the job ad is averaged over the lifetime of the job. This is
also what the documentation for this job attribute states at
https://htcondor.readthedocs.io/en/latest/classad-attributes/job-classad-attributes.html
Some quick observations / questions:
Re expression [1] below: if your site's job policy allows for job
suspension, note that this expression does not take suspension
time into account.
Re the discrepancies between expression [2] and CpusUsage from
condor_history: this is very interesting. A question: did you
limit the jobs you inspected to vanilla universe jobs that
completed normally, or did your analysis include jobs that were
removed, scheduler universe jobs, etc.? Is there any notable
pattern separating the jobs in history with a sensible CpusUsage
from those with a bogus one, such as all the bogus jobs being
container universe jobs?
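If it helps to look for such a pattern, here is a rough sketch
(not anything HTCondor ships) that recomputes expression [2] per
job and tallies sensible vs bogus CpusUsage by JobUniverse and
JobStatus. It assumes the job ads are already loaded as Python
dicts, for example parsed from condor_history JSON output; the
function names and the 25% tolerance are made up for illustration.

```python
from collections import defaultdict


def computed_cpu_usage(ad):
    """Expression [2]: CPU seconds divided by unsuspended wall-clock seconds."""
    cpu_seconds = (ad.get("CumulativeRemoteSysCpu", 0.0)
                   + ad.get("CumulativeRemoteUserCpu", 0.0))
    wall_seconds = (ad.get("RemoteWallClockTime", 0.0)
                    - ad.get("CumulativeSuspensionTime", 0.0))
    if wall_seconds <= 0:
        return None  # no usable run time recorded for this job
    return cpu_seconds / wall_seconds


def classify(ad, rel_tolerance=0.25):
    """Label the recorded CpusUsage relative to the computed value.

    rel_tolerance is an arbitrary, hypothetical threshold.
    """
    computed = computed_cpu_usage(ad)
    recorded = ad.get("CpusUsage")
    if computed is None or recorded is None:
        return "unknown"
    if computed == 0:
        return "sensible" if recorded == 0 else "bogus"
    if abs(recorded - computed) / computed <= rel_tolerance:
        return "sensible"
    return "bogus"


def tally(ads):
    """Count labels per (JobUniverse, JobStatus) pair."""
    counts = defaultdict(lambda: defaultdict(int))
    for ad in ads:
        key = (ad.get("JobUniverse"), ad.get("JobStatus"))
        counts[key][classify(ad)] += 1
    return {key: dict(labels) for key, labels in counts.items()}


if __name__ == "__main__":
    # Toy ads mirroring the two examples from the original message.
    ads = [
        {"JobUniverse": 5, "JobStatus": 4, "CpusUsage": 7.86,
         "CumulativeRemoteSysCpu": 780.0, "CumulativeRemoteUserCpu": 7000.0,
         "RemoteWallClockTime": 1000.0, "CumulativeSuspensionTime": 0.0},
        {"JobUniverse": 5, "JobStatus": 4, "CpusUsage": 0.11,
         "CumulativeRemoteSysCpu": 220.0, "CumulativeRemoteUserCpu": 7000.0,
         "RemoteWallClockTime": 1000.0, "CumulativeSuspensionTime": 0.0},
    ]
    print(tally(ads))
```

JobUniverse 5 is vanilla and JobStatus 4 is completed, so a tally
dominated by other keys on the bogus side would already be a hint.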
Thank you for sharing, and hope this helps,
best regards,
Todd
On 3/6/2024 3:04 AM, Fischer, Max (SCC) wrote:
Hi all,
We are currently putting an extra close eye on CPU usage and I'm a bit confused by the options available (let's not delve into what is "the" CPU usage). I'm using both the inbuilt CPUsUsage (via ProcFamily CgroupV1) and computed expressions for running [1] and completed [2] jobs. Of course they don't quite agree so I'm interested if I'm doing it right and if anyone has better suggestions.
For running jobs the CPUsUsage is consistently higher than the computed value but rather close (e.g. 9.02 vs 8.7).
- The docs for CPUsUsage say it's the one-minute CPU usage. However, if I'm reading the code [3] right only the total cpu metrics for the entire job cgroup are collected and used. So is CPUsUsage over a specific time range or the entire lifetime of the job?
- Is my expression basically replicating what CPUsUsage is doing and just limited by timing resolution?
For completed jobs the CPUsUsage is sometimes sensible (e.g. 7.86 vs 7.78) but oftentimes completely bogus (e.g. 0.11 vs 7.22).
- Is the CPUsUsage actually meaningful in the history?
- Can we somehow record the peak or average CPUsUsage in history?
Cheers,
Max
[1] expression for condor_q -run
'(RemoteSysCPU + RemoteUserCpu) / (ServerTime - JobCurrentStartDate)'
[2] expression for condor_history
'(CumulativeRemoteSysCpu + CumulativeRemoteUserCpu) / (RemoteWallClockTime - CumulativeSuspensionTime)'
[3] ProcFamilyDirectCgroupV1::get_usage
https://github.com/htcondor/htcondor/blob/66aadf0278a07ee219eaa184068403c7dee1db4d/src/condor_utils/proc_family_direct_cgroup_v1.cpp#L292-L338