Hi Todd,
Thanks for the reality check, now I feel a bit embarrassed - no idea where I read the one minute average thing. -_-;
As for the discrepancy between CpusUsage and expression [2]: I'm only looking at our grid queue, so no suspensions, no scheduler universe, or other obvious "exotic" situations. All execution points are practically the same and running on HTCondor 10, but the entry points still are HTCondor 9 (due to them being CEs with GSI auth).
Some things I've checked:
- It occurs for about 25%-30% of jobs.
- NumJobMatches and NumShadowStarts are 1, so no mismatch between cumulative and recent execution.
- CommittedTime is usually way more than an hour, so this shouldn't be an initialisation issue.
- No relation to submitter, CPU or memory request. No relation to few, specific machines. No relation to failed/completed jobs.
- All computations (C and ClassAd) are 64 bit ints and doubles holding real time ranges, so overflow or precision should not be an issue.
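To make the "25%-30% of jobs" figure reproducible, here is a minimal sketch of how such a fraction could be counted. The function names and the 20% relative tolerance are assumptions for illustration, not something HTCondor defines; the input is assumed to be a list of (CpusUsage, computed value) pairs already extracted from the history.

```python
def is_discrepant(reported, computed, rel_tol=0.2):
    """Return True if the reported CpusUsage deviates from the computed
    value by more than rel_tol (relative to the computed value).
    The default tolerance of 0.2 is an arbitrary assumption."""
    if computed <= 0:
        # No meaningful reference value; flag only if something was reported.
        return reported > 0
    return abs(reported - computed) / computed > rel_tol


def discrepant_fraction(pairs, rel_tol=0.2):
    """Fraction of (reported, computed) pairs flagged as discrepant."""
    if not pairs:
        return 0.0
    flagged = sum(1 for reported, computed in pairs
                  if is_discrepant(reported, computed, rel_tol))
    return flagged / len(pairs)


if __name__ == "__main__":
    # The two example pairs quoted in this thread: one sensible, one bogus.
    pairs = [(7.86, 7.78), (0.11, 7.22)]
    print(discrepant_fraction(pairs))
```

With the two example pairs from this thread, the first is within tolerance and the second is flagged.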
Let me know if you have any other idea what I could double-check.
Cheers,
Max
On 6. Mar 2024, at 20:19, Todd Tannenbaum <tannenba@xxxxxxxxxxx> wrote:
Hi Max,
I am not positive, but my recollection is the CpusUsage attribute
in the job ad is averaged over the lifetime of the job. This is
also what the documentation for this job attribute states at
https://htcondor.readthedocs.io/en/latest/classad-attributes/job-classad-attributes.html
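The difference between a lifetime average and a short-window average can be illustrated with a small sketch. It assumes a hypothetical, non-empty list of (seconds since job start, cumulative CPU seconds) samples, similar in spirit to cumulative cgroup CPU counters; the function names are made up for illustration and this is not how HTCondor itself computes CpusUsage.

```python
def lifetime_average(samples):
    """Average number of CPUs used between the first and last sample.
    samples: list of (wall_seconds, cumulative_cpu_seconds), time-ordered."""
    t_start, cpu_start = samples[0]
    t_end, cpu_end = samples[-1]
    elapsed = t_end - t_start
    return (cpu_end - cpu_start) / elapsed if elapsed > 0 else 0.0


def window_average(samples, window):
    """Average number of CPUs used over the last `window` seconds,
    starting from the earliest sample that lies inside the window."""
    t_end, cpu_end = samples[-1]
    for t, cpu in samples:
        if t >= t_end - window:
            break  # the last sample always satisfies this condition
    elapsed = t_end - t
    return (cpu_end - cpu) / elapsed if elapsed > 0 else 0.0


if __name__ == "__main__":
    # A job that used 8 CPUs for two minutes, then 1 CPU for one minute.
    samples = [(0, 0), (60, 480), (120, 960), (180, 1020)]
    print(lifetime_average(samples))    # averaged over the whole job
    print(window_average(samples, 60))  # averaged over the last minute only
```

For a job whose load changes over time the two values can differ considerably, which is why it matters which one the attribute reports.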
Some quick observations / questions:
Re expression [1] below, if your job policy allows for job
suspension, this expression does not take suspension into account,
if that is something your site policy utilizes.
Re the discrepancies between expression [2] and
CpusUsage from condor_history, this is very interesting....
question: did you limit the jobs you inspected just to vanilla
jobs that completed normally, or did your analysis include jobs
that were removed, scheduler universe jobs, etc ? Any notable
correlation between jobs from history that had a sensible
correlation vs bogus, such as all the bogus jobs were container
universe or ?
Thank you for sharing, and hope this helps,
best regards,
Todd
On 3/6/2024 3:04 AM, Fischer, Max (SCC) wrote:
Hi all,
We are currently putting an extra close eye on CPU usage and I'm a bit confused by the options available (let's not delve into what is "the" CPU usage). I'm using both the inbuilt CPUsUsage (via ProcFamily CgroupV1) and computed expressions for running [1] and completed [2] jobs. Of course they don't quite agree so I'm interested if I'm doing it right and if anyone has better suggestions.
For running jobs the CPUsUsage is consistently higher than the computed value but rather close (e.g. 9.02 vs 8.7).
- The docs for CPUsUsage say it's the one-minute CPU usage. However, if I'm reading the code [3] right only the total cpu metrics for the entire job cgroup are collected and used. So is CPUsUsage over a specific time range or the entire lifetime of the job?
- Is my expression basically replicating what CPUsUsage is doing and just limited by timing resolution?
For completed jobs the CPUsUsage is sometimes sensible (e.g. 7.86 vs 7.78) but oftentimes completely bogus (e.g. 0.11 vs 7.22).
- Is the CPUsUsage actually meaningful in the history?
- Can we somehow record the peak or average CPUsUsage in history?
Cheers,
Max
[1] expression for condor_q -run
'(RemoteSysCPU + RemoteUserCpu) / (ServerTime - JobCurrentStartDate)'
[2] expression for condor_history
'(CumulativeRemoteSysCpu + CumulativeRemoteUserCpu) / (RemoteWallClockTime - CumulativeSuspensionTime)'
[3] ProcFamilyDirectCgroupV1::get_usage
https://github.com/htcondor/htcondor/blob/66aadf0278a07ee219eaa184068403c7dee1db4d/src/condor_utils/proc_family_direct_cgroup_v1.cpp#L292-L338
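For offline cross-checks, expressions [1] and [2] can be sketched in Python as below. This assumes each job ad has already been exported into a plain dict with the attribute names spelled exactly as used in the keys here (ClassAd attribute names are case-insensitive, Python dict keys are not); the function names are hypothetical and the code does not use the HTCondor Python bindings.

```python
def running_cpu_usage(ad):
    """Expression [1]: average CPUs used by a running job since its
    current start. Returns None if no time has elapsed yet."""
    elapsed = ad["ServerTime"] - ad["JobCurrentStartDate"]
    if elapsed <= 0:
        return None
    return (ad["RemoteSysCpu"] + ad["RemoteUserCpu"]) / elapsed


def completed_cpu_usage(ad):
    """Expression [2]: average CPUs used by a completed job over its
    wall clock time, excluding suspension time (treated as 0 if the
    attribute is missing, which is an assumption of this sketch)."""
    active = ad["RemoteWallClockTime"] - ad.get("CumulativeSuspensionTime", 0)
    if active <= 0:
        return None
    cpu = ad["CumulativeRemoteSysCpu"] + ad["CumulativeRemoteUserCpu"]
    return cpu / active


if __name__ == "__main__":
    # Made-up numbers, chosen only to exercise the two functions.
    running = {"ServerTime": 1000, "JobCurrentStartDate": 0,
               "RemoteSysCpu": 700, "RemoteUserCpu": 8000}
    done = {"RemoteWallClockTime": 1000, "CumulativeSuspensionTime": 0,
            "CumulativeRemoteSysCpu": 100, "CumulativeRemoteUserCpu": 7100}
    print(running_cpu_usage(running), completed_cpu_usage(done))
```

Returning None for a non-positive denominator avoids a division by zero for jobs that have only just started or have no recorded wall clock time.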
--
Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing Department of Computer Sciences
Calendar: https://tinyurl.com/yd55mtgd 1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132 Madison, WI 53706-1685