
Re: [HTCondor-users] Weirdness with cpususage



Yes, some of these values look broken. Let me ask a few questions to eliminate some potentially confounding effects:

Are these jobs running in docker universe?

Are all of the jobs running on the same or similarly-configured execution points?

What version of HTCondor is running on the execution points?

Use of docker and cgroups changes how the cpu usage information is collected.

 - Jaime

On Oct 30, 2024, at 4:17 AM, Jeff Templon <templon@xxxxxxxxx> wrote:

Thanks Jaime,

Your explanation is helpful, but it doesn't explain the effect AFAICT. Do you agree? See here:

> condor_history  -completedsince $(date -d "4 days ago" +"%s") -print-format ~templon/done-cpususage.cpf -wide:148 | head
JOB_ID    Username    CMD               Finished   CPUS  CPuse  MEMREQ    RAM      MEM     ST   CPU_TIME    LWALL_TIME    WALL_TIME  WorkerNode
715700.0 roystege    pineko theory -c / 10/30 10:00  32 3.612 264.0 GB   7.2 GB     7.2 GB C       1:33:59         4:25         4:25 wn-knek-014
715686.0 roystege    pineko theory -c / 10/30 09:59  32 31.38 264.0 GB   9.8 MB     9.8 MB C             0      1:44:41      1:44:41 wn-lot-064
715557.0 roystege    pineko theory -c / 10/30 09:59  32 0.280 264.0 GB   9.5 GB     9.5 GB C    9+02:42:51     11:13:46     11:13:46 wn-knek-016
715699.0 roystege    pineko theory -c / 10/30 09:55  32   0.0 264.0 GB   3.4 MB   195.3 MB C             6            4            4 wn-knek-014

LWALL_TIME and WALL_TIME: the "L" is for "Last", as you suggest. They are identical in these cases. See the second line above: "CPuse" is 31.38, and according to you it is "computed from the cpu usage and wall clock time of just the last execution of the job (no file transfer time included)". Yet the CPU used is "0" according to the output. On the next line, the CPU_TIME is about 18 times the wall time, but CPuse is only 0.280. Do you see how both of these are possible without something being wrong?
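
To make the mismatch concrete, here is a rough Python sketch that converts the CPU_TIME and LWALL_TIME columns to seconds and divides them. It assumes that interval() drops leading zero fields, so "4:25" is read as 4 minutes 25 seconds and "6" as 6 seconds. The helper names interval_to_seconds and naive_ratio are made up for this illustration. The ratio is only a proxy, not how HTCondor computes CpusUsage: RemoteUserCpu leaves out system CPU time, and the wall clock attributes include file transfer time.

```python
def interval_to_seconds(text):
    """Convert interval() output ('D+HH:MM:SS', 'H:MM:SS', 'M:SS' or 'S') to seconds."""
    days = 0
    if "+" in text:
        day_part, text = text.split("+", 1)
        days = int(day_part)
    seconds = 0
    # Fold the colon-separated fields left to right in base 60.
    for field in text.split(":"):
        seconds = seconds * 60 + int(field)
    return days * 86400 + seconds


def naive_ratio(cpu_time, wall_time):
    """User CPU seconds per wall clock second (a proxy, not HTCondor's CpusUsage)."""
    wall = interval_to_seconds(wall_time)
    return interval_to_seconds(cpu_time) / wall if wall else float("nan")


# (JOB_ID, CPuse, CPU_TIME, LWALL_TIME) copied from the condor_history output above.
rows = [
    ("715700.0", 3.612, "1:33:59", "4:25"),
    ("715686.0", 31.38, "0", "1:44:41"),
    ("715557.0", 0.280, "9+02:42:51", "11:13:46"),
    ("715699.0", 0.0, "6", "4"),
]

for job_id, cpuse, cpu_time, wall_time in rows:
    ratio = naive_ratio(cpu_time, wall_time)
    print(f"{job_id}  reported CPuse={cpuse:<6}  CPU_TIME/LWALL_TIME={ratio:.2f}")
```

If the reported CPuse and the naive ratio tracked each other, small differences could be put down to system CPU or transfer time. Differences of one or two orders of magnitude, in both directions, are harder to explain that way.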

Thanks,

JT

PS: see the CPF file below; it's slightly different from the one in the previous mail.

SELECT NOSUMMARY
   ClusterId                      AS  JOB_ID      PRINTAS JOB_ID
   Owner                          AS '   Username'
   join(" ",split(Cmd,"/")[size(split(Cmd,"/"))-1], Args)  AS '   CMD' WIDTH -18
   CompletionDate                 AS '  Finished ' PRINTAS DATE
   CpusProvisioned                AS ' CPUS'  WIDTH 3
   CpusUsage                      AS ' CPuse' WIDTH 5
   MemoryProvisioned              AS ' MEMREQ' PRINTAS READABLE_MB  
   ResidentSetSize                AS '   RAM' PRINTAS READABLE_KB  WIDTH 8
   ImageSize                      AS '   MEM' PRINTAS READABLE_KB  WIDTH 10
   JobStatus                      AS "ST"        PRINTAS JOB_STATUS
   interval(RemoteUserCpu)        AS  "  CPU_TIME"    WIDTH 12
   interval(LastRemoteWallClockTime)  AS  " LWALL_TIME"   WIDTH 12
   interval(RemoteWallClockTime)  AS  "  WALL_TIME"   WIDTH 12
   split(splitSlotName(LastRemoteHost)[1], ".")[0]  AS  "WorkerNode" WIDTH -12


On 25 Oct 2024, at 18:08, Jaime Frey via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:

There are a few confounding factors with using these job attributes. RemoteWallClockTime records the time across all execution attempts for the job, whereas RemoteUserCpu records the time from just the last execution attempt. To get the wall clock time for just the last execution, you should use LastRemoteWallClockTime. Also, RemoteWallClockTime and LastRemoteWallClockTime include time spent doing file transfer.
CpusUsage is computed from the cpu usage and wall clock time of just the last execution of the job (no file transfer time included).

 - Jaime
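
A small Python sketch of the bookkeeping described above, using a hypothetical job with two execution attempts. All numbers are invented, and the dictionary keys are names chosen for this illustration, not HTCondor attributes. It only models the description in this message (which attribute covers which attempts, and that wall clock time includes file transfer). It is not HTCondor's actual computation.

```python
# Hypothetical job with two execution attempts; every number here is invented.
# "wall" includes the seconds spent on file transfer, as described above.
attempts = [
    {"wall": 3600, "transfer": 100, "user_cpu": 3000},  # first attempt (e.g. evicted)
    {"wall": 1800, "transfer": 50, "user_cpu": 1700},   # last attempt (completed)
]
last = attempts[-1]

remote_wall_clock_time = sum(a["wall"] for a in attempts)  # all attempts
last_remote_wall_clock_time = last["wall"]                 # last attempt only
remote_user_cpu = last["user_cpu"]                         # last attempt only

# Mixing scopes: last-attempt CPU divided by all-attempt wall time.
mixed_ratio = remote_user_cpu / remote_wall_clock_time

# Matching scopes, but file transfer time is still in the denominator.
last_ratio = remote_user_cpu / last_remote_wall_clock_time

# Matching scopes with transfer time removed, closer to what CpusUsage is said to use.
execution_only_ratio = remote_user_cpu / (last["wall"] - last["transfer"])

print(f"mixed scopes:        {mixed_ratio:.3f}")
print(f"last attempt:        {last_ratio:.3f}")
print(f"last, no transfer:   {execution_only_ratio:.3f}")
```

In this toy model the mixed-scope ratio comes out much lower than the other two, which is why LastRemoteWallClockTime is the better denominator when comparing against RemoteUserCpu.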