
[HTCondor-users] Print formats and job cpu statistics : the deeper I look, the worse it gets



Hi,

What's going on here??

$ condor_q -nobatch -pr $HOME/nan3.cpf  -allusers | grep -E "CPU|fgi"
JOB_ID      User CPUS     MEMREQ        MEMUSE     ST     funCPU       intCPU        fWALL      intWALL      SlotTime Num_Starts   WorkerNode
2115385.0   fgit 64        16.0 GB      7.2 GB      R        0+19:52:05 40+15:55:39   0+19:52:05   2+17:16:13 19:52:05        2 wn-pijl-005.

Print format file:

$ cat nan3.cpf
SELECT NOSUMMARY
   ClusterId                     AS  JOB_ID  PRINTAS JOB_ID
   Owner                         AS User  WIDTH 4
   CpusProvisioned               AS CPUS
   MemoryProvisioned             AS "    MEMREQ  "   PRINTAS READABLE_MB
   MemoryUsage                   AS "     MEMUSE"    PRINTAS READABLE_MB
   JobStatus                     AS "    ST"         PRINTAS JOB_STATUS
   RemoteUserCpu                 AS "    funCPU"   PRINTAS CPU_TIME
   interval(RemoteUserCpu)       AS "    intCPU"
   RemoteWallClockTime           AS "      fWALL"   PRINTAS CPU_TIME
   interval(RemoteWallClockTime) AS "    intWALL"   WIDTH 12
   interval(ServerTime-JobCurrentStartDate) AS "    SlotTime"
   JobRunCount                    AS 'Num_Starts' WIDTH 4
   splitSlotName(RemoteHost)[1] OR "-" AS "  WorkerNode"  WIDTH -12

1. It looks like PRINTAS CPU_TIME is not PRINTAS, it's PRINT: it doesn't matter what time attribute I specify, I get the same output value (funCPU and fWALL both show 0+19:52:05, while the interval() columns for the same attributes differ). It's not clear what kind of bug this is. Maybe this is the intent, but then it does not belong among the PRINTAS formats; compare READABLE_MB above, which is used twice and whose output clearly depends on the given attribute value.

2. The thing I am calling SlotTime is correct: I know when this job most recently started, and 19+ hours matches. RemoteWallClockTime may be correct, but the recommended recipe for getting the current run's WallClockTime fails: CommittedTime is zero (as are CommittedSlotTime and CumulativeSuspensionTime) for this job, and LastRemoteWallClockTime is identical to RemoteWallClockTime.
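
For reference, here is a minimal Python sketch of the arithmetic behind the SlotTime column, i.e. ServerTime minus JobCurrentStartDate rendered as days+HH:MM:SS. The function names and the epoch values are hypothetical, chosen only so that the difference reproduces the 19:52:05 shown above; this is not HTCondor code, and the rendering mimics the funCPU/fWALL column style rather than any particular formatter.

```python
def current_run_seconds(server_time: int, job_current_start_date: int) -> int:
    """Seconds since the most recent start (both arguments are epoch seconds)."""
    return max(0, server_time - job_current_start_date)


def format_interval(seconds: int) -> str:
    """Render a duration in seconds as days+HH:MM:SS."""
    days, rem = divmod(int(seconds), 86400)
    hours, rem = divmod(rem, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{days}+{hours:02d}:{minutes:02d}:{secs:02d}"


if __name__ == "__main__":
    # Hypothetical timestamps: only their difference (71525 s) comes from the output above.
    job_current_start_date = 1_700_000_000
    server_time = job_current_start_date + 71525
    print(format_interval(current_run_seconds(server_time, job_current_start_date)))
```

Computing the difference by hand like this sidesteps CommittedTime and LastRemoteWallClockTime entirely, which is why it is the only column here I trust for the current run.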

3. I recommend a cleanup of the attributes at some point. RemoteUserCpu is only for the current run, but RemoteWallClockTime is cumulative.  Maybe this confusion is responsible for the errors described above, either in my understanding or in the code itself.
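
To show why the mix of per-run and cumulative attributes matters, here is a rough Python sketch that turns the displayed values into a CPU efficiency figure. It assumes, as described above, that RemoteUserCpu covers only the current run; the helper names are hypothetical and the numbers are simply the ones from the condor_q output.

```python
def parse_interval(text: str) -> int:
    """Convert 'D+HH:MM:SS' or 'HH:MM:SS' into seconds."""
    days = 0
    if "+" in text:
        day_part, text = text.split("+", 1)
        days = int(day_part)
    hours, minutes, secs = (int(part) for part in text.split(":"))
    return days * 86400 + hours * 3600 + minutes * 60 + secs


def cpu_efficiency(cpu_seconds: int, wall_seconds: int, cpus: int) -> float:
    """CPU time used divided by the CPU time available (wall time x cores)."""
    return cpu_seconds / (wall_seconds * cpus)


if __name__ == "__main__":
    cpus = 64
    cpu = parse_interval("40+15:55:39")           # intCPU column
    wall_current = parse_interval("19:52:05")     # SlotTime column
    wall_cumulative = parse_interval("2+17:16:13")  # intWALL column

    # Same CPU time, two different denominators.
    print(f"vs current run wall time: {cpu_efficiency(cpu, wall_current, cpus):.2f}")
    print(f"vs cumulative wall time:  {cpu_efficiency(cpu, wall_cumulative, cpus):.2f}")
```

Dividing by the cumulative RemoteWallClockTime gives a much lower figure than dividing by the current run's wall time, so which attribute pairs with which changes the conclusion about the job.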

Have a good weekend,

JT