
Re: [HTCondor-users] Problems with CPU recording



We've fixed some issues with CPU usage reporting, but I don't believe we ever got to the bottom of some of the mismatched numbers you observed in the past. We have more work to do to understand why these numbers disagree for some jobs.

 - Jaime

On Jun 13, 2025, at 4:34 AM, Jeff Templon <templon@xxxxxxxxx> wrote:

Hi,

See this listing (cpf file attached):

$ condor_history  -completedsince $(date -d "12 hours ago" +"%s") -print-format ~templon/anon-cpususage.cpf -wide:156 -constraint 'Cpususage > 2'
JOB_ID      Class    CMD               Finished     Started CPUS CPuse MEMREQ    RAM      MEM     ST   CPU_TIME   WALL_TIME NStrt WorkerNode
2125923.0   mediu python /data/gravw              6/11 12:05  32 21.26 16.0 GB   9.5 GB     9.5 GB  X  35+12:05:5 1+00:03:44     1 wn-pijl-005
2125922.0   mediu python /data/gravw              6/11 12:05  32 3.444 16.0 GB   9.5 GB     9.5 GB  X  35+00:55:3 1+00:02:39     1 wn-pijl-005
2125920.0   mediu python /data/gravw              6/11 12:05  32 2.001 16.0 GB   9.5 GB     9.5 GB  X  36+05:59:3 1+00:03:10     1 wn-pijl-005
2125919.0   mediu python /data/gravw              6/11 12:05  32 9.021 16.0 GB   9.5 GB     9.5 GB  X  36+02:26:0 1+00:04:07     1 wn-pijl-005
2134378.0   short FST_FST_for_Drunca              6/12 13:45   1 4.061 16.0 GB  14.3 GB    14.3 GB  X    16:07:06    4:00:41     1 wn-pijl-006
2133940.0   mediu FST_test2_692971.s              6/12 13:12   1 4.027 16.0 GB  16.7 GB    16.7 GB  X    21:25:18    5:20:19     1 wn-pijl-007
2140524.0   short run_sin2beta.sh     6/13 02:53  6/13 02:08   4 2.479 16.0 GB   7.2 GB     9.5 GB  C     2:33:04      45:13     1 wn-sate-071
2128723.246 mediu exec_n3fit_exit0.s  6/13 02:36  6/12 13:18   4 3.403 16.0 GB   7.2 GB     7.2 GB  C    14:21:07   13:18:14     1 wn-lot-050
2140372.0   short run_sin2beta.sh     6/13 01:57  6/13 01:42   4 8.551 16.0 GB   7.2 GB     9.5 GB  C     2:59:09      15:11     1 wn-sate-060
2128723.100 mediu exec_n3fit_exit0.s  6/13 01:23  6/12 12:55   4 3.376 16.0 GB   7.2 GB     7.2 GB  C    13:33:17   12:27:46     1 wn-lot-064
2139134.0   short bash ./run.sh       6/13 01:01  6/12 22:08   8 3.114 10.0 GB 293.0 MB   293.0 MB  C     9:30:34    2:52:36     1 wn-sate-062
2139134.1   short bash ./run.sh       6/13 00:41  6/12 22:08   8 3.481 10.0 GB 293.0 MB   293.0 MB  C     8:43:05    2:32:43     1 wn-sate-058
2138903.1   short bash ./run.sh       6/13 00:39  6/12 21:31   5 2.796 16.0 GB 341.8 MB   341.8 MB  C     7:20:52    3:07:57     1 wn-sate-057
2138903.4   short bash ./run.sh       6/13 00:32  6/12 21:31   5 2.270 16.0 GB 341.8 MB   341.8 MB  C     6:52:26    3:00:55     1 wn-sate-072
2135158.0   mediu job.sh              6/13 00:10  6/12 14:34   4 3.600 4.0 GB  732.4 MB   732.4 MB  C    14:54:50    9:35:56     1 wn-sate-063
2128721.110 mediu exec_n3fit_exit0.s  6/13 00:01  6/12 12:21   4 2.770 16.0 GB   7.2 GB     7.2 GB  C    12:24:30   11:40:20     1 wn-lot-051
2139134.4   short bash ./run.sh       6/12 23:49  6/12 22:08   8 8.000 10.0 GB 317.4 MB   317.4 MB  C    13:20:51    1:40:39     1 wn-lot-037
2139134.3   short bash ./run.sh       6/12 23:49  6/12 22:08   8 7.999 10.0 GB 317.4 MB   317.4 MB  C    13:20:01    1:40:36     1 wn-lot-043
2139134.2   short bash ./run.sh       6/12 23:42  6/12 22:08   8 8.000 10.0 GB 317.4 MB   317.4 MB  C    12:22:45    1:33:26     1 wn-lot-040
2128721.202 mediu exec_n3fit_exit0.s  6/12 23:36  6/12 12:33   4 3.800 16.0 GB   7.2 GB     7.2 GB  C    11:52:27   11:03:30     1 wn-lot-004

You can see that CpusProvisioned, CpusUsage, RemoteUserCpu, and RemoteWallClockTime are not correctly related in about half the cases. I thought this was supposed to be fixed in recent HTCondor versions? We're running 24.6.0-1. Any ideas?
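As a rough sanity check of that relation (a sketch in Python, nothing from HTCondor itself; it assumes interval() prints days+hh:mm:ss with shorter forms for small values, and the first row's CPU_TIME is truncated by the 10-character column, so the trailing digit below is a hypothetical 0):

```python
def parse_interval(s: str) -> int:
    """Convert a condor_history interval() string such as '35+12:05:50',
    '2:33:04', or '45:13' (days+hh:mm:ss, shorter forms allowed) to seconds."""
    days = 0
    if "+" in s:
        day_part, s = s.split("+")
        days = int(day_part)
    parts = [int(p) for p in s.split(":")]
    while len(parts) < 3:          # pad missing hour/minute fields
        parts.insert(0, 0)
    hours, minutes, seconds = parts
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds

def mean_cpus(user_cpu: str, wall_clock: str) -> float:
    """Average CPU concurrency implied by the two timers: if CpusUsage is
    meant to approximate RemoteUserCpu / RemoteWallClockTime, it should be
    close to this ratio (ignoring RemoteSysCpu, which the listing omits)."""
    return parse_interval(user_cpu) / parse_interval(wall_clock)

# Job 2139134.4: ratio ~7.96, consistent with the reported CpusUsage of 8.000.
print(round(mean_cpus("13:20:51", "1:40:39"), 2))

# Job 2125923.0 (CPU_TIME truncated in the listing; trailing 0 is a guess):
# ratio ~35.4, which exceeds the 32 provisioned CPUs and is nowhere near
# the reported CpusUsage of 21.26.
print(round(mean_cpus("35+12:05:50", "1+00:03:44"), 2))
```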

JT

$ cat anon-cpususage.cpf
SELECT NOSUMMARY
   ClusterId                      AS  JOB_ID      PRINTAS JOB_ID WIDTH -11
   JobCategory                    AS Class WIDTH 5
   join(" ",split(Cmd,"/")[size(split(Cmd,"/"))-1], Args)  AS '   CMD' WIDTH -18
   CompletionDate                 AS '  Finished ' PRINTAS DATE
   JobCurrentStartDate            AS '   Started' PRINTAS QDATE
   CpusProvisioned                AS 'CPUS'  WIDTH 3
   CpusUsage                      AS 'CPuse' WIDTH 5
   MemoryProvisioned              AS 'MEMREQ' PRINTAS READABLE_MB
   ResidentSetSize                AS '   RAM' PRINTAS READABLE_KB  WIDTH 8
   ImageSize                      AS '   MEM' PRINTAS READABLE_KB  WIDTH 10
   JobStatus                      AS "ST"        PRINTAS JOB_STATUS WIDTH 3
   interval(RemoteUserCpu)        AS  " CPU_TIME"    WIDTH 10
   interval(RemoteWallClockTime)  AS  " WALL_TIME"   WIDTH 10
   JobRunCount                    AS 'NStrt' WIDTH 5
   split(splitSlotName(LastRemoteHost)[1], ".")[0]  AS  "WorkerNode" WIDTH -18
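One thing this format file omits is system time; if CpusUsage accounts for it, a RemoteSysCpu column would make the comparison direct. A possible extra line for the file (untested; assumes RemoteSysCpu is populated in these job ads):

```
   interval(RemoteSysCpu)         AS  " SYS_TIME"    WIDTH 10
```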

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe

Join us in June at Throughput Computing 25: https://urldefense.com/v3/__https://osg-htc.org/htc25__;!!Mak6IKo!OuBkAloF9i3apmBGPoPKuSggTG08ynXSEnvEbNMDIfik8pmPdSQczvB4geTXryWmd4EGP5I91gAW-OlQ0eo$

The archives can be found at: https://www-auth.cs.wisc.edu/lists/htcondor-users/