
[HTCondor-users] Problems with CPU recording



Hi,

See this listing (cpf file attached):

$ condor_history  -completedsince $(date -d "12 hours ago" +"%s") -print-format ~templon/anon-cpususage.cpf -wide:156 -constraint 'Cpususage > 2'
JOB_ID      Class    CMD               Finished     Started CPUS CPuse MEMREQ    RAM      MEM     ST   CPU_TIME   WALL_TIME NStrt WorkerNode
2125923.0   mediu python /data/gravw              6/11 12:05  32 21.26 16.0 GB   9.5 GB     9.5 GB  X  35+12:05:5 1+00:03:44     1 wn-pijl-005
2125922.0   mediu python /data/gravw              6/11 12:05  32 3.444 16.0 GB   9.5 GB     9.5 GB  X  35+00:55:3 1+00:02:39     1 wn-pijl-005
2125920.0   mediu python /data/gravw              6/11 12:05  32 2.001 16.0 GB   9.5 GB     9.5 GB  X  36+05:59:3 1+00:03:10     1 wn-pijl-005
2125919.0   mediu python /data/gravw              6/11 12:05  32 9.021 16.0 GB   9.5 GB     9.5 GB  X  36+02:26:0 1+00:04:07     1 wn-pijl-005
2134378.0   short FST_FST_for_Drunca              6/12 13:45   1 4.061 16.0 GB  14.3 GB    14.3 GB  X    16:07:06    4:00:41     1 wn-pijl-006
2133940.0   mediu FST_test2_692971.s              6/12 13:12   1 4.027 16.0 GB  16.7 GB    16.7 GB  X    21:25:18    5:20:19     1 wn-pijl-007
2140524.0   short run_sin2beta.sh     6/13 02:53  6/13 02:08   4 2.479 16.0 GB   7.2 GB     9.5 GB  C     2:33:04      45:13     1 wn-sate-071
2128723.246 mediu exec_n3fit_exit0.s  6/13 02:36  6/12 13:18   4 3.403 16.0 GB   7.2 GB     7.2 GB  C    14:21:07   13:18:14     1 wn-lot-050
2140372.0   short run_sin2beta.sh     6/13 01:57  6/13 01:42   4 8.551 16.0 GB   7.2 GB     9.5 GB  C     2:59:09      15:11     1 wn-sate-060
2128723.100 mediu exec_n3fit_exit0.s  6/13 01:23  6/12 12:55   4 3.376 16.0 GB   7.2 GB     7.2 GB  C    13:33:17   12:27:46     1 wn-lot-064
2139134.0   short bash ./run.sh       6/13 01:01  6/12 22:08   8 3.114 10.0 GB 293.0 MB   293.0 MB  C     9:30:34    2:52:36     1 wn-sate-062
2139134.1   short bash ./run.sh       6/13 00:41  6/12 22:08   8 3.481 10.0 GB 293.0 MB   293.0 MB  C     8:43:05    2:32:43     1 wn-sate-058
2138903.1   short bash ./run.sh       6/13 00:39  6/12 21:31   5 2.796 16.0 GB 341.8 MB   341.8 MB  C     7:20:52    3:07:57     1 wn-sate-057
2138903.4   short bash ./run.sh       6/13 00:32  6/12 21:31   5 2.270 16.0 GB 341.8 MB   341.8 MB  C     6:52:26    3:00:55     1 wn-sate-072
2135158.0   mediu job.sh              6/13 00:10  6/12 14:34   4 3.600 4.0 GB  732.4 MB   732.4 MB  C    14:54:50    9:35:56     1 wn-sate-063
2128721.110 mediu exec_n3fit_exit0.s  6/13 00:01  6/12 12:21   4 2.770 16.0 GB   7.2 GB     7.2 GB  C    12:24:30   11:40:20     1 wn-lot-051
2139134.4   short bash ./run.sh       6/12 23:49  6/12 22:08   8 8.000 10.0 GB 317.4 MB   317.4 MB  C    13:20:51    1:40:39     1 wn-lot-037
2139134.3   short bash ./run.sh       6/12 23:49  6/12 22:08   8 7.999 10.0 GB 317.4 MB   317.4 MB  C    13:20:01    1:40:36     1 wn-lot-043
2139134.2   short bash ./run.sh       6/12 23:42  6/12 22:08   8 8.000 10.0 GB 317.4 MB   317.4 MB  C    12:22:45    1:33:26     1 wn-lot-040
2128721.202 mediu exec_n3fit_exit0.s  6/12 23:36  6/12 12:33   4 3.800 16.0 GB   7.2 GB     7.2 GB  C    11:52:27   11:03:30     1 wn-lot-004

You can see that CpusProvisioned, CpusUsage, RemoteUserCpu, and RemoteWallClockTime are not consistently related in about half the cases: some jobs report a CpusUsage larger than their CpusProvisioned (e.g. 8.551 on a 4-core slot), and a RemoteUserCpu well above CpusProvisioned × RemoteWallClockTime. I thought this was supposed to be fixed in recent condor versions?  We're running 24.6.0-1.  Any ideas?
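For concreteness, here is the sanity check I'm applying (an assumption on my part about how these attributes should relate): a job confined to its slot should have RemoteUserCpu <= CpusProvisioned * RemoteWallClockTime, and CpusUsage at or below CpusProvisioned. Plugging in the values from job 2140372.0 above:

```python
# Sanity check (my assumption): a job that stays inside its slot should satisfy
#   RemoteUserCpu <= CpusProvisioned * RemoteWallClockTime
# and CpusUsage <= CpusProvisioned.

def to_seconds(interval: str) -> int:
    """Parse condor interval() output like '1+00:03:44' or '45:13' into seconds."""
    days = 0
    if "+" in interval:
        d, interval = interval.split("+")
        days = int(d)
    parts = [int(p) for p in interval.split(":")]
    while len(parts) < 3:          # pad '45:13' out to '0:45:13'
        parts.insert(0, 0)
    h, m, s = parts
    return ((days * 24 + h) * 3600) + m * 60 + s

# Values copied from job 2140372.0 in the listing
cpus       = 4                        # CpusProvisioned
cpu_time   = to_seconds("2:59:09")    # RemoteUserCpu
wall_time  = to_seconds("15:11")      # RemoteWallClockTime
cpus_usage = 8.551                    # CpusUsage as reported

print(cpu_time / wall_time)           # effective cores used: ~11.8 on a 4-core slot
print(cpu_time <= cpus * wall_time)   # expect True; here False -> anomaly
print(cpus_usage <= cpus)             # expect True; here False -> anomaly
```

By that check this job used roughly 11.8 effective cores on a 4-core slot, which is why the reported CpusUsage of 8.551 looks wrong rather than merely noisy.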

JT

$ cat anon-cpususage.cpf
SELECT NOSUMMARY
   ClusterId                      AS  JOB_ID      PRINTAS JOB_ID WIDTH -11
   JobCategory                    AS Class WIDTH 5
   join(" ",split(Cmd,"/")[size(split(Cmd,"/"))-1], Args)  AS '   CMD' WIDTH -18
   CompletionDate                 AS '  Finished ' PRINTAS DATE
   JobCurrentStartDate            AS '   Started' PRINTAS QDATE
   CpusProvisioned                AS 'CPUS'  WIDTH 3
   CpusUsage                      AS 'CPuse' WIDTH 5
   MemoryProvisioned              AS 'MEMREQ' PRINTAS READABLE_MB
   ResidentSetSize                AS '   RAM' PRINTAS READABLE_KB  WIDTH 8
   ImageSize                      AS '   MEM' PRINTAS READABLE_KB  WIDTH 10
   JobStatus                      AS "ST"        PRINTAS JOB_STATUS WIDTH 3
   interval(RemoteUserCpu)        AS  " CPU_TIME"    WIDTH 10
   interval(RemoteWallClockTime)  AS  " WALL_TIME"   WIDTH 10
   JobRunCount                    AS 'NStrt' WIDTH 5
   split(splitSlotName(LastRemoteHost)[1], ".")[0]  AS  "WorkerNode" WIDTH -18