
Re: [HTCondor-users] more information on incorrect RemoteUserCpu and -currentrun



Hi Jeff,

In the eyes of Condor, the current run time of a job is how long it has been in the running state, which the job enters once resources have been claimed and a shadow starts. So the current run time is more specifically how long the shadow has been running, as opposed to a "slot connect duration". This is why you saw a run time that matches the time elapsed since the last reconnect in the job log: the log message beforehand stated that the shadow died.
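As a rough sanity check of this, the elapsed time between the last reconnect event in the quoted job log (2022-12-14 14:38:05) and the condor_q timestamp (12/15/22 15:21:35) can be computed by hand. The sketch below is only illustrative: format_run_time is a hypothetical helper that imitates the days+HH:MM:SS display, and it assumes both timestamps come from the same clock and time zone.

```python
from datetime import datetime


def format_run_time(seconds: int) -> str:
    """Format a duration as days+HH:MM:SS (hypothetical helper, not an HTCondor API)."""
    days, rem = divmod(seconds, 86400)
    hours, rem = divmod(rem, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{days}+{hours:02d}:{minutes:02d}:{secs:02d}"


# Timestamps copied from the quoted message; assumed to share a clock and time zone.
last_reconnect = datetime(2022, 12, 14, 14, 38, 5)   # last event 023 in the job log
queue_snapshot = datetime(2022, 12, 15, 15, 21, 35)  # time printed by condor_q

elapsed = int((queue_snapshot - last_reconnect).total_seconds())
print(format_run_time(elapsed))
```

The result can be compared directly with the RUN_TIME column in the condor_q output quoted below.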

-Cole Bollig

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Jeff Templon <templon@xxxxxxxxx>
Sent: Thursday, December 15, 2022 8:25 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] more information on incorrect RemoteUserCpu and -currentrun
 

And even more information: I noticed that two of the event dates below are from 2021. There was an admin “oops” with a date command, and this is likely responsible for all the slots having disconnected and reconnected. OTOH, it still stands that the run time shown is no longer a run time:

└> grep 49718.318 ferm.49718.log
000 (49718.318.000) 2022-11-24 11:10:44 Job submitted from host: <145.107.7.239:9618?addrs=145.107.7.239-9618+[2a07-8500-120-e070--3ef]-9618&alias=visar.nikhef.nl&noUDP&sock=schedd_1153_70c9>
001 (49718.318.000) 2022-11-24 11:11:33 Job executing on host: <145.107.5.45:9618?addrs=145.107.5.45-9618+[2a07-8500-120-e070--52d]-9618&alias=wn-lot-045.nikhef.nl&noUDP&sock=startd_58987_c10f>
006 (49718.318.000) 2022-11-24 11:11:42 Image size of job updated: 976780
[ … ]
006 (49718.318.000) 2022-12-04 09:56:08 Image size of job updated: 976780
006 (49718.318.000) 2022-12-10 22:02:15 Image size of job updated: 976780
022 (49718.318.000) 2021-10-02 12:58:00 Job disconnected, attempting to reconnect
023 (49718.318.000) 2021-10-02 12:58:01 Job reconnected to slot1_26@xxxxxxxxxxxxxxxxxxxx
022 (49718.318.000) 2022-12-14 14:38:05 Job disconnected, attempting to reconnect
023 (49718.318.000) 2022-12-14 14:38:05 Job reconnected to slot1_26@xxxxxxxxxxxxxxxxxxxx
┌[kiwish-4.2]-(gofact_extendrange_ganymede/log)-[git:master*]-
└> condor_q -allusers -nobatch -currentrun -pr $HOME/an1.cpf 49718.318


-- Schedd: visar.nikhef.nl : <145.107.7.239:9618?... @ 12/15/22 15:21:35
JOB_ID    Username CMD                       CPUS MEMREQ   ST    RUN_TIME    WorkerNode
49718.318 templon  ferm.condor 368           1    128.0 MB R      1+00:43:30 wn-lot-045

Definition of RUN_TIME: RemoteUserCpu AS " RUN_TIME" PRINTAS CPU_TIME

JT