
[HTCondor-users] Unreliable job start count



Hi,

I've seen (in the past few weeks or so) an increased frequency of restarted jobs, or so it seemed.  Restarted jobs were identified by a JobRunCount attribute greater than 1.  Usually (up until recently) when I looked at such jobs, they had indeed been restarted for some reason, in the sense of: start executing once, get evicted, start executing a second time.  These new jobs, OTOH, did not behave that way; when I looked in the job log file (the one a user can easily access, specific to that particular job), there was only one start of execution despite a JobRunCount of e.g. 3.
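
For context, the affected jobs were picked out with roughly a query like the following (written from memory, so take the exact form as an approximation rather than the literal command I ran):

condor_q -constraint 'JobRunCount > 1' -af ClusterId ProcId JobRunCount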

I did some digging today, for one particular job.  From the Shadow Log:

10/07/25 08:23:59 (3084882.79) (996371): Job 3084882.79 going into Hold state (code 26,0): job exceed MaxWallTime of 345600
10/07/25 08:23:59 (3084882.79) (996371): Job 3084882.79 going into Hold state (code 26,0): job exceed MaxWallTime of 345600
10/07/25 08:23:59 (3084882.79) (996371): Job 3084882.79 going into Hold state (code 26,0): job exceed MaxWallTime of 345600
10/07/25 08:23:59 (3084882.79) (996371): Job 3084882.79 is being evicted from slot1_27@xxxxxxxxxxxxxxxxxxxxx
10/07/25 08:23:59 (3084882.79) (996371): logEvictEvent with unknown reason (112), not logging.
10/07/25 08:23:59 (3084882.79) (996371): **** condor_shadow (condor_SHADOW) pid 996371 EXITING WITH STATUS 112
[ above from the previous job in that slot ]
10/07/25 08:23:59 Initializing a VANILLA shadow for job 3084884.125
10/07/25 08:23:59 (3084884.125) (4166293): LIMIT_DIRECTORY_ACCESS = <unset>
10/07/25 08:23:59 (3084884.125) (4166293): Request to run on slot1_27@xxxxxxxxxxxxxxxxxxxxx <145.107.4.148:9618?addrs=145.107.4.148-9618+[2a07-8500-120-e070--94]-9618&alias=wn-knek-005.nikhef.nl&noUDP&sock=startd_26619_ff97> was REFUSED
10/07/25 08:23:59 (3084884.125) (4166293): Job 3084884.125 is being evicted from slot1_27@xxxxxxxxxxxxxxxxxxxxx
10/07/25 08:23:59 (3084884.125) (4166293): RemoteResource::killStarter(): Could not send command to startd, will retry in 5 seconds
10/07/25 08:24:04 (3084884.125) (4166293): RemoteResource::killStarter(): Could not send command to startd, will retry in 5 seconds

and from the refusing machine, StarterLog.slot1_27

10/03/25 08:23:16 (pid:3650505) Create_Process succeeded, pid=3650563
10/07/25 08:23:58 (pid:3650505) Got vacate code=1009 subcode=0 reason=Claim deactivated
10/07/25 08:23:58 (pid:3650505) ShutdownGraceful all jobs.
10/07/25 08:23:58 (pid:3650505) Process exited, pid=3650563, status=143
10/07/25 08:23:58 (pid:3650505) All jobs have exited... starter exiting
10/07/25 08:23:58 (pid:3650505) **** condor_starter (condor_STARTER) pid 3650505 EXITING WITH STATUS 0


The timing seems rather suspicious: the starter for slot1_27 exits at 08:23:58, this is registered at the Schedd at 08:23:59, and the REFUSED message comes with that same timestamp - maybe the old starter isn't dead enough yet for the slot to accept another job?  This error is relatively new; it looks like it started in September, see the grep counts below.  The fluctuation per log file after September 1st is probably just the day-to-day fluctuation in the number of jobs submitted to / run on the cluster.  Before September there were only three occurrences in 20 days; the September 1st log file already had 33 occurrences in three days.
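
In case anyone wants to check the same thing on their side, a grep along these lines should pull out the two events (log file locations depend on the local LOG setting, see condor_config_val LOG on each machine):

# on the schedd host: the refusal as seen by the new shadow
grep REFUSED ShadowLog | grep 3084884.125
# on the worker node: when the previous starter for that slot went away
grep 'EXITING WITH STATUS' StarterLog.slot1_27 | tail -1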

Any ideas?

JT


grep -c REFUSED ShadowLog*
ShadowLog.20250812T163652:0
ShadowLog.20250819T104421:0
ShadowLog.20250825T190854:0
ShadowLog.20250827T163018:0
ShadowLog.20250828T111736:2
ShadowLog.20250829T110644:1
ShadowLog.20250830T083206:0
ShadowLog.20250831T061223:0
ShadowLog.20250901T011038:33
ShadowLog.20250904T215407:1
ShadowLog.20250906T162542:189
ShadowLog.20250908T114622:55
ShadowLog.20250909T215242:482
ShadowLog.20250911T053810:21
ShadowLog.20250915T100540:2
ShadowLog.20250916T182031:190
ShadowLog.20250917T235320:67
ShadowLog.20250918T213208:189
ShadowLog.20250919T104425:1
ShadowLog.20250919T220430:103
ShadowLog.20250920T062439:11
ShadowLog.20250922T113733:85
ShadowLog.20250923T132000:53
ShadowLog.20250926T223913:180
ShadowLog.20250929T105312:178
ShadowLog.20251002T150214:216
ShadowLog.20251005T153724:225
ShadowLog.20251008T114915:101