Hi,

I've seen (in the past few weeks or so) an increased frequency of restarted jobs, or at least so it seemed. Restarted jobs were identified by a JobRunCount attribute greater than 1 (a rough sketch of the query is at the end of this message). Usually (up until recently) when I looked at such jobs, they had indeed been restarted for some reason, in the sense of: start executing once, get evicted, start executing a second time. These new jobs OTOH did not behave that way; when I looked in the job log file (the one a user can easily access, specific to that particular job) there is only one start of execution despite a JobRunCount of e.g. 3.

I did some digging today, for one particular job. From the Shadow Log:

10/07/25 08:23:59 (3084882.79) (996371): Job 3084882.79 going into Hold state (code 26,0): job exceed MaxWallTime of 345600
10/07/25 08:23:59 (3084882.79) (996371): Job 3084882.79 going into Hold state (code 26,0): job exceed MaxWallTime of 345600
10/07/25 08:23:59 (3084882.79) (996371): Job 3084882.79 going into Hold state (code 26,0): job exceed MaxWallTime of 345600
10/07/25 08:23:59 (3084882.79) (996371): Job 3084882.79 is being evicted from slot1_27@xxxxxxxxxxxxxxxxxxxxx
10/07/25 08:23:59 (3084882.79) (996371): logEvictEvent with unknown reason (112), not logging.
10/07/25 08:23:59 (3084882.79) (996371): **** condor_shadow (condor_SHADOW) pid 996371 EXITING WITH STATUS 112

[ above from the previous job in that slot ]

10/07/25 08:23:59 Initializing a VANILLA shadow for job 3084884.125
10/07/25 08:23:59 (3084884.125) (4166293): LIMIT_DIRECTORY_ACCESS = <unset>
10/07/25 08:23:59 (3084884.125) (4166293): Request to run on slot1_27@xxxxxxxxxxxxxxxxxxxxx <145.107.4.148:9618?addrs=145.107.4.148-9618+[2a07-8500-120-e070--94]-9618&alias=wn-knek-005.nikhef.nl&noUDP&sock=startd_26619_ff97> was REFUSED
10/07/25 08:23:59 (3084884.125) (4166293): Job 3084884.125 is being evicted from slot1_27@xxxxxxxxxxxxxxxxxxxxx
10/07/25 08:23:59 (3084884.125) (4166293): RemoteResource::killStarter(): Could not send command to startd, will retry in 5 seconds
10/07/25 08:24:04 (3084884.125) (4166293): RemoteResource::killStarter(): Could not send command to startd, will retry in 5 seconds

and from the refusing machine, StarterLog.slot1_27:

10/03/25 08:23:16 (pid:3650505) Create_Process succeeded, pid=3650563
10/07/25 08:23:58 (pid:3650505) Got vacate code=1009 subcode=0 reason=Claim deactivated
10/07/25 08:23:58 (pid:3650505) ShutdownGraceful all jobs.
10/07/25 08:23:58 (pid:3650505) Process exited, pid=3650563, status=143
10/07/25 08:23:58 (pid:3650505) All jobs have exited... starter exiting
10/07/25 08:23:58 (pid:3650505) **** condor_starter (condor_STARTER) pid 3650505 EXITING WITH STATUS 0

The timing seems rather suspicious: the starter for slot1_27 exits at 08:23:58, which is registered at the Schedd at 08:23:59, and the REFUSED message also comes at that same timestamp. Maybe the starter isn't dead enough yet to accept another job?

This error is relatively new; it looks like it started in September, see the grep counts below. The fluctuation per log file after September 1st is probably just the day-to-day fluctuation in the number of jobs being submitted / run on the cluster. Before September there were only three occurrences in 20 days; for the September 1st log file, there were already 33 occurrences in three days.

Any ideas?
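For reference, this is roughly how I pulled out the suspicious jobs. JobRunCount is the attribute I actually checked; NumJobStarts is quoted from memory as a cross-check on the number of actual executions, so treat the exact attribute list as an assumption and adjust for your setup:

# list queued/held jobs that claim more than one run attempt,
# together with how often they actually started executing
condor_q -constraint 'JobRunCount > 1' -af:j JobRunCount NumJobStarts JobStatus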
JT

grep -c REFUSED ShadowLog*
ShadowLog.20250812T163652:0
ShadowLog.20250819T104421:0
ShadowLog.20250825T190854:0
ShadowLog.20250827T163018:0
ShadowLog.20250828T111736:2
ShadowLog.20250829T110644:1
ShadowLog.20250830T083206:0
ShadowLog.20250831T061223:0
ShadowLog.20250901T011038:33
ShadowLog.20250904T215407:1
ShadowLog.20250906T162542:189
ShadowLog.20250908T114622:55
ShadowLog.20250909T215242:482
ShadowLog.20250911T053810:21
ShadowLog.20250915T100540:2
ShadowLog.20250916T182031:190
ShadowLog.20250917T235320:67
ShadowLog.20250918T213208:189
ShadowLog.20250919T104425:1
ShadowLog.20250919T220430:103
ShadowLog.20250920T062439:11
ShadowLog.20250922T113733:85
ShadowLog.20250923T132000:53
ShadowLog.20250926T223913:180
ShadowLog.20250929T105312:178
ShadowLog.20251002T150214:216
ShadowLog.20251005T153724:225
ShadowLog.20251008T114915:101
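P.S. In case it helps, this is the kind of one-liner I'd use next to check whether the refusals pile up on particular slots/startds (the field number assumes the ShadowLog line layout shown above, so adjust as needed):

# count REFUSED events per slot@host across all ShadowLog files
grep -h 'was REFUSED' ShadowLog* | awk '{print $9}' | sort | uniq -c | sort -rn | head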