
[HTCondor-users] Unreliable job start count



Hi,

I've seen (in the past few weeks or so) an increased frequency of restarted jobs, or so it seemed.  Restarted jobs were identified by a JobRunCount attribute greater than 1.  Usually (up until recently) when I looked at such jobs, they had indeed been restarted for some reason, in the sense of: start executing once, get evicted, start executing a second time.  These new jobs, OTOH, did not behave that way; when I looked in the job log file (the one a user can easily access, specific to that particular job), there was only one start of execution despite a JobRunCount of e.g. 3.
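
For context, the affected jobs were picked out with roughly a query like the following (written from memory, so take the exact form as an approximation rather than the literal command I ran):

condor_q -constraint 'JobRunCount > 1' -af ClusterId ProcId JobRunCount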

I did some digging today, for one particular job.  From the Shadow Log:

10/07/25 08:23:59 (3084882.79) (996371): Job 3084882.79 going into Hold state (code 26,0): job exceed MaxWallTime of 345600
10/07/25 08:23:59 (3084882.79) (996371): Job 3084882.79 going into Hold state (code 26,0): job exceed MaxWallTime of 345600
10/07/25 08:23:59 (3084882.79) (996371): Job 3084882.79 going into Hold state (code 26,0): job exceed MaxWallTime of 345600
10/07/25 08:23:59 (3084882.79) (996371): Job 3084882.79 is being evicted from slot1_27@xxxxxxxxxxxxxxxxxxxxx
10/07/25 08:23:59 (3084882.79) (996371): logEvictEvent with unknown reason (112), not logging.
10/07/25 08:23:59 (3084882.79) (996371): **** condor_shadow (condor_SHADOW) pid 996371 EXITING WITH STATUS 112
[ above from the previous job in that slot ]
10/07/25 08:23:59 Initializing a VANILLA shadow for job 3084884.125
10/07/25 08:23:59 (3084884.125) (4166293): LIMIT_DIRECTORY_ACCESS = <unset>
10/07/25 08:23:59 (3084884.125) (4166293): Request to run on slot1_27@xxxxxxxxxxxxxxxxxxxxx <145.107.4.148:9618?addrs=145.107.4.148-9618+[2a07-8500-120-e070--94]-9618&alias=wn-knek-005.nikhef.nl&noUDP&sock=startd_26619_ff97> was REFUSED
10/07/25 08:23:59 (3084884.125) (4166293): Job 3084884.125 is being evicted from slot1_27@xxxxxxxxxxxxxxxxxxxxx
10/07/25 08:23:59 (3084884.125) (4166293): RemoteResource::killStarter(): Could not send command to startd, will retry in 5 seconds
10/07/25 08:24:04 (3084884.125) (4166293): RemoteResource::killStarter(): Could not send command to startd, will retry in 5 seconds

and from the refusing machine, StarterLog.slot1_27

10/03/25 08:23:16 (pid:3650505) Create_Process succeeded, pid=3650563
10/07/25 08:23:58 (pid:3650505) Got vacate code=1009 subcode=0 reason=Claim deactivated
10/07/25 08:23:58 (pid:3650505) ShutdownGraceful all jobs.
10/07/25 08:23:58 (pid:3650505) Process exited, pid=3650563, status=143
10/07/25 08:23:58 (pid:3650505) All jobs have exited... starter exiting
10/07/25 08:23:58 (pid:3650505) **** condor_starter (condor_STARTER) pid 3650505 EXITING WITH STATUS 0


The timing seems rather suspicious: the starter for slot1_27 exits at 08:23:58, this is registered at the Schedd at 08:23:59, and the REFUSED message comes with that same timestamp - maybe the old starter isn't dead enough yet for the slot to accept another job?  This error is relatively new; it looks like it started in September, see the grep counts below.  The fluctuation per log file after September 1st is probably just the day-to-day fluctuation in the number of jobs submitted to / run on the cluster.  Before September there were only three occurrences in 20 days; the September 1st log file already had 33 occurrences in three days.
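
In case anyone wants to check the same thing on their side, a grep along these lines should pull out the two events (log file locations depend on the local LOG setting, see condor_config_val LOG on each machine):

# on the schedd host: the refusal as seen by the new shadow
grep REFUSED ShadowLog | grep 3084884.125
# on the worker node: when the previous starter for that slot went away
grep 'EXITING WITH STATUS' StarterLog.slot1_27 | tail -1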

Any ideas?

JT


grep -c REFUSED ShadowLog*
ShadowLog.20250812T163652:0
ShadowLog.20250819T104421:0
ShadowLog.20250825T190854:0
ShadowLog.20250827T163018:0
ShadowLog.20250828T111736:2
ShadowLog.20250829T110644:1
ShadowLog.20250830T083206:0
ShadowLog.20250831T061223:0
ShadowLog.20250901T011038:33
ShadowLog.20250904T215407:1
ShadowLog.20250906T162542:189
ShadowLog.20250908T114622:55
ShadowLog.20250909T215242:482
ShadowLog.20250911T053810:21
ShadowLog.20250915T100540:2
ShadowLog.20250916T182031:190
ShadowLog.20250917T235320:67
ShadowLog.20250918T213208:189
ShadowLog.20250919T104425:1
ShadowLog.20250919T220430:103
ShadowLog.20250920T062439:11
ShadowLog.20250922T113733:85
ShadowLog.20250923T132000:53
ShadowLog.20250926T223913:180
ShadowLog.20250929T105312:178
ShadowLog.20251002T150214:216
ShadowLog.20251005T153724:225
ShadowLog.20251008T114915:101