Hi all,

From time to time, some computing nodes that execute jobs stay stuck in the Claimed state with Busy activity, even though the LoadAv is 0.000 and the job has completed successfully.

I have noticed in the StartLog that this line does not appear: “slot1: Called deactivate_claim_forcibly()”

The last line written in the log is:

07/31/18 10:55:28 slot1: Changing activity: Idle -> Busy

The job finished at 11:43:37, as written in the
StarterLog.slot1:

07/31/18 10:55:30 (pid:5876) Create_Process succeeded, pid=4848
07/31/18 11:43:37 (pid:5876) Process exited, pid=4848, status=0

Compared with other jobs (such as the previous successful one), these lines are missing:

07/31/18 10:54:04 (pid:5308) Got SIGQUIT. Performing fast shutdown.
07/31/18 10:54:04 (pid:5308) ShutdownFast all jobs.
07/31/18 10:54:04 (pid:5308) SharedPortEndpoint: Destructor: Problem in thread shutdown notification: 0
07/31/18 10:54:04 (pid:5308) **** condor_starter (condor_STARTER) pid 5308 EXITING WITH STATUS 0

In the MasterLog, the following error appears approximately 15 minutes after the job completed:

07/31/18 11:58:21 ERROR: Child pid 5068 appears hung! Killing it hard.
07/31/18 11:58:21 DefaultReaper unexpectedly called on pid 5068, status 0.
07/31/18 11:58:21 The SHARED_PORT (pid 5068) was killed because it was no longer responding
07/31/18 11:58:21 restarting C:\PROGRA~2\condor\bin\condor_shared_port.exe in 10 seconds
07/31/18 11:58:31 Collector port not defined, will use default: 9618
07/31/18 11:58:31 Started DaemonCore process "C:\PROGRA~2\condor\bin\condor_shared_port.exe", pid and pgroup = 5396

Does anyone have an idea why these computing nodes stay stuck in Claimed/Busy? For now, we have to restart the computing node to get it running again…

Cheers and thanks!
Florian Gandor