I set job_lease_duration = 30 in my job submit file so that jobs are rescheduled sooner when a compute node fails. However, this causes long-running jobs to be killed. I am using Condor 7.8.5 (x64) on CentOS 6.3. The relevant log files are below:
In StartLog (only the entries for slot3 are shown):
02/24/14 10:29:46 slot3: State change: claiming protocol successful
02/24/14 10:29:46 slot3: Changing state: Unclaimed -> Claimed
02/24/14 10:29:46 slot3: Got activate_claim request from shadow (10.1.1.1)
02/24/14 10:29:46 slot3: Error evaluating machine rank expression: None
02/24/14 10:29:46 slot3: Setting RANK to 0.0
02/24/14 10:29:46 slot3: Remote job ID is 496.0
02/24/14 10:29:46 slot3: Got universe "VANILLA" (5) from request classad
02/24/14 10:29:46 slot3: State change: claim-activation protocol successful
02/24/14 10:29:46 slot3: Changing activity: Idle -> Busy
02/24/14 10:44:51 slot3: State change: claim no longer recognized by the schedd - removing claim
02/24/14 10:44:51 slot3: Changing state and activity: Claimed/Busy -> Preempting/Killing
02/24/14 10:45:21 slot3: starter (pid 28021) is not responding to the request to hardkill its job. The startd will now directly hard kill the starter and all its decendents.
02/24/14 10:45:21 Starter pid 28021 died on signal 9 (signal 9 (Killed))
02/24/14 10:45:21 slot3: State change: starter exited
02/24/14 10:45:21 slot3: State change: No preempting claim, returning to owner
02/24/14 10:45:21 slot3: Changing state and activity: Preempting/Killing -> Owner/Idle
02/24/14 10:45:21 slot3: State change: IS_OWNER is false
02/24/14 10:45:21 slot3: Changing state: Owner -> Unclaimed
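To make the timing in the StartLog excerpt above easier to see, here is a small Python sketch that computes the gaps between three of the logged events. The timestamps are copied from the log. The helper name seconds_between is just an illustration and is not part of Condor.

```python
from datetime import datetime

# Timestamp layout used in the Condor log lines above (MM/DD/YY HH:MM:SS).
FMT = "%m/%d/%y %H:%M:%S"


def seconds_between(start: str, end: str) -> int:
    """Return the whole seconds elapsed between two log timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds())


# Values copied from the StartLog excerpt for slot3.
activated = "02/24/14 10:29:46"      # claim-activation protocol successful
claim_removed = "02/24/14 10:44:51"  # claim no longer recognized by the schedd
hard_killed = "02/24/14 10:45:21"    # startd hard-kills the starter

run_time = seconds_between(activated, claim_removed)
kill_delay = seconds_between(claim_removed, hard_killed)

print(f"job ran for {run_time} s before the claim was removed")
print(f"starter was hard-killed {kill_delay} s after that")
```

So the job ran for 905 seconds (about 15 minutes) before the claim was removed, and the starter was hard-killed 30 seconds later.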
In StarterLog.Slot3:
02/24/14 10:29:47 Job 496.0 set to execute immediately
02/24/14 10:29:47 Starting a VANILLA universe job with ID: 496.0
......
02/24/14 10:29:47 About to exec /usr/bin/csfexec
02/24/14 10:29:47 Running job as user root
02/24/14 10:29:47 Create_Process succeeded, pid=28029
02/24/14 10:44:51 Got SIGQUIT. Performing fast shutdown.
02/24/14 10:44:51 ShutdownFast all jobs.
02/24/14 10:44:52 Process exited, pid=28029, signal=9
02/24/14 10:44:52 condor_write(): Socket closed when trying to write 381 bytes to <10.1.1.1:42920>, fd is 9
02/24/14 10:44:52 Buf::write(): condor_write() failed
02/24/14 10:44:52 condor_write(): Socket closed when trying to write 91 bytes to <10.1.1.1:42920>, fd is 9
02/24/14 10:44:52 Buf::write(): condor_write() failed
02/24/14 10:44:52 Failed to send job exit status to shadow
02/24/14 10:44:52 JobExit() failed, waiting for job lease to expire or for a reconnect attempt
02/24/14 10:44:52 Returning from CStarter::JobReaper()
I think the cause is this line in StartLog: 'State change: claim no longer recognized by the schedd - removing claim'. If I remove the job_lease_duration = 30 setting from the job submit file, the job is not killed.
Why does this setting cause this? How can I keep long-running jobs from being killed?
Thanks in advance!