Hi, I set configuration values such as UPDATE_INTERVAL = 15 so that the central manager (CM) detects a node failure quickly, and sure enough the node disappears from condor_status after a few seconds. But the job stays: condor_q still shows it in the running state, and condor_q -better-analyze still reports the executing node as the failed one, even though that node is already gone from condor_status. I left everything running in the background, and only after about two hours was the job finally restarted on another node. After a bit of research, I found this thread from 2009 describing the same behavior, with an answer saying it would soon be improved using MAX_CLAIM_ALIVES_MISSED and ALIVE_INTERVAL: https://groups.google.com/g/condor-computing/c/Sxag4qbtfsg
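For reference, here is a sketch of what I changed in condor_config (only UPDATE_INTERVAL is something I definitely set; the CLASSAD_LIFETIME line is my understanding of the collector-side knob that ages out stale ads, with its default of 900 seconds, so please correct me if that part is wrong):

```
# On the execute nodes: how often the startd re-advertises
# itself to the collector, so a dead node's ad goes stale fast.
UPDATE_INTERVAL = 15

# On the collector (if I understand the docs correctly): how long
# an ad is kept before being considered stale; default is 900s.
# CLASSAD_LIFETIME = 60
```

With this, condor_status reflects the failure within seconds, but as described above the schedd does not seem to use that information for the running job.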
I looked into these macros, but it appears they only make the schedd send alive messages to the startd, which stops the running job if it does not receive them. I am looking for the opposite direction… Then there is also STARTD_SENDS_ALIVES, which looks like it does what I want, but it is deprecated. How can I make the recovery of jobs on failed nodes faster? Thanks, Gaëtan