Hi, I set configuration values such as UPDATE_INTERVAL = 15 so that the central manager (CM) detects a node failure quickly, and sure enough the node disappears from condor_status after a few seconds. But the job stays: condor_q still shows it in the running state, and condor_q -better-analyze still reports the executing node as the failed one, even though that node is already gone from condor_status. I left everything running in the background, and only after about two hours was the job finally restarted on another node. After a bit of research, I found this thread from 2009 describing the same behavior, with an answer saying it would soon be improved using MAX_CLAIM_ALIVES_MISSED and ALIVE_INTERVAL: https://groups.google.com/g/condor-computing/c/Sxag4qbtfsg
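For reference, here is a sketch of what I changed in condor_config (only UPDATE_INTERVAL is something I definitely set; the CLASSAD_LIFETIME line is my understanding of the collector-side knob that ages out stale ads, with its default of 900 seconds, so please correct me if that part is wrong):

```
# On the execute nodes: how often the startd re-advertises
# itself to the collector, so a dead node's ad goes stale fast.
UPDATE_INTERVAL = 15

# On the collector (if I understand the docs correctly): how long
# an ad is kept before being considered stale; default is 900s.
# CLASSAD_LIFETIME = 60
```

With this, condor_status reflects the failure within seconds, but as described above the schedd does not seem to use that information for the running job.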
I looked into these macros, but it appears they only make the schedd send alive messages to the startd, which stops the running job if it does not receive them. I am looking for the opposite direction… Then there is also STARTD_SENDS_ALIVES, which looks like it does what I want, but it is deprecated. How can I make the recovery of jobs on failed nodes faster? Thanks, Gaëtan