
[HTCondor-users] Dead scheduler node, how to safely revive?

Date: Tue, 14 Aug 2018 12:01:39 +0200
From: Steffen Grunewald <steffen.grunewald@xxxxxxxxxx>
Subject: [HTCondor-users] Dead scheduler node, how to safely revive?

Good morning,

Two weeks ago, while I was on vacation, one of our scheduler nodes died
horribly, but it can probably be repaired.
I presume that all jobs that had been submitted are still known to the
schedd and would therefore be restarted as soon as the machine
comes up again. In the meantime, however, users may have submitted identical
copies from another scheduler node, and the old copies would overwrite
their output data once they started running.
Is there a simple way to prevent this from happening?
(Learning which jobs were still in the queue would require firing up the
schedd, which would start a fresh negotiation for all of them. Catch-22?)
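
One way out of the catch might be to read the schedd's on-disk queue log
without starting the daemon. The following is only a minimal sketch, not a
tested recovery procedure. It assumes the queue is kept as a plain-text
transaction log (job_queue.log in the spool directory) whose lines begin with
a numeric opcode (101 new ad, 102 destroy ad, 103 set attribute, 104 delete
attribute), that per-cluster ads use keys ending in ".-1", and that "0.0" is a
header ad. That layout should be checked against the installed HTCondor
version. The function names are hypothetical.

```python
# Sketch: replay a COPY of job_queue.log offline to see which jobs are queued.
# The opcode numbers and key layout below are assumptions about the log format.
NEW_AD, DESTROY_AD, SET_ATTR, DELETE_ATTR = "101", "102", "103", "104"


def replay_job_queue(lines):
    """Replay log lines; return {ad_key: {attribute: raw_value}} for surviving ads."""
    ads = {}
    for line in lines:
        # Split into at most 4 fields so attribute values keep their spaces.
        parts = line.rstrip("\n").split(" ", 3)
        op = parts[0]
        if op == NEW_AD and len(parts) >= 2:
            ads[parts[1]] = {}
        elif op == DESTROY_AD and len(parts) >= 2:
            ads.pop(parts[1], None)
        elif op == SET_ATTR and len(parts) == 4:
            ads.setdefault(parts[1], {})[parts[2]] = parts[3]
        elif op == DELETE_ATTR and len(parts) >= 3:
            ads.get(parts[1], {}).pop(parts[2], None)
        # Any other opcode (transaction markers, header records) is ignored.
    return ads


def queued_jobs(ads):
    """Return sorted (cluster, proc, owner, cmd) tuples for per-job ads."""
    # Per-cluster ads (assumed key "<cluster>.-1") may hold shared attributes.
    cluster_ads = {}
    for key, ad in ads.items():
        cluster, _, proc = key.partition(".")
        if proc == "-1":
            cluster_ads[int(cluster)] = ad

    jobs = []
    for key, ad in ads.items():
        cluster, _, proc = key.partition(".")
        if key == "0.0" or proc.startswith("-"):
            continue  # skip the assumed header ad and the per-cluster ads
        shared = cluster_ads.get(int(cluster), {})
        owner = ad.get("Owner", shared.get("Owner"))
        cmd = ad.get("Cmd", shared.get("Cmd"))
        jobs.append((int(cluster), int(proc), owner, cmd))
    return sorted(jobs, key=lambda job: job[:2])


def jobs_from_file(path):
    """Convenience wrapper: parse a copied log file at the given path."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return queued_jobs(replay_job_queue(handle))
```

If the assumptions hold, calling jobs_from_file() on a copy of the log (the
spool directory is reported by "condor_config_val SPOOL") would list the
queued jobs while the schedd stays down. Attribute values are returned raw,
so strings still carry their surrounding quotes.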

Any suggestion is welcome.

Thanks,
 Steffen

-- 
Steffen Grunewald, Cluster Administrator
Max Planck Institute for Gravitational Physics (Albert Einstein Institute)
Am Mühlenberg 1 * D-14476 Potsdam-Golm * Germany
~~~
Fon: +49-331-567 7274
Mail: steffen.grunewald(at)aei.mpg.de
~~~

Follow-Ups:
- Re: [HTCondor-users] Dead scheduler node, how to safely revive?
  - From: Greg Thain
