[HTCondor-users] Dead scheduler node, how to safely revive?
- Date: Tue, 14 Aug 2018 12:01:39 +0200
- From: Steffen Grunewald <steffen.grunewald@xxxxxxxxxx>
- Subject: [HTCondor-users] Dead scheduler node, how to safely revive?
Good morning,
Two weeks ago, while I was on vacation, one of our scheduler nodes died
horribly - but it can probably be repaired.
I presume that all jobs that had been submitted are still known to the
schedd, and therefore would likely be restarted as soon as the machine
comes up again - but users may in the meantime have submitted identical
copies from another scheduler node, and the old copies would overwrite
their output data once they start running.
Is there a simple way to prevent this from happening?
(Learning which jobs were still in the queue would require firing up the
schedd, which would start a fresh negotiation for all of them. Catch-22?)
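
One possible way around the Catch-22 would be to inspect a copy of the schedd's queue file offline, without starting any daemon. The following is only a rough sketch: it assumes the queue is kept in a file such as job_queue.log in the spool directory, and that it is a line-oriented log where 101 creates an ad, 102 destroys one, 103 sets an attribute and 104 deletes one. None of that is confirmed in this thread, and the function names are made up for illustration.

```python
import sys


def replay_job_queue_log(lines):
    """Replay log lines and return {ad_key: {attribute: raw_value}}.

    Assumed opcodes (not confirmed here): 101 new ad, 102 destroy ad,
    103 set attribute, 104 delete attribute. Other lines are ignored.
    """
    ads = {}
    for raw in lines:
        parts = raw.rstrip("\n").split(" ", 3)
        op = parts[0]
        if op == "101" and len(parts) >= 2:
            ads[parts[1]] = {}
        elif op == "102" and len(parts) >= 2:
            ads.pop(parts[1], None)
        elif op == "103" and len(parts) >= 4:
            ads.setdefault(parts[1], {})[parts[2]] = parts[3]
        elif op == "104" and len(parts) >= 3:
            ads.get(parts[1], {}).pop(parts[2], None)
    return ads


def queued_jobs(ads):
    """Return sorted (job_id, owner, cmd) tuples for ads that look like jobs.

    Assumption: real jobs have keys of the form "cluster.proc" with a
    non-negative proc, and "0.0" is a header ad rather than a job.
    """
    jobs = []
    for key, ad in ads.items():
        _, _, proc = key.partition(".")
        if key == "0.0" or not proc.isdigit():
            continue
        owner = ad.get("Owner", "").strip('"')
        cmd = ad.get("Cmd", "").strip('"')
        jobs.append((key, owner, cmd))
    return sorted(jobs)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Pass the path of a COPY of the queue file, never the live one.
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        for job_id, owner, cmd in queued_jobs(replay_job_queue_log(handle)):
            print(job_id, owner, cmd)
```

If the assumptions hold, the resulting list could be compared against what users resubmitted elsewhere before the schedd is ever started. Attributes that the schedd stores once per cluster rather than per job would show up empty in this sketch.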
Any suggestion is welcome.
Thanks,
Steffen
--
Steffen Grunewald, Cluster Administrator
Max Planck Institute for Gravitational Physics (Albert Einstein Institute)
Am Mühlenberg 1 * D-14476 Potsdam-Golm * Germany
~~~
Fon: +49-331-567 7274
Mail: steffen.grunewald(at)aei.mpg.de
~~~