
Re: [HTCondor-users] Dead scheduler node, how to safely revive?

Date: Tue, 14 Aug 2018 10:22:46 -0500
From: Greg Thain <gthain@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Dead scheduler node, how to safely revive?

On 08/14/2018 05:01 AM, Steffen Grunewald wrote:

> Good morning,
>
> two weeks ago, while I was on vacation, one of our scheduler nodes died
> horribly - but can probably be repaired.
> I presume that all jobs that had been submitted are still known to the
> schedd, and therefore would likely be restarted as soon as the machine
> comes up again - but users may in the meantime have submitted identical
> copies from another scheduler node, and the old copies would overwrite
> their output data once they start running.
> Is there a simple way to prevent this from happening?
> (To learn which jobs were still in the queue would require firing up the
> schedd, which would start a fresh negotiation for all of them. Catch 22?)

You could set MAX_JOBS_RUNNING = 0 on the schedd node before restarting,
and the schedd will not start any jobs. You can then condor_q and
condor_rm them at will.
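
As a rough sketch of that sequence, assuming a Linux install that reads
local config fragments from a directory such as /etc/condor/config.d
(check `condor_config_val LOCAL_CONFIG_DIR` on your node). The helper
name and the file name 99-no-job-starts.conf are made up for
illustration:

```shell
# Hypothetical helper: drop a config fragment that keeps the schedd
# from starting any jobs. $1 is the local config directory (assumed).
write_hold_config() {
    printf 'MAX_JOBS_RUNNING = 0\n' > "$1/99-no-job-starts.conf"
}

# On the repaired node, BEFORE HTCondor is started:
#   write_hold_config /etc/condor/config.d
# Then start HTCondor as usual and inspect the queue:
#   condor_q -allusers           # list jobs still known to the schedd
#   condor_rm <cluster>.<proc>   # remove a stale job
#   condor_rm -all               # or remove everything in the queue
```

Once the stale jobs are gone, deleting the fragment and running
condor_reconfig (or restarting the schedd) should let jobs start again.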

If you know you want to remove all the jobs, they are stored in the
job_queue.log.* files in the SPOOL directory. Removing those files is an
extreme way to remove all trace of those jobs from the schedd.
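
If you go that route, a more cautious variant is to move the files aside
rather than delete them, so the queue can be put back if needed. A
minimal sketch, to be run only while the schedd is stopped. The function
name and the backup path are hypothetical, and the glob used here also
matches a plain job_queue.log:

```shell
# Hypothetical helper: move the schedd's job queue files out of SPOOL.
# $1 = SPOOL directory, $2 = backup directory (created if missing).
stash_job_queue() {
    spool_dir=$1
    backup_dir=$2
    mkdir -p "$backup_dir" || return 1
    for f in "$spool_dir"/job_queue.log*; do
        [ -e "$f" ] || continue          # glob matched nothing
        mv -- "$f" "$backup_dir"/ || return 1
    done
}

# Example, with the schedd stopped (backup path is an assumption):
#   stash_job_queue "$(condor_config_val SPOOL)" /root/job_queue.backup
```

Other files in SPOOL are left untouched by this sketch.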


-greg
