Mailing List Archives
Re: [HTCondor-users] Removing a node ungracefully from the master.
- Date: Wed, 20 Feb 2013 18:46:21 -0600
- From: Nathan Panike <nwp@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] Removing a node ungracefully from the master.
condor_advertise INVALIDATE_STARTD_ADS ...
The condor_advertise man page covers this pretty well.
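
As a rough sketch of what that can look like: condor_advertise takes an update command and a file containing a ClassAd, and for an invalidation the file holds a query ad selecting the ads to drop. The host name node42.example.com and the ad file name below are hypothetical placeholders, and matching on the Machine attribute (rather than Name, which would select a single slot) is an assumption about how the pool names its slots. Check the man page before relying on it. Note that this only changes what the collector advertises.

```shell
#!/bin/sh
# Hypothetical host name of the node that lost power; replace with the real one.
DEAD_NODE="node42.example.com"
# Hypothetical file name for the query ad.
AD_FILE="invalidate_${DEAD_NODE}.ad"

# Query ad selecting the startd ads to invalidate.
# Assumption: every slot on the dead node carries Machine == "$DEAD_NODE".
cat > "$AD_FILE" <<EOF
MyType = "Query"
TargetType = "Machine"
Requirements = Machine == "$DEAD_NODE"
EOF

# Send the invalidation to the collector only if the HTCondor tools are present.
if command -v condor_advertise >/dev/null 2>&1; then
    condor_advertise INVALIDATE_STARTD_ADS "$AD_FILE"
else
    echo "condor_advertise not found; query ad written to $AD_FILE"
fi
```

Afterwards, condor_status should show whether the node's slots have left the pool listing.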
Nathan Panike
On Wed, Feb 20, 2013 at 03:46:23PM -0800, Matt Wallington wrote:
> I am having an issue where, when a node is shut down forcefully (i.e. the
> power cable is yanked from the system), the master and scheduler continue
> to think the job is running on the node until it hits a timeout (which is a
> significant amount of time). Eventually Condor realizes the node is
> offline and resubmits the job to another node. I've found that if I reduce
> the timeout, the node will time out on jobs that run longer than the
> timeout even if the node is online and operating properly.
>
> Is there a way to forcefully remove a node from the master (a node that has
> dropped offline but that Condor still thinks is running) from the command
> line on the scheduler or master? I've tried condor_off, condor_vacate,
> condor_vacate_job, etc., but none of these work because they all try to
> reach out to the node (which is now offline). Is there a command to simply
> remove a node from the pool immediately and have the job start over on
> another node?
>
> Thanks,