Re: [Condor-users] Should the schedd/startd's tolerate schedd machine reboots?


Date:	Wed, 2 Feb 2005 18:29:29 +0000
From:	Matt Hope <matthew.hope@xxxxxxxxx>
Subject:	Re: [Condor-users] Should the schedd/startd's tolerate schedd machine reboots?

On Wed, 2 Feb 2005 13:05:46 -0500, Ian Chesal <ICHESAL@xxxxxxxxxx> wrote:
> With an appropriately long ALIVE_INTERVAL (the default of 300 seconds seems
> fine) and MAX_CLAIM_ALIVES_MISSED (the default of 6 seems fine), I
> expected startds to tolerate a reasonably fast reboot of a schedd
> machine and continue to run jobs. I expected the startd to tolerate an
> outage of up to 30 minutes with the schedd before terminating running
> jobs. I'm not observing this behaviour though. I'm seeing startds vacate
> running jobs as soon as the schedd machine goes down. This is on WinXP
> to WinXP machines with 6.7.3. Is it perhaps due to a shutdown routine in
> the schedd? As the service is brought down, does it reach out to startds
> to tell them to terminate running jobs? Can I prevent this so reboots are
> tolerated? Reboots are a necessary evil in our Windows development
> environment, unfortunately.

The job lease duration controls whether running jobs survive a schedd reboot:

http://www.cs.wisc.edu/condor/manual/v6.7/2_13Special_Environment.html#sec:Job-Lease

You must:
1) make sure your execute machines will allow leasing
2) make sure your submitters include "job_lease_duration" in their
submit scripts

Are you sure both of the above are happening?

(Also note that if you are using the other 6.7-series functionality,
streaming output, it will prevent leasing from working.)
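
For point 2, a submit description file might look roughly like the
sketch below. The executable and file names are hypothetical, and the
1800-second lease is only an illustrative value chosen to match the
30-minute outage mentioned above; "job_lease_duration" is given in
seconds. Streaming is left off because it interferes with leasing.

```
# Sketch of a submit description file (file names are placeholders)
universe           = vanilla
executable         = my_job.exe
output             = my_job.out
error              = my_job.err
log                = my_job.log

# Ask for the job to be kept running for up to 30 minutes
# (1800 seconds) without contact from the schedd
job_lease_duration = 1800

# Do not stream output or error back to the submit machine,
# since streaming prevents leasing from working
stream_output      = False
stream_error       = False

queue
```

The execute-side setting from point 1 is not shown here; see the
manual section linked above for it.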

Matt
