
Re: [Condor-users] Reboot the submit machine and not restart jobs?



On 1/11/06, Finch, Ralph <rfinch@xxxxxxxxxxxx> wrote:
> condor -version
> $CondorVersion: 6.7.13 Nov  7 2005 $
> $CondorPlatform: INTEL-WINNT50 $

given this

> Occasionally after submitting jobs from my machine, I need to reboot my
> machine (it's windows, after all).  However, this means all my jobs must
> restart (since this is windows, it is only the vanilla universe and they
> are not checkpointed).  Since the jobs take a few hours to complete, I
> was wondering if it's possible for the jobs running in the pool on other
> machines to not restart, but simply reconnect with new condor_shadows
> when my submit machine comes back after reboot.

The answer is yes, but only if the startds are also running the 6.7
series and are configured to enable leasing.

http://www.cs.wisc.edu/condor/manual/v6.7/2_15Special_Environment.html#SECTION003154000000000000000

A proviso on this for Windows: if you shut down normally, the lease
will not kick in; the shutdown triggers an eviction instead. To make
it work across a reboot you need to hard-kill your Condor subsystem
(pskill is going to be your friend here), as if your machine had reset
without warning.
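
For what it's worth, here is a rough sketch of the hard-kill step as a
small Python wrapper around pskill. The daemon names and the kill order
are my assumptions (master first, so it cannot restart the others or
start a graceful shutdown), and it assumes pskill is on your PATH. It
defaults to a dry run that only prints the commands.

```python
import subprocess

# Assumed submit-side daemon names; adjust to match your install.
# The master is listed first so it is gone before the daemons it supervises.
SUBMIT_SIDE_DAEMONS = ["condor_master", "condor_schedd", "condor_shadow"]


def build_kill_commands(daemons=SUBMIT_SIDE_DAEMONS, pskill="pskill"):
    """Return one pskill command (as an argument list) per daemon name."""
    return [[pskill, name] for name in daemons]


def hard_kill(dry_run=True):
    """Print the pskill commands, or run them when dry_run is False."""
    for cmd in build_kill_commands():
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=False: a daemon that is not running is not an error here.
            subprocess.run(cmd, check=False)


if __name__ == "__main__":
    hard_kill(dry_run=True)
```

Run it with dry_run=False just before the reboot, and only once you
have checked the process names against what is actually running on
your submit machine.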

I dislike this behaviour intensely (especially given Windows' likely
heavier use of vanilla, non-checkpointing jobs), since if you have some
jobs which *can* checkpoint and others which can't then you're SOL.

When my farm goes to 6.8 I can see this being the most common "why
can't I do this?" request I get.

Matt