Mailing List Archives
Re: [Condor-users] offline compute nodes and Rooster
- Date: Mon, 18 Oct 2010 10:49:56 +0100
- From: Paul Haldane <paul.haldane@xxxxxxxxxxxxxxx>
- Subject: Re: [Condor-users] offline compute nodes and Rooster
> From: Paul Haldane
> Sent: 17 October 2010 16:49
>
> > From: Paul Haldane
> > Sent: 16 October 2010 14:24
> >
> > > > > 3. Offline slots _should_ (I think they should, but would like
> > > > > confirmation) continue to appear in the output of condor_status (using
> > > > > -constraint Offline to just see offline slots). In our environment
> > > > > they only appear for 10 to 20 minutes after powering off. This isn't what
> > > > > I expect because OFFLINE_EXPIRE_ADS_AFTER defaults to maxint.
> > > >
> > > > Yes, the offline ads should remain visible in condor_status. They
> > > > should not expire in 30 minutes if you are using the default
> > > > OFFLINE_EXPIRE_ADS_AFTER.
> >
> > I've just been able to grab (using condor_status -l
> > yard10.campus.ncl.ac.uk) the ad for a machine that's unpingable (so it
> > is hibernating) but still visible in condor_status output.
> >
> > I won't include all 109 lines of output here (unless that would be
> > useful; the full version is at
> > http://www.staff.ncl.ac.uk/paul.haldane/yard10.txt). It all looks
> > plausible to me apart from
> >
> > Offline = ((CurrentTime - EnteredCurrentState) >= 60 &&
> > MachineLastMatchTime =?= UNDEFINED && State =?= "Unclaimed")
> >
> > Is that correct or should it just be a simple Boolean value?
> >
> > I know why it's showing that value ("Offline = $(ShouldHibernate)" in
> > the config file on the compute nodes), but I'm perfectly willing to believe
> > that it's rubbish.
>
> I've made progress on a couple of fronts.
>
> 1. Realised that we'd changed ROOSTER_UNHIBERNATE to a daft setting.
>
> We had
>
> ROOSTER_UNHIBERNATE = Unhibernate && Offline =?= False
>
> ... which I don't think would ever match. Changing it to the default value of
>
> ROOSTER_UNHIBERNATE = Unhibernate && Offline == True
>
> ... worked better, but because I don't think we're setting Unhibernate properly
> yet, I've currently got
>
> ROOSTER_UNHIBERNATE = Offline == True
I may as well point out myself that that's a really dumb idea: it leads to Rooster waking up every offline machine, even when it isn't needed to service jobs.
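The difference between the three ROOSTER_UNHIBERNATE settings can be sketched with a simplified, hypothetical model of ClassAd three-valued logic: `==` yields UNDEFINED when an operand is undefined, `=?=` never does, `&&` lets False win over UNDEFINED, and an ad is only selected when the expression is exactly True. The machine names are invented, this is not Condor code, and it assumes Rooster only wakes machines whose ad makes ROOSTER_UNHIBERNATE evaluate to True.

```python
# Hypothetical, simplified model of ClassAd evaluation; not HTCondor code.
# Machine names and attribute values below are invented for illustration.


class _Undefined:
    """Sentinel standing in for the ClassAd UNDEFINED value."""

    def __repr__(self):
        return "UNDEFINED"


UNDEFINED = _Undefined()


def attr(ad, name):
    """Look up an attribute; a missing attribute is UNDEFINED."""
    return ad.get(name, UNDEFINED)


def eq(a, b):
    """ClassAd '==': UNDEFINED if either operand is UNDEFINED."""
    if a is UNDEFINED or b is UNDEFINED:
        return UNDEFINED
    return a == b


def meta_eq(a, b):
    """ClassAd '=?=': same type and same value; never UNDEFINED."""
    return type(a) is type(b) and a == b


def land(a, b):
    """ClassAd '&&' on booleans: False wins, then UNDEFINED, then True."""
    if a is False or b is False:
        return False
    if a is UNDEFINED or b is UNDEFINED:
        return UNDEFINED
    return True


# The three settings from the thread ('=?=' and '==' bind tighter than '&&').
POLICIES = {
    "Unhibernate && Offline =?= False":
        lambda ad: land(attr(ad, "Unhibernate"), meta_eq(attr(ad, "Offline"), False)),
    "Unhibernate && Offline == True":
        lambda ad: land(attr(ad, "Unhibernate"), eq(attr(ad, "Offline"), True)),
    "Offline == True":
        lambda ad: eq(attr(ad, "Offline"), True),
}

# Hypothetical hibernating machines: Offline is True for all of them.
ADS = {
    "needed": {"Offline": True, "Unhibernate": True},
    "not_needed": {"Offline": True, "Unhibernate": False},
    "unset": {"Offline": True},  # Unhibernate never defined
}


def woken(policy):
    """Names of ads for which the policy evaluates to exactly True."""
    expr = POLICIES[policy]
    return sorted(name for name, ad in ADS.items() if expr(ad) is True)


if __name__ == "__main__":
    for policy in POLICIES:
        print(f"{policy}: {woken(policy)}")
```

In this model the `=?= False` form never selects a hibernating machine, the default form skips machines whose Unhibernate is False or undefined, and `Offline == True` selects every offline machine whether or not it is needed.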
Paul