Re: [HTCondor-users] Need help debugging HIBERNATE/ROOSTER
- Date: Tue, 18 Nov 2025 11:23:36 +0100
- From: Steffen Grunewald <steffen.grunewald@xxxxxxxxxx>
- Subject: Re: [HTCondor-users] Need help debugging HIBERNATE/ROOSTER
Good morning again,
with timezone differences at work and only very few people into energy saving,
I "Use(d) The Source" and what's left of my intuition to get a bit further;
thanks to Zach McGrew for his musings which confirm (and extend) my findings
so far:
> To start with, I have the following in the config (I'm dropping ROOSTER
> stuff and intermediate definitions for clarity):
>
> # condor_config_val -dump -expand | grep -i Hiber
> HIBERNATE = ifThenElse(( (State == "Unclaimed") && ( ((time() - EnteredCurrentState) > (30 * 60)) || ((time() - NumDynamicSlotsTime) > (30 * 60)) ) && ((time() - DaemonStartTime) > (6 * 3600)) ), "SHUTDOWN", "NONE")
The first finding (after setting STARTD_DEBUG to D_FULLDEBUG) was that, while
the first two timing expressions kicked in at some point, the "AllHibernating"
messages didn't show up anymore.
> HIBERNATE_CHECK_INTERVAL = (5 * 60)
> HIBERNATION_OVERRIDE_WOL = True
> LINUX_HIBERNATION_METHOD = "/sys"
Since I had decided not to use (install) the "pm-suspend" package, the choice
of the LINUX_HIBERNATION_METHOD was no longer necessary, so I removed that one.
After a while of head-scratching, I also decided to remove the clause containing
"DaemonStartTime" - this was meant to extend the one-hour grace period after
system start (which the documentation claims exists, but which I wasn't able to
confirm during my tests!). This gave me (almost) instant success.
Using "LastBenchmark" instead of "DaemonStartTime" also seems to work.
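For reference, a sketch of the two variants, reconstructed from the expanded HIBERNATE line quoted above. The first simply drops the "DaemonStartTime" clause. The second is an assumption about how "LastBenchmark" would be substituted: it keeps the same 6-hour threshold, which is not spelled out above. Only one HIBERNATE definition should be active at a time, so the second is left commented out.

```
# Variant 1: DaemonStartTime clause removed entirely
HIBERNATE = ifThenElse(( (State == "Unclaimed") && ( ((time() - EnteredCurrentState) > (30 * 60)) || ((time() - NumDynamicSlotsTime) > (30 * 60)) ) ), "SHUTDOWN", "NONE")

# Variant 2 (assumed form): LastBenchmark in place of DaemonStartTime,
# same 6-hour threshold as the original clause
# HIBERNATE = ifThenElse(( (State == "Unclaimed") && ( ((time() - EnteredCurrentState) > (30 * 60)) || ((time() - NumDynamicSlotsTime) > (30 * 60)) ) && ((time() - LastBenchmark) > (6 * 3600)) ), "SHUTDOWN", "NONE")
```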
> (#2 will be added soon, I'm afraid.)
So here's question #2 (to the developers):
What makes "DaemonStartTime" break the expression above, once evaluation gets
that far?
With Zach's long email in mind, I'm now wondering how to best test ROOSTER
settings and functionality, in particular with more than a suspicion that WOL
won't work out of the box (although enabled in BIOS, and reported back by
`condor_status` as "IsWakeOnLanSupported = true" - so the override wouldn't be
required for this particular class of machines I'm currently testing with).
This is question #3:
Without any job pressure, is there a means to find out what ROOSTER would do
*if* there was need for more resources?
(The default "Unhibernate" expression means that the (unavailable) machine was
involved in a successful match, correct? As such, it would be one of many of
the same type... and I'd have to limit the # of machines woken?)
Any means to run the ROOSTER_WAKEUP_CMD for a certain machine?
(Related: which UID is used to run that?)
Question #4:
With the "DaemonStartTime" debacle in mind, how would I write ROOSTER_UNHIBERNATE_RANK
so that machines that have been switched off for longer rank higher (so the wear
gets more balanced between all hardware)? Would e.g. "LastHeardFrom" work? Anything better?
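To make the idea concrete, this is roughly what I have in mind. It is an untested sketch: it assumes that "LastHeardFrom" in the offline ad still reflects the last update before the machine went down and that time() is evaluated when the rank is computed. Neither is confirmed.

```
# Untested sketch: a larger value means the machine was last heard from
# longer ago, so it would be preferred for wake-up.
ROOSTER_UNHIBERNATE_RANK = time() - LastHeardFrom
```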
Question #5a:
Currently there are multiple sets of entries in the COLLECTOR_PERSISTENT_AD_LOG
for a single machine, is this normal?
Is this due to me powering up the machines manually, without ROOSTER involved?
Will the file be compacted (how and when)?
Question #5b:
Which entries should there be for a machine ClassAd in that file, set to which
values, and which ones should be absent? Zach's explanation of the extra
ClassAd submissions on a dying machine makes me rather nervous as I can't
afford a large part of the cluster being shut down - when I'm unable to let
the ROOSTER wake it up again...
That's all for today, I'm afraid.
Thanks so far,
Steffen