Re: [HTCondor-users] How to handle hibernated machines failing to wakeup?
- Date: Wed, 3 Dec 2025 16:49:33 -0600 (CST)
- From: Todd L Miller <tlmiller@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] How to handle hibernated machines failing to wakeup?
Don't forget that the negotiator matches resource requests: it
assigns resources to APs, not jobs to slots. So even if a given machine
can't be woken from sleep, there's no starvation risk. To my
understanding:
(1) The negotiator marks the offline ad as one that matched -- one that
would have been sent to an AP but wasn't, because it was offline --
with the `MachineLastMatchTime` attribute.
(2) Some amount of time later -- depending on ROOSTER_INTERVAL -- the
rooster daemon evaluates ROOSTER_UNHIBERNATE for each machine ad,
which defaults to `Offline && Unhibernate`, with the latter defaulting
to `MachineLastMatchTime =!= UNDEFINED`.
(3) If that evaluation is true, rooster attempts to wake up the machine.
So at no point is any job stuck waiting for any machine to wake up.
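
To make steps (2) and (3) concrete, here is a minimal Python sketch of how the default expressions evaluate against a few offline ads. It is a simplified model and not condor_rooster's actual code: ads are plain dicts, a missing key stands in for UNDEFINED, and the machine names and timestamp are made up for illustration.

```python
def unhibernate(ad):
    """Model of `MachineLastMatchTime =!= UNDEFINED`.

    A key that is absent from the dict stands in for UNDEFINED.
    """
    return "MachineLastMatchTime" in ad


def rooster_unhibernate(ad):
    """Model of the default ROOSTER_UNHIBERNATE, `Offline && Unhibernate`."""
    return ad.get("Offline") is True and unhibernate(ad)


def machines_to_wake(ads):
    """Return the names of the machines this model would try to wake."""
    return [ad["Machine"] for ad in ads if rooster_unhibernate(ad)]


# Hypothetical ads: only node01 is both offline and marked as matched.
ads = [
    {"Machine": "node01", "Offline": True, "MachineLastMatchTime": 1764802173},
    {"Machine": "node02", "Offline": True},
    {"Machine": "node03", "Offline": False, "MachineLastMatchTime": 1764802173},
]

print(machines_to_wake(ads))  # ['node01']
```

Nothing in this evaluation refers to a particular job, which is the point: the wake-up decision is driven by the machine ad alone.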
By default, condor_rooster wakes up every machine which had
matches every cycle. If you want to wake up machines more slowly, you can
limit that with ROOSTER_MAX_UNHIBERNATE, but then it becomes critical that
ROOSTER_INTERVAL be longer than the longest amount of time it takes to
wake up a machine (that is, from issuing the command to the new
machine ad appearing) plus the longest negotiation cycle (so that
`MachineLastMatchTime` is updated properly).
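
As a rough illustration of that constraint, the arithmetic can be written out as below. The numbers are invented, the function names are hypothetical, and the cycle count assumes every wake-up succeeds and no new matches arrive in the meantime; the sketch only models a positive ROOSTER_MAX_UNHIBERNATE.

```python
import math


def min_rooster_interval(longest_wake_seconds, longest_negotiation_cycle_seconds):
    """Lower bound, in seconds, that ROOSTER_INTERVAL should exceed.

    This is the longest wake-up time (command issued until the new machine
    ad appears) plus the longest negotiation cycle.
    """
    return longest_wake_seconds + longest_negotiation_cycle_seconds


def cycles_to_wake(matched_machines, max_unhibernate):
    """Rooster cycles needed to issue a wake-up to every matched machine.

    Assumes at most `max_unhibernate` machines are woken per cycle.
    """
    if max_unhibernate <= 0:
        raise ValueError("this sketch only models a positive limit")
    return math.ceil(matched_machines / max_unhibernate)


# Example values only: 3 minutes to wake, 5 minute negotiation cycle.
print(min_rooster_interval(180, 300))  # 480
# 25 matched offline machines, 10 wake-ups per cycle.
print(cycles_to_wake(25, 10))  # 3
```

With those example values, ROOSTER_INTERVAL would need to be longer than 480 seconds, and working through 25 matched machines would take three rooster cycles.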
I don't know whether `MachineLastMatchTime` is cleared by the
negotiator at the beginning of every cycle. Clearing it seems like it
would be most useful; if it isn't cleared, the default unhibernate expression
should probably include a recency check so that a machine isn't woken up
for a job that left the queue an hour ago.
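
A recency check of that kind might look like the following Python sketch. It models the logic only and is not a ClassAd expression; the 600-second threshold and the function names are assumptions for illustration, and ads are plain dicts with a missing key standing in for UNDEFINED.

```python
def matched_recently(ad, now, max_age_seconds):
    """True if the ad was matched within the last `max_age_seconds`."""
    last_match = ad.get("MachineLastMatchTime")
    if last_match is None:
        # Never matched (UNDEFINED in ClassAd terms).
        return False
    return now - last_match <= max_age_seconds


def unhibernate_with_recency(ad, now, max_age_seconds=600):
    """Offline, and matched recently enough to be worth waking."""
    return ad.get("Offline") is True and matched_recently(ad, now, max_age_seconds)


now = 1_000_000  # arbitrary example timestamp, in seconds
fresh = {"Offline": True, "MachineLastMatchTime": now - 120}
stale = {"Offline": True, "MachineLastMatchTime": now - 3600}

print(unhibernate_with_recency(fresh, now))  # True
print(unhibernate_with_recency(stale, now))  # False
```

The stale ad corresponds to the "left the queue an hour ago" case: it was matched once, but too long ago to justify a wake-up.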
Other than that, yes, condor_rooster assumes that there's no queue
of power-on requests: once a command is issued, either it succeeds, in
which case subsequent power-on commands are harmless, or a subsequent
command has an effect only because a previous one failed.
> Or get out the big hammer instead and force it to be Absent, optionally
> switching it off (via IPMI, of course)... let me do the first half as a
> safety net (this will get the machine out of the match list sooner or later,
> I hope - see above) and watch the list of Absent nodes to decide on their
> power status manually (via IPMI).
> BTW, what about "IsWakeAble", is that used by the Negotiator to find matching
> Offline nodes, or is "Absent" just the better choice (although a bit drastic)?
Looking at the documentation, I'm not sure that `IsWakeEnabled`
isn't the better choice, although AFAICT neither is used by default.
Semantically, "Absent" makes sense to me.
Another option, if you'd rather that rooster not keep trying to
wake up machines that won't wake, is to adjust ROOSTER_UNHIBERNATE to ignore
some offline ads (those marked as unwakeable). Of course, something would
have to mark the ad as unwakeable.
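
One way that "something" could work is to count wake-up attempts per machine and give up after a few failures. The Python sketch below models only the bookkeeping; the `WakeTracker` class, the three-attempt threshold, and the machine names are hypothetical, and how the resulting mark would get into the offline ad is left out entirely.

```python
from collections import Counter


class WakeTracker:
    """Counts wake-up attempts that have not yet produced a running machine."""

    def __init__(self, max_attempts=3):
        self.max_attempts = max_attempts
        self.attempts = Counter()

    def record_attempt(self, machine):
        """Call each time a wake-up command is issued for `machine`."""
        self.attempts[machine] += 1

    def record_wakeup(self, machine):
        """Call when the machine's new ad appears; resets its count."""
        self.attempts.pop(machine, None)

    def is_unwakeable(self, machine):
        """True once the machine has used up its attempts without waking."""
        return self.attempts[machine] >= self.max_attempts


def should_try_wake(ad, tracker):
    """Offline, matched, and not yet given up on."""
    return (
        ad.get("Offline") is True
        and "MachineLastMatchTime" in ad
        and not tracker.is_unwakeable(ad["Machine"])
    )


tracker = WakeTracker(max_attempts=3)
for _ in range(3):
    tracker.record_attempt("node01")  # never comes back
tracker.record_attempt("node02")
tracker.record_wakeup("node02")       # woke up, count reset

ad1 = {"Machine": "node01", "Offline": True, "MachineLastMatchTime": 1}
ad2 = {"Machine": "node02", "Offline": True, "MachineLastMatchTime": 1}
print(should_try_wake(ad1, tracker))  # False
print(should_try_wake(ad2, tracker))  # True
```

Resetting the count on a successful wake-up keeps a machine that is merely slow from being written off permanently.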
I don't think you should need to worry about the negotiator
matching unwakeable machines -- it shouldn't have any effect on the rest
of the pool's operations -- but if you don't want it to even see them, you
can set NEGOTIATOR_SLOT_CONSTRAINT to ignore them.
-- ToddM