Mailing List Archives Authenticated access	UW Madison Computer Sciences Department Computer Systems Lab


Re: [HTCondor-users] How to handle hibernated machines failing to wakeup?

Date: Wed, 3 Dec 2025 16:49:33 -0600 (CST)
From: Todd L Miller <tlmiller@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] How to handle hibernated machines failing to wakeup?

Don't forget that the negotiator matches resource requests: it assigns resources to APs, not jobs to slots. So even if a given machine can't be woken from sleep, there's no starvation risk. To my understanding:


(1) The negotiator marks the offline ad as one that matched -- one that
    would have been sent to an AP but wasn't, because it was offline --
    with the `MachineLastMatchTime` attribute.
(2) Some amount of time later -- depending on ROOSTER_INTERVAL -- the
    rooster daemon evaluates ROOSTER_UNHIBERNATE for each machine ad,
    which defaults to `Offline && Unhibernate`, with the latter defaulting
    to `MachineLastMatchTime =!= UNDEFINED`.
(3) If that evaluation is true, rooster attempts to wake up the machine.

So at no point is any job stuck waiting for any machine to wake up.
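
For reference, the chain above written out as configuration. This is only a sketch restating the defaults described in the steps; the ROOSTER_INTERVAL value is illustrative, and my understanding (an assumption, check the manual for your version) is that UNHIBERNATE is set on the execute machines while the ROOSTER_* knobs belong wherever condor_rooster runs.

```
# Execute-machine side: ends up in the machine ad as Unhibernate.
UNHIBERNATE = MachineLastMatchTime =!= UNDEFINED

# condor_rooster side: evaluated against each machine ad.
ROOSTER_UNHIBERNATE = Offline && Unhibernate

# Seconds between rooster cycles (illustrative value).
ROOSTER_INTERVAL = 300
```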

By default, every cycle, condor_rooster wakes up every machine that had matches. If you want to wake up machines more slowly, you can limit that with ROOSTER_MAX_UNHIBERNATE, but then it becomes critical that ROOSTER_INTERVAL be longer than the longest amount of time it takes to wake up a machine (that is, from issuing the command to the new machine ad appearing) plus the longest negotiation cycle (so that `MachineLastMatchTime` is updated properly).
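
A rough sketch of what that sizing could look like. The timings and the limit of 5 are made-up numbers for illustration, not recommendations; you would substitute what you measure in your own pool.

```
# Illustrative measurements (assumptions, not real data):
#   longest wake-up, command issued -> new machine ad appears: about 180 s
#   longest negotiation cycle:                                 about 120 s
# So ROOSTER_INTERVAL should be longer than 180 + 120 = 300 s.
ROOSTER_INTERVAL = 360

# Wake at most this many machines per rooster cycle.
ROOSTER_MAX_UNHIBERNATE = 5
```

The margin above 300 s is arbitrary; the point is only that the interval stays above the sum of the two worst cases.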

I don't know if `MachineLastMatchTime` is cleared by the negotiator at the beginning of every cycle or not. This seems like it would be most useful; if it isn't, the default unhibernate expression should probably include a recency check so that a machine isn't woken up for a job that left the queue an hour ago.
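
If it turns out not to be cleared, a recency check might look something like the following. This is a sketch that assumes `MachineLastMatchTime` is a Unix timestamp in seconds and that the ClassAd `time()` function is usable where this expression is evaluated; the 600-second window is arbitrary.

```
# Only request a wake-up if the ad matched within the last 10 minutes.
# Assumes MachineLastMatchTime holds seconds since the epoch.
UNHIBERNATE = (MachineLastMatchTime =!= UNDEFINED) && ((time() - MachineLastMatchTime) < 600)
```

The window should probably be at least as long as ROOSTER_INTERVAL, or a match could age out before rooster ever looks at it.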

Other than that, yes, condor_rooster assumes that there's no queue of power-on requests: once the command is issued, either it's successful and subsequent power-on commands are harmless, or, if a subsequent command has an effect, it's because a previous one failed.

> Or get out the big hammer instead and force it to be Absent, optionally
> switching it off (via IPMI, of course)... let me do the first half as a
> safety net (this will get the machine out of the match list sooner or later,
> I hope - see above) and watch the list of Absent nodes to decide on their
> power status manually (via IPMI).
>
> BTW, what about "IsWakeAble", is that used by the Negotiator to find matching
> Offline nodes, or is "Absent" just the better choice (although a bit drastic)?

Looking at the documentation, I'm not sure that `IsWakeEnabled` isn't the better choice, although AFAICT neither is used by default.


	Semantically, "Absent" makes sense to me.

Another option, if you'd rather that rooster not keep trying to wake up machines that won't wake, is to adjust ROOSTER_UNHIBERNATE to ignore some offline ads (those marked as unwakeable). Of course, something would have to mark the ad as unwakeable.
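
As a sketch of that adjustment: `Unwakeable` below is a hypothetical attribute name I'm making up for illustration, not something HTCondor sets. Whatever marks the ad (a script, an admin) would have to add it to the offline ad.

```
# "Unwakeable" is a hypothetical, site-defined attribute.
# Ads without it (undefined) are still eligible, since UNDEFINED =!= True.
ROOSTER_UNHIBERNATE = Offline && Unhibernate && (Unwakeable =!= True)
```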

I don't think you should need to worry about the negotiator matching unwakeable machines -- it shouldn't have any effect on the rest of the pool's operations -- but if you don't want it to even see them, you can set NEGOTIATOR_SLOT_CONSTRAINT to ignore them.
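
Something along these lines, again using a hypothetical site-defined attribute (here called `Unwakeable`) that something outside HTCondor would have to set in the ad:

```
# "Unwakeable" is a hypothetical attribute, not one HTCondor provides.
# The negotiator only considers slot ads for which this is true.
NEGOTIATOR_SLOT_CONSTRAINT = (Unwakeable =!= True)
```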


-- ToddM

Follow-Ups:
- Re: [HTCondor-users] How to handle hibernated machines failing to wakeup?
  - From: Steffen Grunewald

References:
- [HTCondor-users] How to handle hibernated machines failing to wakeup?
  - From: Steffen Grunewald
- Re: [HTCondor-users] How to handle hibernated machines failing to wakeup?
  - From: Todd L Miller
- Re: [HTCondor-users] How to handle hibernated machines failing to wakeup?
  - From: Steffen Grunewald
