
Re: [HTCondor-users] How to handle hibernated machines failing to wakeup?



Don't forget that the negotiator matches resource requests: it assigns resources to APs, not jobs to slots. So even if a given machine can't be woken from sleep, there's no starvation risk. To my understanding:

(1) The negotiator marks the offline ad as one that matched -- one that
    would have been sent to an AP but wasn't, because it was offline --
    with the `MachineLastMatchTime` attribute.
(2) Some amount of time later -- depending on ROOSTER_INTERVAL -- the
    rooster daemon evaluates ROOSTER_UNHIBERNATE for each machine ad.
    That expression defaults to `Offline && Unhibernate`, with the latter
    defaulting to `MachineLastMatchTime =!= UNDEFINED`.
(3) If that evaluation is true, rooster attempts to wake up the machine.

So at no point is any job stuck waiting for any machine to wake up.
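
To make steps (1)-(3) concrete, here is a toy Python model of one rooster cycle. It is only a sketch of my understanding, not HTCondor code: the ads are plain dicts, and the function names and the `wake` callback are made up for illustration.

```python
# Toy model of steps (1)-(3); not HTCondor code. Ads are plain dicts.

def unhibernate(ad):
    # Models the default described above: MachineLastMatchTime =!= UNDEFINED
    return ad.get("MachineLastMatchTime") is not None


def rooster_unhibernate(ad):
    # Models the default described above: Offline && Unhibernate
    return bool(ad.get("Offline", False)) and unhibernate(ad)


def rooster_cycle(ads, wake):
    """Call wake(ad) for each ad whose expression is true; return the names tried."""
    attempted = []
    for ad in ads:
        if rooster_unhibernate(ad):
            wake(ad)  # hypothetical callback standing in for the wake-up command
            attempted.append(ad["Name"])
    return attempted


ads = [
    {"Name": "node1", "Offline": True, "MachineLastMatchTime": 1700000000},
    {"Name": "node2", "Offline": True},  # offline, but never matched
    {"Name": "node3", "Offline": False, "MachineLastMatchTime": 1700000000},
]
woken = rooster_cycle(ads, wake=lambda ad: None)
print(woken)  # ['node1']
```

Note that nothing in the loop refers to a job: the wake attempt depends only on the machine ad.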

By default, each cycle condor_rooster wakes up every machine that had matches. If you want to wake up machines more slowly, you can limit that with ROOSTER_MAX_UNHIBERNATE, but then it becomes critical that ROOSTER_INTERVAL be longer than the longest amount of time it takes to wake up a machine (that is, from issuing the command to the new machine ad appearing) plus the longest negotiation cycle (so that `MachineLastMatchTime` is updated properly).
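
The constraint on ROOSTER_INTERVAL is simple arithmetic. A small Python sketch, with made-up measurements (180 seconds to wake, 120 seconds for the longest negotiation cycle) that you would replace with numbers from your own pool:

```python
def min_safe_interval(longest_wake_s, longest_negotiation_cycle_s):
    """Lower bound, in seconds, that ROOSTER_INTERVAL must exceed."""
    return longest_wake_s + longest_negotiation_cycle_s


def interval_is_long_enough(rooster_interval_s, longest_wake_s,
                            longest_negotiation_cycle_s):
    # "Longer than", so the comparison is strict.
    bound = min_safe_interval(longest_wake_s, longest_negotiation_cycle_s)
    return rooster_interval_s > bound


# Hypothetical measurements, for illustration only.
bound = min_safe_interval(180, 120)
print(bound)                                    # 300
print(interval_is_long_enough(300, 180, 120))   # False
print(interval_is_long_enough(600, 180, 120))   # True
```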

I don't know if `MachineLastMatchTime` is cleared by the negotiator at the beginning of every cycle or not. Clearing it seems like it would be most useful; if it isn't cleared, the default unhibernate expression should probably include a recency check so that a machine isn't woken up for a job that left the queue an hour ago.
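
A recency check of that kind could look like the following toy Python model. It assumes `MachineLastMatchTime` holds a Unix timestamp, which I haven't verified, and the 600-second window is an arbitrary value chosen for the example.

```python
import time


def recently_matched(ad, max_age_s, now=None):
    """True if the ad has a MachineLastMatchTime no older than max_age_s."""
    last_match = ad.get("MachineLastMatchTime")  # assumed to be a Unix timestamp
    if last_match is None:  # attribute undefined: never matched
        return False
    if now is None:
        now = time.time()
    return (now - last_match) <= max_age_s


def should_wake(ad, max_age_s, now=None):
    # Offline, and matched recently enough to be worth waking.
    return bool(ad.get("Offline", False)) and recently_matched(ad, max_age_s, now)


now = 1_700_003_600  # fixed "current time" so the example is repeatable
fresh = {"Offline": True, "MachineLastMatchTime": now - 60}
stale = {"Offline": True, "MachineLastMatchTime": now - 3600}
fresh_result = should_wake(fresh, 600, now)
stale_result = should_wake(stale, 600, now)
print(fresh_result, stale_result)  # True False
```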

Other than that, yes, condor_rooster assumes that there's no queue of power-on requests: once the command is issued, either it succeeded and subsequent power-on commands are harmless, or, if a subsequent command has an effect, that's because a previous one failed.

Or get out the big hammer instead and force it to be Absent, optionally
switching it off (via IPMI, of course)... let me do the first half as a
safety net (this will get the machine out of the match list sooner or later,
I hope - see above) and watch the list of Absent nodes to decide on their
power status manually (via IPMI).

BTW, what about "IsWakeAble", is that used by the Negotiator to find matching
Offline nodes, or is "Absent" just the better choice (although a bit drastic)?

Looking at the documentation, I'm not sure that `IsWakeEnabled` isn't the better choice, although AFAICT neither is used by default.

	Semantically, "Absent" makes sense to me.

Another option, if you'd rather that rooster not keep trying to wake up machines that won't wake, is to adjust ROOSTER_UNHIBERNATE to ignore some offline ads (those marked as unwakeable). Of course, something would have to mark the ad as unwakeable.
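
As a rough illustration of that idea, here is a toy Python model in which an ad is marked after a few failed wake attempts and then skipped. The `Unwakeable` and `FailedWakeAttempts` attributes and the limit of 3 are invented for this sketch; HTCondor does not set them for you, so whatever does the marking is up to you.

```python
MAX_FAILED_WAKES = 3  # hypothetical threshold


def record_failed_wake(ad):
    """Count a failed wake attempt; mark the ad once the limit is reached."""
    ad["FailedWakeAttempts"] = ad.get("FailedWakeAttempts", 0) + 1
    if ad["FailedWakeAttempts"] >= MAX_FAILED_WAKES:
        ad["Unwakeable"] = True  # hypothetical marker attribute


def should_wake(ad):
    # The default condition as described earlier, plus the extra exclusion.
    wants_wake = (bool(ad.get("Offline", False))
                  and ad.get("MachineLastMatchTime") is not None)
    return wants_wake and not ad.get("Unwakeable", False)


ad = {"Name": "node1", "Offline": True, "MachineLastMatchTime": 1700000000}
before = should_wake(ad)
for _ in range(MAX_FAILED_WAKES):
    record_failed_wake(ad)
after = should_wake(ad)
print(before, after)  # True False
```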

I don't think you should need to worry about the negotiator matching unwakeable machines -- it shouldn't have any effect on the rest of the pool's operations -- but if you don't want it to even see them, you can set NEGOTIATOR_SLOT_CONSTRAINT to ignore them.

-- ToddM