Mailing List Archives: UW Madison Computer Sciences Department, Computer Systems Lab


Re: [HTCondor-users] How to handle hibernated machines failing to wakeup?

Date: Wed, 3 Dec 2025 17:34:29 +0100
From: Steffen Grunewald <steffen.grunewald@xxxxxxxxxx>
Subject: Re: [HTCondor-users] How to handle hibernated machines failing to wakeup?

Hi Todd,

On Tue, 2025-12-02 at 14:52:04 -0600, HTCondor Users Mailinglist wrote:
> > (a) let the matchmaker forget about the assignment of the job to that
> >    machine (there must be a timeout somewhere?) and
> 
> 	I suspect there isn't a memory

No memory - would that mean that a match would be forgotten until the next
Matchmaker round?
That would be completely counterproductive in a Rooster context:
Negotiator/Matchmaker cycles, in our setup, would happen about every minute
while the rooster would run every 5 or 10 minutes, and waking machines (from
S5 state until the transition Benchmarking -> Idle) takes another 4--7 minutes
(perhaps more for the ones with more memory).
So it would make sense to keep the assignments alive for a while, although not
too long...
With a staggered power-up (to work around possible filesystem overload), and
hundreds of jobs/machines to be reactivated it may take the better part of an
hour to handle all matches, but in the meantime short-running jobs may have
already finished and others (possibly from the same job cluster) may have been
started on the machines already available.
Thus it'd make sense to review pending power-on requests. That makes me wonder
whether a sufficiently low ROOSTER_MAX_UNHIBERNATE setting will cause the machine
(ClassAd) list presented to the Rooster to be rebuilt at each ROOSTER_INTERVAL,
or whether a match list for, say, 250 jobs mapped onto 250 machines will
necessarily result in all 250 machines being activated even if the jobs have
already finished on the first 100 (because their runtime << ROOSTER_INTERVAL)?
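
To put rough numbers on the staggered power-up, here is a back-of-the-envelope sketch. It assumes the Rooster wakes at most ROOSTER_MAX_UNHIBERNATE machines per ROOSTER_INTERVAL and that every wake-up succeeds; the function name and the sample values (50 per cycle, 10-minute interval, 7-minute boot) are made up for illustration.

```python
import math

def minutes_until_all_awake(machines, max_per_cycle, rooster_interval_min, wake_min):
    """Rough time until the last machine of a staggered power-up is usable."""
    if max_per_cycle <= 0:
        cycles = 1  # treat 0 as "no limit": everything is woken in one cycle
    else:
        cycles = math.ceil(machines / max_per_cycle)
    # The last batch starts (cycles - 1) intervals after the first one,
    # then needs wake_min minutes to get from S5 to Idle.
    return (cycles - 1) * rooster_interval_min + wake_min

# 250 machines, 50 per cycle, 10-minute interval, 7 minutes to boot
print(minutes_until_all_awake(250, 50, 10, 7))  # 47
```

With those assumed values the last machine is ready after 47 minutes, far longer than a 2-minute job runs, which is why the question of whether the list gets rebuilt each interval matters.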

> -- although I could be wrong --

Hm, is the documentation in the code?

>   and that the
> problem, as you suggest below, is that the unwakeable machine(s) sort to the
> same position every time, so if you have k jobs and k unwakeable machines,
> you won't ever wake a machine.

Yes, that's what I've seen for a while, until the first jobs of that cluster
had finished and their machines took over. I couldn't clearly see the effect
of this on the Matchmaker-Rooster interaction though - if there was any?

I didn't have the time yet to just submit some 500 2-minute jobs and watch
how many nodes would come up - perhaps doing that would answer my previous
question, but not explain the details.
(It's rather tricky for me to read all the logs in a "synoptic" way and
identify all the messages sent back and forth that describe a job's progress.)

> > (b) modify the NEGOTIATOR_PRE_JOB_RANK (I suppose this is the right one)
> >    to reorder Offline machines so this particular one gets ranked down
> >    /excluded in the next cycle (as long as there are other machines...)
> 
> > (Could MachineLastMatchTime be used for (b)?
> 
> 	Probably.

Likely - but it will change in parallel with LastHeardFrom, so the weighting
should be different...

> > How to balance it against LastHeardFrom which is already used to get even
> > "wear"?
> 
> 	Assuming your wear-balancing is `+(k * (time() - LastHeardFrom))`, where `k`
> is a scaling factor depending on what else is in NEGOTIATOR_PRE_JOB_RANK you
> probably want `-(l * (time() - MachineLastMatchTime))`, where `l` is a
> (positive, nonzero) constant less than `k`, so as not to overwhelm it.

Yes, that makes sense. I hope.
The proof will be in the eating, I'm afraid.
Maybe adding some hefty random() stuff will stir things, too.
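
Before touching the real NEGOTIATOR_PRE_JOB_RANK, the interplay of the two terms can be tried out in a toy model. This is plain Python, not HTCondor code: it assumes a higher rank is preferred, the machine names and timestamps are invented, and `match_sign` is a hypothetical knob so that both signs of the MachineLastMatchTime term can be compared.

```python
def toy_rank(now, last_heard_from, last_match_time, k=1.0, l=0.5, match_sign=-1):
    """k * (now - LastHeardFrom) plus a signed l * (now - MachineLastMatchTime)."""
    wear = k * (now - last_heard_from)
    match = match_sign * l * (now - last_match_time)
    return wear + match

def order(machines, now, **kwargs):
    """Machine names sorted by toy rank, highest (preferred) first."""
    return sorted(machines,
                  key=lambda name: toy_rank(now, *machines[name], **kwargs),
                  reverse=True)

NOW = 100_000
# name -> (LastHeardFrom, MachineLastMatchTime), both invented
MACHINES = {
    "stuck": (NOW - 10_000, NOW - 60),     # offline longest, matched a minute ago
    "other": (NOW - 9_000, NOW - 5_000),   # offline slightly less, matched long ago
}

print(order(MACHINES, NOW, match_sign=-1))  # ['stuck', 'other']
print(order(MACHINES, NOW, match_sign=+1))  # ['other', 'stuck']
```

In this toy the sign of the second term alone decides whether the just-matched machine stays on top or drops behind, so that is the first thing to verify against the real negotiator.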

> > What else comes to mind?)
> 
> 	The wake-up script could record the last (k) time(s) it tried to wake up a
> given machine and set the unwakeable machine's START expression to FALSE?

Or get out the big hammer instead and force it to be Absent, optionally
switching it off (via IPMI, of course)... let me do the first half as a
safety net (this will get the machine out of the match list sooner or later,
I hope - see above) and watch the list of Absent nodes to decide on their
power status manually (via IPMI).
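
The bookkeeping for such a safety net could look roughly like the sketch below. It only covers remembering the last few wake-up attempts per machine; how the machine is then marked Absent (or its START set to FALSE) is deliberately left out. All names, the state file path and the limit of 3 attempts are assumptions, and it presumes the attempts are cleared once a machine reports back.

```python
import json
from pathlib import Path

STATE_FILE = Path("wake_attempts.json")  # hypothetical location

def load_state(path=STATE_FILE):
    """Read the per-machine attempt history, or start empty."""
    return json.loads(path.read_text()) if path.exists() else {}

def save_state(state, path=STATE_FILE):
    path.write_text(json.dumps(state))

def record_wake_attempt(state, machine, now, keep=3):
    """Remember this attempt, keeping only the last `keep` timestamps."""
    attempts = state.setdefault(machine, [])
    attempts.append(now)
    del attempts[:-keep]
    return attempts

def clear_attempts(state, machine):
    """Call when the machine has come back, so old failures are forgotten."""
    state.pop(machine, None)

def should_give_up(state, machine, max_attempts=3):
    """Being asked again after max_attempts tries means none of them worked."""
    return len(state.get(machine, [])) >= max_attempts
```

A wake-up wrapper would call `should_give_up()` first and only send the wake-up (and call `record_wake_attempt()`) if it returns False; otherwise it would hand the machine over to whatever marks it Absent.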

BTW, what about "IsWakeAble", is that used by the Negotiator to find matching
Offline nodes, or is "Absent" just the better choice (although a bit drastic)?

Thanks so far,
 S

Follow-Ups:
- Re: [HTCondor-users] How to handle hibernated machines failing to wakeup?
  - From: Todd L Miller

References:
- [HTCondor-users] How to handle hibernated machines failing to wakeup?
  - From: Steffen Grunewald
- Re: [HTCondor-users] How to handle hibernated machines failing to wakeup?
  - From: Todd L Miller
