Re: [HTCondor-users] Debugging HIBERNATE/ROOSTER, halfway done
- Date: Fri, 21 Nov 2025 10:33:21 +0100
- From: Steffen Grunewald <steffen.grunewald@xxxxxxxxxx>
- Subject: Re: [HTCondor-users] Debugging HIBERNATE/ROOSTER, halfway done
Good morning Zach (and everyone else still listening :)),
On Thu, 2025-11-20 at 21:11:54 +0000, Zach McGrew wrote:
> > Also it's unclear what would happen if a particular machine selected for waking
> > would not come up properly - during my tests I saw the same machine addressed by
> > the Rooster over and over (because the script had a bug).
>
> I've had this happen with machines where the jobs get stuck scheduling to the same offline machine instead of trying another host.
While this happened, I found that the Matchmaker had selected only a single host,
following its NEGOTIATOR_PRE_JOB_RANK rule. Only that one machine ever had
MachineLastMatchTime set to a defined value, but since the rule has been fixed in
the meantime, I couldn't find out whether - and when - the Matchmaker would take
another round.
This is where I should probably add a small random() component to the RANK expression,
to be able to move on to another machine... and to find out whether Matchmaker matches
actually expire and are redone (maybe using a machine-name requirement that matches
only a few nodes).
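Something along these lines is what I have in mind - untested, and the jitter size is
a placeholder that must stay well below the weights of the real terms:

    # Untested sketch: keep whatever rank expression is already defined and just
    # append a small random jitter, so the Negotiator doesn't insist on the same
    # (possibly dead) machine forever. random() is the ClassAd function; the
    # magnitude 100 is a placeholder.
    NEGOTIATOR_PRE_JOB_RANK = $(NEGOTIATOR_PRE_JOB_RANK) + random(100)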
> Rooster attempts to wake the machine, it fails for any number of reasons, and the loop repeats of matching to the offline machine.
That seems to be because the Rooster doesn't do any matching itself (which is why
ROOSTER_UNHIBERNATE_RANK didn't help me in the first place), and relies fully on
what the Negotiator assigned.
> I remember seeing an "Avoiding Blackholes" page on the old Condor wiki somewhere,
That one still exists? I remember ignoring more and more of the suggestions just
because several macros had been obsoleted over the years...
> I usually just condor_hold && condor_release the job which is enough to reschedule it somewhere else.
From the admin's viewpoint that's undesirable - I wouldn't want to have to find out
which job that was (although there's "condor_q -analyze -reverse").
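If I have to, the manual workaround is at least a one-liner, with a made-up job id
standing in for the real one:

    condor_hold 1234.0 && condor_release 1234.0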
But it becomes clearer and clearer that switching off nodes, to save electrical
energy, is only the smaller part of the task - while the remaining stuff may cost
a lot of unmetered human energy ;)
> That shouldn't be too bad to test. Something like this that reads from a list of hosts to attempt to wakeup:
As almost always, reality was way quicker than my own attempts to test. Multiple
nodes have been reactivated, and while the selection criteria still need some
adjustment (back to the JOB_RANK from before...), it "just works" (somehow, at
least). Right now I see no unused nodes, and there are no idle jobs, so all is
fine.
Let's see what will break over the next days. (And what can be fixed, and how.)
> queue host from hosts.txt
(I admit I didn't know that.)
> Submitting that would schedule one job per host specified, and if offline should trigger Rooster to attempt to wake it up.
Yes, my idea had been to do that, but one of my users was faster.
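For the archives, the full submit file would probably look roughly like this
(untested; the executable, arguments and file names are placeholders):

    # untested sketch: one throwaway job per listed host, pinned to that host,
    # so the Negotiator matches the offline machines and the Rooster wakes them
    executable   = /bin/sleep
    arguments    = 300
    requirements = (Machine == "$(host)")
    log          = wake_$(host).log
    queue host from hosts.txt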
> > It gets rewritten, and versioned in some way (I now have .3, .4 and the "main"
> > one), but I'm not sure about the lifetime, I haven't set OFFLINE_EXPIRE_ADS_AFTER
> > so this should keep the Offline nodes forever I suppose.
>
> I'm hoping to upgrade my hardware before the next ~4085 years (INT_MAX seconds), and this won't be a problem for losing offline nodes. =)
At least there's still a place in configs that won't get hit by Y2K38 :)
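Should I ever want expiry after all, I suppose it would be a one-liner - untested,
the value is a placeholder, and if I read the docs right the unit is seconds:

    # untested: expire persistent offline ads after ~30 days instead of keeping
    # them (effectively) forever; 2592000 s = 30 days is just a placeholder
    OFFLINE_EXPIRE_ADS_AFTER = 2592000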
> And jumping back to your previous questions, "but not obviously following the UNHIBERNATE_RANK?" and "likely by modifying the NEGOTIATOR_PRE_JOB_RANK?":
>
> Those would be solving two different things. The NEGOTIATOR_PRE_JOB_RANK would help determine which machines match (example: do you prefer to schedule to a particular type of system, prefer to schedule to a system that's online or off, etc.).
... and therefore has to consider both online nodes (as before) and offline nodes
(which should be matched in an order that then determines their wakeup order).
> I have mine set to fill nodes depth first, keeping others idle or off when possible.
IIRC this was a requirement for the Rooster to match machine requirements best, and
strongly suggested by the docs. I've had it set all the time anyway, because I'd like
to keep fragmentation as low as possible even without excessive defrag usage.
> The ROOSTER_UNHIBERNATE_RANK is used once a list of offline hosts have been matched
Yes, that's what I found out the hard way...
> and need to be powered on, and determines what order to wake them up in (example: prefer to turn on a GPU node before the CPU node because maybe it takes longer to boot?)
TBH I'm still wondering what that would be good for - as the job-machine match seems to
be unchanged (for some time at least, details to be found out).
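Should I ever want an explicit ordering, I imagine it would look something like this
(untested, and it assumes the persisted offline ads still carry a TotalGPUs attribute):

    # untested sketch: wake GPU nodes first (rank = GPU count), CPU-only nodes
    # last; this only changes the wakeup order, not where jobs end up
    ROOSTER_UNHIBERNATE_RANK = ifThenElse(TotalGPUs =?= UNDEFINED, 0, TotalGPUs)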
> Everything in the list will eventually be woken, but this lets you adjust what order to do it in.
One point to consider might be the extra power consumption right after power-on,
when all fans spin up to their maximum rpm (during BMC init there's no
temperature-driven fan control); combined across several nodes woken at once, that
might trip a circuit breaker. (Don't ask me how to map such "topology stuff" to
a RANK rule...)
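What might help more than any RANK trickery - if I remember the knob correctly - is
capping how many machines the Rooster wakes per cycle:

    # from memory, untested: limit simultaneous wakeups per rooster cycle so
    # the spin-up load stays bounded; 2 is a placeholder
    ROOSTER_MAX_UNHIBERNATE = 2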
> Good luck with your green computing! I got to see some of the building power usage graphs before and after we set this up in our computer labs.
Someone here came up with a very rough estimate of the order of 10--15%.
The money saved won't buy a human helper to manually power cycle failing machines.
How many nodes will be lost to the extra thermal stress (from cycling not only
between "hot" = fully loaded and "cool" = idle, but also down to "cold" = off) and
the related connector failures remains to be seen.
We'll be doing this with old, out-of-service units first, before possibly proceeding
to new machines. For the latter, we also need to find out whether this mode of
operation would still be covered by the warranty - as the BMC will record every
single boot-up...
> It was a very noticeable drop off for power usage, and we still get to run research jobs when they land in the queue. Win-win situation.
Now if I could do the same with the (variable) section of the setup that runs Slurm...
(Idea: start slurmd instances via HTCondor jobs, resume those nodes from Slurm's
"down" state, let the Slurm jobs run, then kill the HTCondor jobs. This somehow
mimics the old Parallel Universe openmpiscript.)
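Very roughly, and purely as a thought experiment (paths, flags and the node list are
placeholders, and the question of privileges is conveniently ignored), the HTCondor
side of that idea might look like:

    # purely hypothetical sketch of "slurmd as an HTCondor job"; slurmd normally
    # needs privileges a vanilla job won't have, so this is a thought experiment
    executable   = /usr/sbin/slurmd
    # -D keeps slurmd in the foreground so the job's lifetime controls it
    arguments    = -D
    requirements = (Machine == "$(host)")
    queue host from slurm_nodes.txt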
Thanks,
Steffen