[HTCondor-users] Debugging HIBERNATE/ROOSTER, halfway done
- Date: Thu, 20 Nov 2025 12:17:50 +0100
- From: Steffen Grunewald <steffen.grunewald@xxxxxxxxxx>
- Subject: [HTCondor-users] Debugging HIBERNATE/ROOSTER, halfway done
Good morning,
I think I owe you (and myself - as this will be searchable in the list archives)
a summary of what I got working, and which (currently minor) details remain.
As a reminder, I was looking for a solution that gets us closer to Green Computing
with HTCondor, with hardware that refuses to power back up on WakeOnLAN packets.
Mainly following Zach McGrew's suggestions, I
- set
    ShouldHibernate = ( (State == "Unclaimed") \
        && ( \
            ((time() - EnteredCurrentState) > $(HibernateWait)) \
            || ((time() - NumDynamicSlotsTime) > $(HibernateWait)) \
        ) \
        && ((time() - LastBenchmark) > $(ServiceWait)) \
    )
to follow ToddT's suggestions at the 2023 Paris workshop, and extend the grace time
after system start to something longer (with HibernateWait and ServiceWait defined
somewhere else)
- changed the signal to terminate the "condor" service to SIGKILL
- of course a proper solution to issues 1806 and 1807 would be nicer (as it would
make the next few steps obsolete)
- sent a "reconfig" signal to all nodes - and waited for ServiceWait to run out
- as a result got many MachineClassAds that have "Absent=True" set in addition
to "Offline=True", and therefore would not be considered by the Rooster
- I prepared a script, to be run via cron, that takes such ClassAds and removes
the Absent attribute
- it would have been perfect if there were such a thing as COLLECTOR_CRON_* ...
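A minimal sketch of the ClassAd clean-up step described above, not the script actually used here. It assumes the ads arrive in the old "Attribute = value" text form (as printed by "condor_status -long") and only removes the Absent line; the function names are made up for illustration.

```python
import re

# Matches an old-style ClassAd line "Absent = ..." (attribute names are
# case-insensitive in ClassAds). "AbsentFoo = ..." would not match.
_ABSENT_LINE = re.compile(r"^\s*Absent\s*=", re.IGNORECASE)


def split_ads(long_output: str) -> list[str]:
    """Split '-long' style output into individual ads (blank-line separated)."""
    blocks = [block.strip() for block in re.split(r"\n\s*\n", long_output)]
    return [block for block in blocks if block]


def strip_absent(ad_text: str) -> str:
    """Return the ad text with any 'Absent = ...' line removed."""
    kept = [line for line in ad_text.splitlines()
            if not _ABSENT_LINE.match(line)]
    return "\n".join(kept) + "\n"
```

The cleaned text could then be handed back to the collector, for instance with "condor_advertise UPDATE_STARTD_AD"; whether the collector accepts it depends on the local security configuration, so treat that part as an assumption to verify.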
So far so good. But Rooster, with
    ROOSTER_INTERVAL = 600
    ROOSTER_MAX_UNHIBERNATE = 20
    ROOSTER_UNHIBERNATE_RANK = (time() - MY.LastHeardFrom)
    ROOSTER_WAKEUP_CMD = "/usr/local/sbin/condor-mywake.sh"
    ABSENT_REQUIREMENTS = True
    EXPIRE_INVALIDATED_ADS = True
    COLLECTOR_PERSISTENT_AD_LOG = /var/log/condor/CollectorAdLog
    DAEMON_LIST = $(DAEMON_LIST) ROOSTER
with "condor-mywake.sh" extracting the machine name and using "ipmitool" to power it on,
wakes up nodes - but apparently without following ROOSTER_UNHIBERNATE_RANK?
The expression above should rank machines that have been switched off for longer
*higher* than recently offlined ones; instead I saw the same machine come up
multiple times in a row, with others still off.
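To make the expected ordering concrete, here is a plain-Python illustration of what the rank expression is meant to achieve. It is only a model of the intent (largest "time() - LastHeardFrom" first, capped at ROOSTER_MAX_UNHIBERNATE), not how condor_rooster evaluates the rank internally; the machine names and timestamps are invented.

```python
def unhibernate_rank(ad: dict, now: float) -> float:
    """Mirror of ROOSTER_UNHIBERNATE_RANK = (time() - MY.LastHeardFrom)."""
    return now - ad["LastHeardFrom"]


def pick_to_wake(ads: list[dict], max_unhibernate: int, now: float) -> list[str]:
    """Return up to max_unhibernate machine names, highest rank first."""
    ranked = sorted(ads, key=lambda ad: unhibernate_rank(ad, now), reverse=True)
    return [ad["Machine"] for ad in ranked[:max_unhibernate]]


# Hypothetical example data: node02 has been silent the longest.
now = 1_000_000.0
offline_ads = [
    {"Machine": "node01", "LastHeardFrom": now - 600},
    {"Machine": "node02", "LastHeardFrom": now - 86400},
    {"Machine": "node03", "LastHeardFrom": now - 3600},
]
wake_order = pick_to_wake(offline_ads, 2, now)
```

With this data, node02 and node03 would be expected to be woken before node01, which is the behaviour that was not observed.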
Also it's unclear what would happen if a particular machine selected for waking
would not come up properly - during my tests I saw the same machine addressed by
the Rooster over and over (because the script had a bug).
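For reference, a wake-up helper of the kind mentioned above could look roughly like this. It is a sketch under assumptions, not the condor-mywake.sh used here: it assumes the Rooster passes the offline machine ad to the command on standard input (as the default "condor_power -i" invocation suggests), and the BMC naming scheme, user name and password file path are hypothetical.

```python
import re
import subprocess
import sys

# Finds the Machine attribute in an old-style ClassAd; attributes such as
# MachineLastMatchTime do not match because "=" must follow the name.
_MACHINE = re.compile(r'^\s*Machine\s*=\s*"([^"]+)"', re.IGNORECASE | re.MULTILINE)


def machine_from_ad(ad_text: str) -> str:
    """Extract the Machine attribute or raise if the ad has none."""
    match = _MACHINE.search(ad_text)
    if match is None:
        raise ValueError("no Machine attribute in ClassAd")
    return match.group(1)


def bmc_host(machine: str) -> str:
    """Hypothetical naming scheme: node01.example.org -> node01-ipmi.example.org."""
    short, dot, domain = machine.partition(".")
    return f"{short}-ipmi{dot}{domain}"


def power_on_command(machine: str, user: str, password_file: str) -> list[str]:
    """Build the ipmitool call; -f reads the password from a file."""
    return ["ipmitool", "-I", "lanplus", "-H", bmc_host(machine),
            "-U", user, "-f", password_file, "chassis", "power", "on"]


def main() -> int:
    """Read the ad from stdin, power the node on, return ipmitool's exit code."""
    machine = machine_from_ad(sys.stdin.read())
    cmd = power_on_command(machine, "admin", "/etc/condor/ipmi.pass")
    return subprocess.run(cmd, check=False).returncode
```

An installed script would end with "sys.exit(main())". Returning a non-zero exit code on failure at least makes a broken wake-up attempt visible in the logs, even if it does not stop the same machine from being selected again.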
Since machines to be woken are selected by the Unhibernate attribute, which is
derived from MachineLastMatchTime, it seems it is the Matchmaker that needs to be
reconfigured, likely by modifying NEGOTIATOR_PRE_JOB_RANK?
Should I add some randomness (to the same macro, or somewhere else?) to
possibly work around failures of matched nodes? The hardware isn't that new, and
we've seen losses before, even without the extra stress from powering down/up and
the related temperature changes.
To answer some of my other questions,
> This is question #3:
> Without any job pressure, is there a means to find out what ROOSTER would do
> *if* there was need for more resources?
> Any means to run the ROOSTER_WAKEUP_CMD for a certain machine?
A feasible (but complicated-looking, as it needs a job per machine) way would be
to add a Requirements expression to the job submit file that requests a certain host.
I haven't found any other means of telling the Rooster to wake up a node as long as
idle (matching) ones remain.
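The job-per-machine approach could be scripted along these lines. This is a rough sketch: the "wake-<machine>.sub" file names and the use of /bin/true as a dummy payload are assumptions, and condor_submit adds further clauses (architecture, memory, disk) to the requirements that the offline ad also has to satisfy.

```python
def pinned_submit(machine: str) -> str:
    """Return a submit description whose job only matches one machine."""
    return "\n".join([
        "executable = /bin/true",
        f'requirements = (TARGET.Machine == "{machine}")',
        "queue",
        "",
    ])


def submit_files(machines: list[str]) -> dict[str, str]:
    """Map a hypothetical file name to the submit text for each machine."""
    return {f"wake-{name}.sub": pinned_submit(name) for name in machines}
```

Each generated description would then be passed to condor_submit; once the negotiator matches the job against the offline ad, the Rooster should consider that machine in its next cycle.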
> (Related: which UID is used to run that?)
"ipmitool -I lanplus" doesn't require special permissions :)
> Question #4:
> With the "DaemonStartTime" debacle in mind, how would I ROOSTER_UNHIBERNATE_RANK
> machines higher that have been switched off for longer (so the wear gets more
> balanced between all hardware)? Would e.g. "LastHeardFrom" work? Anything better?
Yes and no, see above. I'd like to have this fixed, but it's not mission-critical.
> Question #5a:
> Currently there are multiple sets of entries in the COLLECTOR_PERSISTENT_AD_LOG
> for a single machine, is this normal?
> Will the file be compactified (how and when)?
It gets rewritten, and versioned in some way (I now have .3, .4 and the "main"
one), but I'm not sure about the lifetime. I haven't set OFFLINE_EXPIRE_ADS_AFTER,
so this should keep the Offline nodes forever, I suppose.
Thanks so far (in particular to Zach and Christoph),
Steffen