Re: [HTCondor-users] Need help debugging HIBERNATE
- Date: Mon, 17 Nov 2025 19:27:28 +0000
- From: Zach McGrew <mcgrewz@xxxxxxx>
- Subject: Re: [HTCondor-users] Need help debugging HIBERNATE
Hey Steffen,
To answer #1, you would need D_FULLDEBUG to get the info out. You could limit it to STARTD_DEBUG = D_FULLDEBUG on one machine for testing. I would suggest wrapping the expressions with the ClassAd debug() function to see what's evaluating, but again that requires D_FULLDEBUG to use.
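As a minimal sketch of that suggestion, the helper below prints a config fragment that turns on D_FULLDEBUG for the startd and wraps the existing HIBERNATE expression in debug(). The function name is a placeholder of mine, and the fragment assumes that a self-referencing $(HIBERNATE) macro expands to the previously configured expression, the same way $(DAEMON_LIST) is commonly extended:

```shell
#!/bin/sh
# Sketch: print a config fragment for debugging HIBERNATE on one test machine.
# hibernate_debug_config is a hypothetical helper name, not an HTCondor tool.
hibernate_debug_config() {
    cat <<'EOF'
# Test machine only: D_FULLDEBUG is verbose
STARTD_DEBUG = D_FULLDEBUG
# Wrap the existing expression so its evaluation shows up in the log
# (assumes $(HIBERNATE) expands to the earlier definition)
HIBERNATE = debug($(HIBERNATE))
EOF
}
```

Redirect the output into a file in the test node's local config directory (for example /etc/condor/config.d/99-hibernate-debug.conf, which is an assumed path), run condor_reconfig on that node, and watch the StartdLog.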
At least one thing gets checked before the timer for HIBERNATE_CHECK_INTERVAL is initialized, and the result is published as "CanHibernate" in the Machine Ad. Check the results and see if you're getting that far in the process:
condor_status -long slot1@xxxxxxxxxxxxxxxxxxxxxxxx | grep -Ei 'hiber|wake'
CanHibernate = true
HibernationLevel = 0
HibernationState = "NONE"
HibernationSupportedStates = "S3,S4,S5"
IsWakeAble = true
IsWakeOnLanEnabled = true
IsWakeOnLanSupported = true
Unhibernate = MY.MachineLastMatchTime =!= undefined
WakeOnLanEnabledFlags = "Magic Packet"
WakeOnLanSupportedFlags = "Physical Packet,UniCast Packet,MultiCast Packet,BroadCast Packet,Magic Packet"
You're overriding the WoL check, which will print a warning in the logs when it actually comes time to hibernate, but the IsWake* attributes will show you what Condor found for your host. We had to enable the WoL settings in the BIOS/UEFI firmware on our systems before it worked.
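If you want to run that check across several hosts, here is a rough sketch of a filter that reads condor_status -long output on stdin and reports any of the attributes above that are not true. The function name and the message wording are my own, and it only looks at the four attributes listed in the awk pattern:

```shell
#!/bin/sh
# Sketch: flag hibernation-related attributes that are not "true" in a Machine Ad.
# Reads `condor_status -long` output on stdin; check_hibernate_ad is a hypothetical name.
check_hibernate_ad() {
    awk -F' = ' '
        $1 == "CanHibernate" || $1 == "IsWakeAble" ||
        $1 == "IsWakeOnLanEnabled" || $1 == "IsWakeOnLanSupported" {
            seen[$1] = 1
            if (tolower($2) != "true") { print "not true: " $1 " = " $2; bad = 1 }
        }
        END {
            # CanHibernate is the attribute published before the check timer starts
            if (!("CanHibernate" in seen)) { print "not published: CanHibernate"; bad = 1 }
            exit bad
        }
    '
}
```

Pipe the output of the condor_status -long command shown above into check_hibernate_ad. It prints nothing and exits 0 when all four attributes are true. A WoL attribute that is not true is only informational if you keep HIBERNATION_OVERRIDE_WOL = True.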
Having set up the hibernation and Rooster stuff here a couple of years ago, I will warn you that there are a few "gotchas" lurking about, and it doesn't work out of the box (or not quite as documented). At least one issue is in Jira [1], but it's been marked as backlog for the last couple of years.
For Systemd, you need to override the service so that Condor is killed with SIGKILL, which prevents it from invalidating its Machine Ad on shutdown/hibernation. You can create a "/etc/systemd/system/condor.service.d" directory, and drop in an override.conf file that contains this:
[Service]
KillSignal=SIGKILL
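A small sketch of creating that drop-in from the shell follows. The function name is hypothetical, and the optional argument exists only so the helper can be tried against a scratch directory before touching /etc/systemd/system:

```shell
#!/bin/sh
# Sketch: write the condor.service drop-in with KillSignal=SIGKILL.
# install_condor_override is a hypothetical helper name. The optional first
# argument replaces /etc/systemd/system as the base directory for a trial run.
install_condor_override() {
    dropin_dir="${1:-/etc/systemd/system}/condor.service.d"
    mkdir -p "${dropin_dir}" || return 1
    printf '[Service]\nKillSignal=SIGKILL\n' > "${dropin_dir}/override.conf"
}
```

Run install_condor_override as root, then systemctl daemon-reload so Systemd rereads its unit files. systemctl cat condor.service should then list the override.conf drop-in.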
The override tells Systemd to SIGKILL Condor when it goes to shut down the machine, which prevents Condor from invalidating its Machine ClassAd (if the ad is invalidated, Rooster can't wake the machine up). One downside to killing Condor instead of shutting it down normally is that you now need to capture absent ads (because the machine stopped phoning home to the collector and timed out) and convert them to offline ads (Rooster only tries to wake up machines with offline ads). First you need to keep track of Absent Ads (here's the snippet from my config):
# ---------- Offline Monitoring & Power Management ----------
ABSENT_REQUIREMENTS = True
EXPIRE_INVALIDATED_ADS = True
COLLECTOR_PERSISTENT_AD_LOG = /var/lib/condor/spool/OfflineLog.log
DAEMON_LIST = $(DAEMON_LIST) ROOSTER
# ---------- Offline Monitoring & Power Management ----------
With that, you need something that converts the absent ads to offline ads. I do this with a little shell script, run as a Systemd service, that removes the "Absent" attribute from the ad, but there's probably a better way:
#!/bin/sh
SLEEP_TIME=$(condor_config_val ROOSTER_INTERVAL)
if echo "${SLEEP_TIME}" | grep -q 'Not defined' ; then
    SLEEP_TIME=300
fi
POOL=$(condor_config_val COLLECTOR_HOST)
if echo "${POOL}" | grep -q 'Not defined' ; then
    POOL='localhost'
fi
while true
do
    sleep ${SLEEP_TIME}
    for h in $(condor_status -pool "${POOL}" -absent | grep slot1@ | cut -d ' ' -f 1)
    do
        1>&2 date -u
        1>&2 echo "Host: ${h}"
        if condor_status -pool "${POOL}" -absent -long "${h}" | grep -qi 'START = ' ; then
            # Update ad if still valid (contains start expression)
            condor_status -pool "${POOL}" -absent -long "${h}" | \
                grep -v '^Absent =' | \
                condor_advertise -pool "${POOL}" UPDATE_STARTD_AD_WITH_ACK -
        else
            1>&2 echo "Invalid classad detected!"
            1>&2 condor_status -pool "${POOL}" -absent -long "${h}"
        fi
    done
done
That strips the Absent line, and re-sends the Machine Ad, so it appears in the condor_status list and Rooster can see it as offline. I launch it with a Systemd service:
[Unit]
Description=Convert HTCondor Absent To Offline
After=condor.service
[Service]
ExecStart=/bin/sh /etc/condor/a2o.sh
[Install]
WantedBy=multi-user.target
I enable that service on the collector (which also runs Rooster for me). When the Condor team gets around to addressing [1], the service should no longer be required because the machine can go to an offline state without going to an absent state first.
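The script above falls back to defaults by matching the text 'Not defined' in the output of condor_config_val. A variant that relies on the exit status instead could look like the sketch below. It assumes condor_config_val exits non-zero (or prints nothing on stdout) when the variable is undefined, and config_or_default is a helper name I made up:

```shell
#!/bin/sh
# Sketch: look up an HTCondor config value, falling back to a default.
# Assumes condor_config_val exits non-zero (or prints nothing on stdout) when
# the variable is undefined; config_or_default is a hypothetical helper name.
config_or_default() {
    value=$(condor_config_val "$1" 2>/dev/null) || value=''
    if [ -n "${value}" ]; then
        printf '%s\n' "${value}"
    else
        printf '%s\n' "$2"
    fi
}
```

In the script, the two lookups would then become SLEEP_TIME=$(config_or_default ROOSTER_INTERVAL 300) and POOL=$(config_or_default COLLECTOR_HOST localhost).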
I realize this email may be getting a little too long, but I hope it helps. I pieced a lot of this together from older emails to this list, trial and error, and reading the code base on GitHub. It does work pretty well, though: we use Condor's power management on our cluster (~40 nodes) and in our academic computer labs in Computer Science (~300 desktops).
-Zach
Reference URLs:
1. https://opensciencegrid.atlassian.net/browse/HTCONDOR-1806
________________________________________
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Steffen Grunewald <steffen.grunewald@xxxxxxxxxx>
Sent: Monday, November 17, 2025 2:29 AM
To: HTCondor Users Mailinglist
Subject: [HTCondor-users] Need help debugging HIBERNATE
Good morning,
I'm in the middle of some tests to "hibernate" (which might include going
to full S5 state) and "wakeup" HTCondor EPs, and need some suggestions to
fill in what I feel to be gaps in the documentation.
To start with, I have the following in the config (I'm dropping ROOSTER
stuff and intermediate definitions for clarity):
# condor_config_val -dump -expand | grep -i Hiber
HIBERNATE = ifThenElse(( (State == "Unclaimed") && ( ((time() - EnteredCurrentState) > (30 * 60)) || ((time() - NumDynamicSlotsTime) > (30 * 60)) ) && ((time() - DaemonStartTime) > (6 * 3600)) ), "SHUTDOWN", "NONE")
HIBERNATE_CHECK_INTERVAL = (5 * 60)
HIBERNATION_OVERRIDE_WOL = True
LINUX_HIBERNATION_METHOD = "/sys"
In short, I'd like to leave a machine on and running for at least 6 hours
after it was powered up (which would set the DaemonStartTime), and also
for 30 minutes after becoming fully Unclaimed. This expression evaluates
to True for a few machines - but I cannot see anything happening in the
STARTD log (nor on the central manager).
I don't have pm-suspend installed but would like to use systemd's features,
and I'm in serious doubt whether WOL would work with my hardware.
Running
# condor_status -f "%s:" Machine -af State 'ifThenElse(((State == "Unclaimed") && (((time()-EnteredCurrentState) > (30*60)) || ((time()-NumDynamicSlotsTime) > (30*60))) && ((time()-DaemonStartTime) > (6*3600))), "SHUTDOWN", "NONE")' | dshbak -c
indeed shows some nodes with "SHUTDOWN" - they've been unused for a couple
of days now.
I'm wondering whether the HIBERNATE_CHECK is run at all, as I couldn't find
any hint in the STARTD logs (with the exception of
"HibernationManager: Hibernation is enabled" of course).
So question #1 is: What *_DEBUG setting do I need to see the check happening,
and its outcome - without getting flooded (which D_FULLDEBUG would be doing)?
And where exactly to look for what, if not in the StartdLog?
(#2 will be added soon, I'm afraid.)
Thanks,
Steffen