Re: [HTCondor-users] Need help debugging HIBERNATE



Hi Zach, all,

On Mon, 2025-11-17 at 19:27:28 +0000, Zach McGrew wrote:
> Hey Steffen,
> 
> To answer #1, you would need D_FULLDEBUG to get the info out.

I eventually found that out using the source ...
... but it only got me a bit further.
For testing, in addition to setting STARTD_DEBUG, I had changed the *Wait
delays to 10 minutes; with a check cycle of 5 minutes and the check actually
starting before the benchmark was done, I indeed got two lines
"allHibernating: slot slot1: 'NONE' (0x0)", 5 minutes apart.
Then - nothing more.
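
(Roughly what the test config boiled down to - the last macro name is just a
placeholder for whatever the local "*Wait" knobs are actually called:)

  STARTD_DEBUG = D_FULLDEBUG
  HIBERNATE_CHECK_INTERVAL = 300   # check cycle: 5 minutes
  SomeWait = (10 * 60)             # placeholder for the "*Wait" delays, cut to 10 min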

If I had known about debug() I might have found the culprit earlier...

> There is at least one thing that gets checked before the timer for HIBERNATE_CHECK_INTERVAL gets initialized; the results get published to "CanHibernate" in the Machine Ad. Check the results and see if you're getting that far in the process:

I saw those before, and they looked OK.
(I'm now wondering about what I also see here:
> WakeOnLanEnabledFlags = "Magic Packet"
> WakeOnLanSupportedFlags = "Physical Packet,UniCast Packet,MultiCast Packet,BroadCast Packet,Magic Packet"
and how I could make use of the alternatives - knowing that the "magic packet"
doesn't seem to work with my hardware, as `wakeonlan` even with `-i` doesn't
do anything to a powered-off node.)
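
(For reference, the kind of invocation that fails for me - MAC and broadcast
address are placeholders, of course:)

  wakeonlan aa:bb:cc:dd:ee:ff
  wakeonlan -i 192.168.1.255 aa:bb:cc:dd:ee:ff   # directed at the subnet broadcast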

> You're overriding the WoL check, which will print a warning in the logs when it actually comes time to hibernate, but the IsWake* will show you what Condor found for your host. We had to enable the WoL settings in the BIOS/UEFI firmware on our systems before it worked.

For one type of machine I've checked it; for another I know there's "N/A"
shown instead. I haven't gathered the "condor_status" info for the latter yet,
though.

> Having set up the hibernation and Rooster stuff here a couple of years ago, I will warn you that there are a few "gotchas" lurking about, and it doesn't work out of the box (or not quite as documented). At least one issue is in Jira [1], but it's been marked as backlog for the last couple of years.

Hm, I had found this discussed on the mailing list around late 2022/early 2023,
I think - but how would I see this happening on my side? (I have a valid _AD_LOG
- which might need to be compressed at some point? - but don't really know
what to look for.)

> For Systemd, you need to override the service to kill it with SIGKILL to prevent it from invalidating its Machine Ad on shutdown/hibernation. You can create a "/etc/systemd/system/condor.service.d" directory, and drop an override.conf file that contains this:
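
(The drop-in contents got lost in my quoting; my assumption is that it boils
down to something like this, with KillSignal doing the work:)

  # /etc/systemd/system/condor.service.d/override.conf
  # my reconstruction, not the original: stop condor with SIGKILL so the
  # daemons cannot invalidate their ads in the collector on the way down
  [Service]
  KillSignal=SIGKILL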

With LINUX_HIBERNATION_METHOD removed from the config, the default seems to
work (Debian 12, HTCondor 24.6.1 ... don't hit me...) and shuts the machine
down for now, but I'll keep this in mind if the Rooster part doesn't work as
expected.

> # ---------- Offline Monitoring & Power Management ----------
> ABSENT_REQUIREMENTS = True
> EXPIRE_INVALIDATED_ADS = True
> COLLECTOR_PERSISTENT_AD_LOG = /var/lib/condor/spool/OfflineLog.log
> DAEMON_LIST = $(DAEMON_LIST) ROOSTER
> # ---------- Offline Monitoring & Power Management ----------
> 
> 
> With that, you need something that converts the absent ads to offline ads. I do this with a little shell script Systemd service that removes the "Absent" variable, but there's probably a better way:
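
(The script itself wasn't quoted; my guess at its shape - the exact flags are
assumptions on my part:)

  #!/bin/sh
  # guessed reconstruction: fetch the absent ads, drop the Absent
  # attribute, and push the ads back so Rooster sees the machines as offline
  condor_status -absent -long \
    | grep -v '^Absent = ' \
    | condor_advertise -multiple UPDATE_STARTD_AD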

Seeing this, I'm getting the impression that Green Computing didn't get as much
love as one would expect - why doesn't this work out of the box instead?

> That strips the Absent line, and re-sends the Machine Ad, so it appears in the condor_status list and Rooster can see it as offline. I launch it with a Systemd service:
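
(Again, the unit itself didn't make it into the quote - I picture a oneshot
service plus a timer along these lines; all names here are my invention:)

  # /etc/systemd/system/condor-absent2offline.service (hypothetical name)
  [Unit]
  Description=Re-advertise absent HTCondor ads as offline
  After=condor.service

  [Service]
  Type=oneshot
  ExecStart=/usr/local/sbin/condor-absent2offline.sh

  # paired with a condor-absent2offline.timer containing e.g.
  # [Timer] OnUnitActiveSec=5min to run it periodically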

Your email will become an important part of our cluster documentation, since
I've got to hand over responsibilities to someone younger (but also less
experienced) within almost no time ... things are already complicated enough,
so why is HTCondor that incomplete? (Or is it just us wanting it to do things
it wasn't meant for? OTOH, it took how many years until NumDynamicSlots* was
introduced?)

> I enable that service on the collector (which also runs Rooster for me). When the Condor team gets around to addressing [1], the service should no longer be required because the machine can go to an offline state without going to an absent state first.

I hope there's hope for this one (and the related #1807); 2 1/2 years is short
compared to the whole life cycle of the project.

> I realize this email may be getting a little too long, but I hope it helps. I pieced a lot of this together between older emails to this list, trial and error, and reading the code base on GitHub. It does work pretty great though, we use Condor's power management on our cluster (~40 nodes), and our academic computer labs in Computer Science (~300 desktops).

I'll certainly reread it more than once, your effort is highly appreciated!

Thanks, Steffen