
Re: [HTCondor-users] power management: ROOSTER_UNHIBERNATE not working



11/15/23 09:42:12 Got 0 startd ads matching ROOSTER_UNHIBERNATE=Offline

How do I troubleshoot and fix this?
	When a machine hibernates but rooster can't find its ad in the 
collector, the usual problem is that the startd correctly sent an offline 
ad to the collector but then sent an invalidate ad to the collector before 
actually shutting down; the invalidate ad invalidates the offline ad.
	To confirm / debug this, turn up the debug level on either your 
startd or your collector; the former should log when it sends ads and the 
latter when it receives them.
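	As a quick complementary check, you could also ask the collector 
directly whether it currently holds any offline ads. The following is only 
a minimal sketch, not an official tool: the helper names are hypothetical, 
and it assumes condor_status is on the PATH and that the constraint 
"Offline =?= True" is appropriate for your pool (it mirrors the 
ROOSTER_UNHIBERNATE=Offline expression in the log line above).

```python
import subprocess


def offline_ads_command(constraint="Offline =?= True"):
    """Build a condor_status command that prints the Name of each matching ad.

    The default constraint is an assumption; adjust it to match your
    ROOSTER_UNHIBERNATE expression.
    """
    return ["condor_status", "-constraint", constraint, "-af", "Name"]


def list_offline_machines(run=subprocess.run):
    """Return the names of ads the collector reports as offline.

    `run` is injectable so the parsing can be exercised without a pool.
    """
    result = run(offline_ads_command(), capture_output=True, text=True, check=True)
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


if __name__ == "__main__":
    names = list_offline_machines()
    print("%d offline ad(s) in the collector" % len(names))
    for name in names:
        print("  " + name)
```

	If this prints zero ads right after a machine hibernates, that is 
consistent with the offline ad having been invalidated.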
	Arguably, sending an invalidate ad shouldn't remove offline ads, 
but if your hibernate script allows/requires the system to shut down 
normally, that's probably the problem: the startd will invalidate its ad 
before exiting when sent a SIGTERM (as is the usual case).  This problem
has been reported to us before 
(https://opensciencegrid.atlassian.net/browse/HTCONDOR-1806), 
but we haven't been able to address it yet; my apologies.
	The work-around is to kill the startd with a SIGKILL before 
shutting HTCondor down; depending on how long shutdown takes, you may also 
need to kill the condor master to prevent it from respawning the startd.
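	A minimal sketch of that work-around, which a hibernate script 
could run just before the normal shutdown: it sends SIGKILL to the master 
first (so it cannot respawn the startd) and then to the startd. The 
function names are hypothetical, and it assumes a Linux host with pgrep 
available, processes named condor_master and condor_startd, and enough 
privilege to signal them.

```python
import os
import signal
import subprocess


def pids_of(name):
    """Return the PIDs of processes whose name is exactly `name` (via pgrep -x)."""
    result = subprocess.run(["pgrep", "-x", name], capture_output=True, text=True)
    return [int(pid) for pid in result.stdout.split()]


def hard_kill_condor(find=pids_of, kill=os.kill):
    """SIGKILL the master, then the startd, so no invalidate ad is sent.

    `find` and `kill` are injectable so the ordering can be checked
    without touching real processes.
    """
    killed = []
    # Master first: a dead master cannot restart the startd.
    for name in ("condor_master", "condor_startd"):
        for pid in find(name):
            kill(pid, signal.SIGKILL)
            killed.append((name, pid))
    return killed


if __name__ == "__main__":
    for name, pid in hard_kill_condor():
        print("sent SIGKILL to %s (pid %d)" % (name, pid))
```

	Note that this skips HTCondor's graceful shutdown entirely, so it 
is only reasonable on a machine that is about to hibernate anyway.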
- ToddM