We're in the process of refreshing our Condor provision and have a cluster with a 7.4.3 Linux central controller and 7.4.2 Windows 7 worker nodes. We're also working on power management for the clusters.
We've got the wake-up side in place (OFFLINE_LOG is set and is being written to; OFFLINE_EXPIRE_ADS_AFTER isn't set, so it should default to MAXINT), and nodes are getting woken up to service jobs.
The problem we're seeing is that in some cases the OS hibernates even though the node is claimed. Condor either isn't told or doesn't notice (I'm not sure which) that this has happened, so the job sits "running" for two hours before giving up.
At the moment hibernation of machines is controlled outside Condor. They can be woken up either by Condor or outside Condor (for automatic nightly updates).
Two problems I see ...
1. Our hibernation process doesn't notice if a slot is claimed/busy but hasn't got round to starting the user processes yet. What's the neatest way to check this? Check the output of "condor_status" on the local machine?
2. Assuming that the external process will sometimes hibernate a machine (or even restart the OS) while it's running jobs, how should we handle this? My reading of the code (ResMgr.cpp) is that Condor only closes down the slots and vacates jobs when Condor itself decides to hibernate the node, not when something outside Condor does it. Should we be running condor_off as part of the [external] hibernate/restart process?
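To make the check in point 1 concrete, here is a minimal sketch of what "check the output of condor_status on the local machine" could look like. It assumes Python 3 is available on the worker node and that condor_status is on the PATH. The function names, the choice of which states count as busy, and the decision to refuse hibernation when nothing can be parsed are all illustrative assumptions, not anything Condor prescribes. The point is that a slot in state Claimed with activity Idle (claimed, but no user process started yet) is treated as busy.

```python
import socket
import subprocess

# Assumption: these slot states mean Condor has plans for the machine,
# even if no user process is running yet (e.g. Claimed/Idle).
BUSY_STATES = {"Claimed", "Matched", "Preempting"}


def slot_states(status_output):
    """Parse lines of 'Name State Activity' into (name, state, activity) tuples."""
    slots = []
    for line in status_output.splitlines():
        fields = line.split()
        if len(fields) == 3:
            slots.append(tuple(fields))
    return slots


def safe_to_hibernate(status_output):
    """Return True only if every reported slot is in a non-busy state."""
    slots = slot_states(status_output)
    if not slots:
        # Nothing parsable: the state is unknown, so refuse rather than guess.
        return False
    return all(state not in BUSY_STATES for _, state, _ in slots)


def query_local_slots(host=None):
    """Ask the local startd directly for each slot's name, state and activity."""
    host = host or socket.getfqdn()
    cmd = [
        "condor_status", "-direct", host,
        "-format", "%s ", "Name",
        "-format", "%s ", "State",
        "-format", "%s\n", "Activity",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout
```

The external hibernation script would then call safe_to_hibernate(query_local_slots()) and skip hibernation when it returns False. This only covers point 1; it does nothing for the case in point 2 where the machine is hibernated or restarted regardless.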
Paul