
[HTCondor-users] Hibernate and Cron interference



Dear list,

Following the December discussion on hibernation and dynamic slots, I
ended up with the following configuration for hibernation:

IdleTimeout = 600
SecondsMachineIdle = 0
HibernateState = "S5"

ShouldHibernate =   (SecondsMachineIdle > $(IdleTimeout))

HIBERNATE = ifThenElse ( $(ShouldHibernate), $(HibernateState), "NONE" )
HIBERNATE_CHECK_INTERVAL = 60

use feature:StartdCronContinuous(SecondsMachineIdleUpdater,/usr/local/htcondor/update_secondsmachineidle.sh)
(The update_secondsmachineidle.sh script updates SecondsMachineIdle and works well.)


On the CM, there is a system cronjob that removes the Absent flag for
machines that have HibernationState = "S5", to make them visible to
Rooster. I only want to handle the ones in S5, to avoid waking up
computers that have been turned off manually (e.g. for maintenance).

This system cronjob looks like this:

# Names of absent machines whose last ad reported hibernation level 5 (S5)
hibernate_machines=$(condor_status -absent -constraint "HibernationLevel == 5" -af Name)

for machine in ${hibernate_machines} ; do
    machine_classad=$(condor_status -absent -long "${machine}")
    # Re-advertise the ad without its Absent lines so that Rooster sees it
    echo "${machine_classad}" | grep -v 'Absent' | /usr/sbin/condor_advertise UPDATE_STARTD_AD - > /tmp/$$ 2>&1
done



But I have a problem: the machines go into hibernation correctly, but
they fail to wake up, and this happens ONLY when the Cron task is running.


Here's the description:

With the SecondsMachineIdleUpdater Cron disabled:

When a machine goes into hibernation, it correctly sends a ClassAd with
HibernationState=S5 over TCP. The Collector gets it and stores the ad
as Absent and Offline.

The ClassAd shows:

Absent = true
CanHibernate = true
HibernationLevel = 5
HibernationState = "S5"
HibernationSupportedStates = "S3,S4,S5"
Offline = true


Then on the CM, I have this system cronjob that removes the Absent flag
for machines that have HibernationState = "S5", so Rooster can see the
machine again and wake it up.

It works very well, as long as the Cron task is not running.


But as soon as I enable the StartdCronContinuous Cron task on the
compute node, the behaviour changes. When the machine goes into
hibernation, it correctly sends a ClassAd with HibernationState=S5 over
TCP, AND a ClassAd with HibernationState = "NONE" over UDP, in the same
second. I found this spurious UDP ClassAd by capturing the network traffic.

The ClassAd shows:

Absent = true
CanHibernate = true
HibernationLevel = 0
HibernationState = "NONE"
HibernationSupportedStates = "S3,S4,S5"


So my system cronjob fails to remove the Absent flag, since
HibernationState is not set to S5.
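
To check which of the two ads the Collector actually kept, the
hibernation attributes can be pulled out of the -long output. A minimal
sketch, assuming the ad is in the "Name = value" format shown above;
classad_attr and is_hibernating_s5 are hypothetical helper names, not
HTCondor tools:

```shell
#!/bin/sh
# classad_attr NAME
# Print the value of attribute NAME from a ClassAd read on stdin in the
# "Name = value" form printed by condor_status -long. The match is on the
# exact attribute name, so HibernationState does not match
# HibernationSupportedStates.
classad_attr() {
    awk -v key="$1" '
        index($0, key " = ") == 1 {
            print substr($0, length(key) + 4)
            exit
        }'
}

# is_hibernating_s5
# Succeed if the ClassAd on stdin reports HibernationLevel = 5.
is_hibernating_s5() {
    [ "$(classad_attr HibernationLevel)" = "5" ]
}
```

It could be used as: condor_status -absent -long "${machine}" | is_hibernating_s5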


I'm not sure what's going on here: a race condition, a bug, or just
me not doing things correctly?


For completeness: when sniffing network packets with the Cron task off,
I also found that a UDP ClassAd is sent, but only its beginning arrives
(incomplete transmission), so I guess the Collector just drops it.

The ClassAd shows:

Absent = true
CanHibernate = true
HibernationLevel = 0
HibernationState = "NONE"
HibernationSupportedStates = "S3,S4,S5"
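
A capture along the following lines should be enough to see both the TCP
and the UDP ads. This is a sketch, not the exact command used for the
captures above: collector_filter is a hypothetical helper, the node name
and output file are placeholders, and 9618 is assumed because it is the
default HTCondor collector port.

```shell
#!/bin/sh
# collector_filter NODE [PORT]
# Print a tcpdump filter expression matching all traffic (TCP and UDP)
# to or from NODE on the collector port. PORT defaults to 9618, the
# HTCondor default; pass another value if the pool uses a different port.
collector_filter() {
    node=$1
    port=${2:-9618}
    printf 'host %s and port %s\n' "$node" "$port"
}

# Example, run as root on the CM (placeholders: node01.example.org and
# /tmp/hibernate.pcap):
#   tcpdump -i any -nn -w /tmp/hibernate.pcap "$(collector_filter node01.example.org)"
```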

I am running HTCondor 10.0.1 on Debian 11.


As a workaround, I can remove the HibernationState = "S5" filter, but
I believe either I did not understand how to manage Absent ClassAds, or
there is something fishy going on.

I can provide more logs/information if needed.


Thanks for your help!

-- 
Charles