
Re: [HTCondor-users] Hibernate and Cron interference



Hi Charles,

Re the below, nice sleuthing work!  Complete with packet sniffing! :)

Some thoughts:

In your pool config, did you customize the value of config knob UPDATE_COLLECTOR_WITH_TCP ?  The default value for this knob is "True", which should result in only TCP being used to send updates to the collector.

On your Execution Point, or "EP" (i.e. machines where you run the startd), what does:

    condor_config_val -v UPDATE_COLLECTOR_WITH_TCP

say? If you set the value to False, I'd suggest removing the line in your config where you set it, so that the default value of True is used.
If your UPDATE_COLLECTOR_WITH_TCP setting is True, please let us know, because in that case the below looks like a bug we will need to track down and squash.

Next, on your central manager, it is strange that you felt the need for a cronjob to remove the absent flag.  The story here is that HTCondor has the notion of "Offline" ads, and also the notion of "Absent" ads.  Offline ads are for EPs that the startd explicitly shut down via a hibernate expression, while Absent ads are for EPs that simply disappeared for any unexplained reason.  The condor_rooster looks for offline ads, but cannot "see" absent ads.  The (maybe faulty?) original thinking is that an EP could be offline, or absent, but not both at the same time.

What does condor_config_val say for ABSENT_REQUIREMENTS ?  My guess is you have it set to be "True" or something similar.  I have not confirmed this with testing, but my guess is that if you instead set it to be something like  " ABSENT_REQUIREMENTS =  (HibernationLevel?:0) == 0 ", then you could eliminate the need for your script to remove the Absent flag for Rooster, because offline ads would no longer be tagged as also being absent.

Hope the above helps,
Todd




On 1/26/2023 9:18 AM, Charles Goyard wrote:
Dear list,

following the December discussion on hibernation and dynamic slots, I
ended up with the following configuration for hibernation:

IdleTimeout = 600
SecondsMachineIdle = 0
HibernateState = "S5"

ShouldHibernate =   (SecondsMachineIdle > $(IdleTimeout))

HIBERNATE = ifThenElse ( $(ShouldHibernate), $(HibernateState), "NONE" )
HIBERNATE_CHECK_INTERVAL = 60

use feature:StartdCronContinuous(SecondsMachineIdleUpdater,/usr/local/htcondor/update_secondsmachineidle.sh)
(the update_secondsmachineidle.sh updates SecondsMachineIdle and works well)


On the CM, there is a system cronjob that removes the absent flag for
machines that have HibernationState = "S5", to make them visible to
Rooster. I only want to handle the ones in S5 to avoid waking up
computers that have been manually turned off (e.g. for maintenance).

This system cronjob looks like this:

hibernate_machines=$(condor_status -absent -constraint "HibernationLevel == 5" -af Name)

for machine in ${hibernate_machines} ; do
    machine_classad=$(condor_status -absent -long ${machine})
    echo "${machine_classad}" | grep -v 'Absent' | /usr/sbin/condor_advertise UPDATE_STARTD_AD - > /tmp/$$ 2>&1
done
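
One fragile spot in the loop above is grep -v 'Absent', which drops every
line containing that substring, not only the Absent attribute itself. Below
is a minimal sketch of a stricter filter. It assumes the -long output has one
"Name = value" pair per line, and strip_absent_attr is a hypothetical helper
name, not an HTCondor tool:

```shell
#!/bin/sh
# Hypothetical helper (not part of HTCondor): read a ClassAd in long form
# ("Name = value", one attribute per line) on stdin and drop only the
# attribute whose name is exactly "Absent". The comparison ignores case.
strip_absent_attr() {
    awk -F'=' '{
        name = $1                  # text left of the first "="
        gsub(/[ \t]/, "", name)    # strip surrounding whitespace
        if (tolower(name) != "absent") print
    }'
}
```

In the loop it would stand in for the grep stage, for example:
condor_status -absent -long "${machine}" | strip_absent_attr | /usr/sbin/condor_advertise UPDATE_STARTD_AD -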



But I have a problem: the machines go to hibernation correctly, but
fail to wake up, and ONLY when the Cron task is running.


Here's the description:

With the SecondsMachineIdleUpdater Cron disabled:

When a machine goes to hibernate, it correctly sends a ClassAd with
HibernationState=S5 over TCP. The Collector gets it and stores the ad
as Absent and Offline.

The ClassAd shows:

Absent = true
CanHibernate = true
HibernationLevel = 5
HibernationState = "S5"
HibernationSupportedStates = "S3,S4,S5"
Offline = true


Then on the CM, I have this system cronjob that removes the Absent flag
for machines that have HibernationState = "S5", so Rooster can see the
machine again and wake it up.

It works very well, as long as the Crontask is not running.


But as soon as I enable the StartdCronContinuous Cron task on the
compute node, a machine that goes to hibernate correctly sends a
ClassAd with HibernationState=S5 over TCP, AND a ClassAd with
HibernationState = "NONE" over UDP, in the same second. I found out
about this spurious UDP ClassAd by capturing the network traffic.

The ClassAd shows:

Absent = true
CanHibernate = true
HibernationLevel = 0
HibernationState = "NONE"
HibernationSupportedStates = "S3,S4,S5"


So my system cronjob fails to remove the absent flag, since
HibernationState is not set to S5.


I'm not sure what's going on here. A race condition, a bug, or am I
just not doing things correctly?


For completeness, I also found out while sniffing network packets that
when the Cron is off, a UDP ClassAd is sent as well, but only the start
of it arrives (incomplete transmission), so I guess the Collector just
drops it.

The ClassAd shows:

Absent = true
CanHibernate = true
HibernationLevel = 0
HibernationState = "NONE"
HibernationSupportedStates = "S3,S4,S5"

Running 10.0.1 on Debian 11.


As a workaround, I can remove the HibernationState = "S5" filter, but
I believe either I did not understand how to manage Absent ClassAds, or
there is something fishy out there.

I can provide more logs/information if needed.


Thanks for your help !



-- 
Todd Tannenbaum <tannenba@xxxxxxxxxxx>  University of Wisconsin-Madison
Center for High Throughput Computing    Department of Computer Sciences
Calendar: https://tinyurl.com/yd55mtgd  1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                   Madison, WI 53706-1685