Re: [HTCondor-users] Hibernate and Cron interference



Hi Todd,

Indeed, I have both UPDATE_COLLECTOR_WITH_TCP set to False and
ABSENT_REQUIREMENTS set to True.

So yes, I'm asking for double trouble it seems :).

I will adjust the configuration with your suggestions tomorrow morning
and see how it goes. I will be perfectly happy to drop the custom script
we use to fiddle with ClassAds on the CM.
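
A rough, untested sketch of the adjusted configuration, based only on the
suggestions quoted below. Where the lines live in the config files is an
assumption, and the ABSENT_REQUIREMENTS expression is Todd's guess, not
something confirmed by testing:

```
# On the EPs: delete the local override so the default (True) applies again.
# UPDATE_COLLECTOR_WITH_TCP = False    <- line to remove

# On the CM: only mark an ad absent when the machine is not hibernating,
# so that offline (hibernating) ads are not also tagged absent and stay
# visible to condor_rooster.
ABSENT_REQUIREMENTS = (HibernationLevel ?: 0) == 0
```

After the change, condor_config_val -v UPDATE_COLLECTOR_WITH_TCP on an EP
should report True.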


Thanks for the insight; I will update you this Wednesday!


-- 
Charles



Todd Tannenbaum wrote:
> Hi Charles,
> 
> Re the below, nice sleuthing work! Complete with packet sniffing! :)
> 
> Some thoughts:
> 
> In your pool config, did you customize the value of config knob
> UPDATE_COLLECTOR_WITH_TCP ? The default value for this knob is "True",
> which should result in only TCP being used to send updates to the collector.
> 
> On your Execution Point, or "EP" (i.e. machines where you run the startd), what does:
> 
>     condor_config_val -v UPDATE_COLLECTOR_WITH_TCP
> 
> say? If you set the value to False, I'd suggest removing the line in your
> config where you set it, so that the default value of True is used.
> If your UPDATE_COLLECTOR_WITH_TCP setting is True, please let us know,
> because in that case the below looks like a bug we will need to track down
> and squash.
> 
> Next, on your central manager, it is strange that you felt the need for a
> cronjob to remove the absent flag. The story here is HTCondor has the
> notion of "Offline" ads, and also the notion of "Absent" ads. Offline ads
> are for EPs that the startd explicitly shut down via a hibernate expression,
> while Absent ads are for EPs that simply disappeared for any unexplained
> reason. The condor_rooster looks for offline ads, but cannot "see" absent
> ads. The (maybe faulty?) original thinking is an EP could be offline, or
> absent, but not both at the same time. What does condor_config_val say for
> ABSENT_REQUIREMENTS? My guess is you have it set to be "True" or something
> similar. I have not confirmed this with testing, but my guess is if you
> instead set it to be something like " ABSENT_REQUIREMENTS =
> (HibernationLevel?:0) == 0 ", then you could eliminate the need for your
> script to eliminate the Absent flag for Rooster, because offline ads would
> no longer be tagged as also being absent.
> 
> Hope the above helps,
> Todd
> 
> 
> 
> 
> On 1/26/2023 9:18 AM, Charles Goyard wrote:
> > Dear list,
> > 
> > Following the December discussion on hibernation and dynamic slots, I
> > ended up with the following configuration for hibernation:
> > 
> > IdleTimeout = 600
> > SecondsMachineIdle = 0
> > HibernateState = "S5"
> > 
> > ShouldHibernate =   (SecondsMachineIdle > $(IdleTimeout))
> > 
> > HIBERNATE = ifThenElse ( $(ShouldHibernate), $(HibernateState), "NONE" )
> > HIBERNATE_CHECK_INTERVAL = 60
> > 
> > use feature:StartdCronContinuous(SecondsMachineIdleUpdater,/usr/local/htcondor/update_secondsmachineidle.sh)
> > (the update_secondsmachineidle.sh updates SecondsMachineIdle and works well)
> > 
> > 
> > On the CM, there is a system cronjob that removes the absent flag for
> > machines that have HibernationState = "S5", to make them visible to
> > Rooster. I only want to handle the ones in S5 to avoid waking up
> > computers that have been manually turned off (e.g. for maintenance).
> > 
> > This system cronjob looks like this:
> > 
> > hibernate_machines=$(condor_status -absent -constraint "HibernationLevel == 5" -af Name)
> > 
> > for machine in ${hibernate_machines} ; do
> >      machine_classad=$(condor_status -absent -long ${machine})
> >      echo "${machine_classad}" | grep -v 'Absent' | /usr/sbin/condor_advertise UPDATE_STARTD_AD - > /tmp/$$ 2>&1
> > done
> > 
> > 
> > 
> > But I have a problem, where the machines go to hibernation correctly,
> > but do not wake up ONLY when the Cron task is running.
> > 
> > 
> > Here's the description:
> > 
> > With the SecondsMachineIdleUpdater Cron disabled:
> > 
> > When a machine goes to hibernate, it correctly sends a ClassAd with
> > HibernationState=S5 over TCP. The Collector gets it and stores the Ad
> > as Absent and Offline.
> > 
> > The ClassAd shows:
> > 
> > Absent = true
> > CanHibernate = true
> > HibernationLevel = 5
> > HibernationState = "S5"
> > HibernationSupportedStates = "S3,S4,S5"
> > Offline = true
> > 
> > 
> > Then on the CM, I have this system cronjob that removes the Absent flag
> > for machines that have HibernationState = "S5", so Rooster can see the
> > machine again and wake it up.
> > 
> > It works very well, as long as the Cron task is not running.
> > 
> > 
> > But, as soon as I enable the StartdCronContinuous Cron task on the
> > compute node, the machine goes to hibernate, it correctly sends a
> > ClassAd with HibernationState=S5 over TCP, AND a ClassAd with
> > HibernationState = "NONE" over UDP, in the same second. I found out
> > about this spurious UDP ClassAd by capturing the network traffic.
> > 
> > The ClassAd shows:
> > 
> > Absent = true
> > CanHibernate = true
> > HibernationLevel = 0
> > HibernationState = "NONE"
> > HibernationSupportedStates = "S3,S4,S5"
> > 
> > 
> > So my system cronjob fails to remove the absent flag, since
> > HibernationState is not set to S5.
> > 
> > 
> > I'm not sure what's going on here. A race condition, a bug, or just
> > me not doing things correctly?
> > 
> > 
> > For completeness: when sniffing network packets with the Cron task off,
> > I also found a UDP ClassAd being sent, but only its beginning
> > (incomplete transmission), so I guess the Collector just drops it.
> > 
> > The ClassAd shows:
> > 
> > Absent = true
> > CanHibernate = true
> > HibernationLevel = 0
> > HibernationState = "NONE"
> > HibernationSupportedStates = "S3,S4,S5"
> > 
> > Running 10.0.1 on Debian 11.
> > 
> > 
> > As a workaround, I can remove the HibernationState = "S5" filter, but
> > I believe either I did not understand how to manage Absent ClassAds, or
> > there is something fishy out there.
> > 
> > I can provide more logs/information if needed.
> > 
> > 
> > Thanks for your help !
> > 
> 
> 
> -- 
> Todd Tannenbaum<tannenba@xxxxxxxxxxx>   University of Wisconsin-Madison
> Center for High Throughput Computing    Department of Computer Sciences
> Calendar:https://tinyurl.com/yd55mtgd   1210 W. Dayton St. Rm #4257
> Phone: (608) 263-7132                   Madison, WI 53706-1685