Re: [HTCondor-users] Hibernate and Cron interference
- Date: Tue, 31 Jan 2023 11:42:58 -0600
- From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] Hibernate and Cron interference
Hi Charles,
Re the below, nice sleuthing work! Complete with packet sniffing!
:)
Some thoughts:
In your pool config, did you customize the value of the config knob
UPDATE_COLLECTOR_WITH_TCP? The default value for this knob is
"True", which should result in only TCP being used to send updates
to the collector.
On your Execution Point, or "EP" (i.e. machines where you run the
startd), what does:
condor_config_val -v UPDATE_COLLECTOR_WITH_TCP
say? If you set the value to False, I'd suggest removing the line
in your config where you set it, so that the default value of True
is used.
If your UPDATE_COLLECTOR_WITH_TCP setting is True, please let us
know, because in that case the below looks like a bug we will need
to track down and squash.
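A minimal sketch of that check as a script, for anyone who wants to run it across several EPs. The helper name tcp_updates_enabled is made up for illustration, and it only recognizes the literal spelling "true" in any letter case; the stand-in value lets the sketch run on a machine without HTCondor.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when the given knob value is "true" (any case).
tcp_updates_enabled() {
    knob_value=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    [ "$knob_value" = "true" ]
}

# On a real EP the value would come from HTCondor itself, e.g.:
#   value=$(condor_config_val UPDATE_COLLECTOR_WITH_TCP)
value="True"   # stand-in for the default, so the sketch runs anywhere

if tcp_updates_enabled "$value"; then
    echo "TCP updates enabled: a UDP ad would be unexpected"
else
    echo "TCP updates disabled: UDP updates are expected"
fi
```

If the first message is printed on an EP that still emits the UDP ad, that is the case worth reporting back to the list.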
Next, on your central manager, it is strange that you felt the
need for a cronjob to remove the absent flag. The story here is
that HTCondor has the notion of "Offline" ads, and also the notion
of "Absent" ads. Offline ads are for EPs that the startd explicitly
shut down via a hibernate expression, while Absent ads are for EPs
that simply disappeared for any unexplained reason. The
condor_rooster looks for offline ads, but cannot "see" absent
ads. The (maybe faulty?) original thinking is that an EP could be
offline, or absent, but not both at the same time. What does
condor_config_val say for ABSENT_REQUIREMENTS? My guess is you
have it set to be "True" or something similar. I have not
confirmed this with testing, but my guess is that if you instead
set it to be something like
ABSENT_REQUIREMENTS = (HibernationLevel ?: 0) == 0
then you could eliminate the need for your script to remove the
Absent flag for Rooster, because offline ads would no longer be
tagged as also being absent.
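To see what the suggested expression would do to the ads quoted below, here is a small shell model of it. It is only a sketch of the logic, not how the collector evaluates ClassAds: the function name is invented, `${1:-0}` stands in for the `?: 0` default, and it assumes that an expiring ad is tagged Absent only when the expression evaluates to true.

```shell
#!/bin/sh
# Hypothetical model of: ABSENT_REQUIREMENTS = (HibernationLevel ?: 0) == 0
# $1 is the ad's HibernationLevel, or empty when the attribute is undefined.
would_be_marked_absent() {
    hib_level=${1:-0}      # mirrors "?: 0": undefined falls back to 0
    [ "$hib_level" -eq 0 ]
}

# 5 = the S5 hibernation ad, 0 = the "NONE" ad, "" = attribute missing.
for sample in 5 0 ""; do
    if would_be_marked_absent "$sample"; then
        echo "HibernationLevel='$sample' -> tagged Absent"
    else
        echo "HibernationLevel='$sample' -> not tagged Absent"
    fi
done
```

Under these assumptions the S5 ad would stay Offline only, while an ad carrying HibernationLevel = 0, like the UDP one in the report below, would still be tagged Absent.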
Hope the above helps,
Todd
On 1/26/2023 9:18 AM, Charles Goyard wrote:
Dear list,
following the December discussion on hibernation and dynamic slots, I
ended up with the following configuration for hibernation:
IdleTimeout = 600
SecondsMachineIdle = 0
HibernateState = "S5"
ShouldHibernate = (SecondsMachineIdle > $(IdleTimeout))
HIBERNATE = ifThenElse ( $(ShouldHibernate), $(HibernateState), "NONE" )
HIBERNATE_CHECK_INTERVAL = 60
use feature:StartdCronContinuous(SecondsMachineIdleUpdater,/usr/local/htcondor/update_secondsmachineidle.sh)
(the update_secondsmachineidle.sh updates SecondsMachineIdle and works well)
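The HIBERNATE expression above reduces to a threshold test on SecondsMachineIdle. A minimal shell model of that test follows, with the values copied from the configuration; the function name hibernate_decision is hypothetical, and this only mimics the macro logic rather than the startd's own evaluation.

```shell
#!/bin/sh
IDLE_TIMEOUT=600   # IdleTimeout from the configuration above

# Hypothetical model of the HIBERNATE macro: $1 is SecondsMachineIdle.
hibernate_decision() {
    if [ "$1" -gt "$IDLE_TIMEOUT" ]; then
        echo "S5"      # HibernateState
    else
        echo "NONE"
    fi
}

for idle in 0 600 601; do
    echo "SecondsMachineIdle=$idle -> $(hibernate_decision "$idle")"
done
```

Because the comparison is strict, a machine idle for exactly 600 seconds still yields "NONE"; the state only changes on a later check.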
On the CM, there is a system cronjob that removes the absent flag for
machines that have HibernationState = "S5", to make them visible to
Rooster. I only want to handle the ones in S5 to avoid waking up
computers that have been manually turned off (e.g. for maintenance).
This system cronjob looks like this:
hibernate_machines=$(condor_status -absent -constraint "HibernationLevel == 5" -af Name)
for machine in ${hibernate_machines} ; do
machine_classad=$(condor_status -absent -long ${machine})
echo "${machine_classad}" | grep -v 'Absent' | /usr/sbin/condor_advertise UPDATE_STARTD_AD - > /tmp/$$ 2>&1
done
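The filter step of this loop can be tried without a pool. A minimal sketch, using attribute values quoted further down in this message as sample input; the function name strip_absent is invented here, and the condor_status and condor_advertise calls are deliberately left out.

```shell
#!/bin/sh
# Hypothetical helper: drop every line mentioning Absent,
# as the grep -v in the cronjob does.
strip_absent() {
    grep -v 'Absent'
}

# Sample ad text, copied from the attributes quoted in this thread.
sample_ad='Absent = true
CanHibernate = true
HibernationLevel = 5
HibernationState = "S5"
Offline = true'

# Print what would be handed on to condor_advertise.
printf '%s\n' "$sample_ad" | strip_absent
```

Note that the pattern is not anchored, so any other attribute whose name or value contains "Absent" would be dropped as well.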
But I have a problem: the machines go to hibernation correctly, but
fail to wake up, and ONLY when the Cron task is running.
Here's the description:
With the SecondsMachineIdleUpdater Cron disabled:
When a machine goes to hibernate, it correctly sends a ClassAd with
HibernationState=S5 over TCP. The Collector gets it and stores the Ad
as Absent and Offline.
The ClassAd shows:
Absent = true
CanHibernate = true
HibernationLevel = 5
HibernationState = "S5"
HibernationSupportedStates = "S3,S4,S5"
Offline = true
Then on the CM, I have this system cronjob that removes the Absent flag
for machines that have HibernationState = "S5", so Rooster can see the
machine again and wake it up.
It works very well, as long as the Cron task is not running.
But as soon as I enable the StartdCronContinuous Cron task on the
compute node, when the machine goes to hibernate it correctly sends
a ClassAd with HibernationState=S5 over TCP, AND a ClassAd with
HibernationState = "NONE" over UDP, in the same second. I found out
about this spurious UDP ClassAd by capturing the network traffic.
The ClassAd shows:
Absent = true
CanHibernate = true
HibernationLevel = 0
HibernationState = "NONE"
HibernationSupportedStates = "S3,S4,S5"
So my system cronjob fails to remove the absent flag, since
HibernationState is not set to S5.
I'm not sure what's going on here. A race condition, a bug, or just
me not doing things correctly?
For completeness, I also found out, when sniffing network packets
with the Cron off, that a UDP ClassAd is sent in that case too, but
only its start (incomplete transmission), so I guess the Collector
just drops it.
The ClassAd shows:
Absent = true
CanHibernate = true
HibernationLevel = 0
HibernationState = "NONE"
HibernationSupportedStates = "S3,S4,S5"
Running 10.0.1 on Debian 11.
As a workaround, I can remove the HibernationState = "S5" filter, but
I believe either I did not understand how to manage Absent ClassAds, or
there is something fishy out there.
I can provide more logs/information if needed.
Thanks for your help!
--
Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing Department of Computer Sciences
Calendar: https://tinyurl.com/yd55mtgd 1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132 Madison, WI 53706-1685