Re: [HTCondor-users] Hibernate and dynamic slots
- Date: Fri, 16 Dec 2022 17:35:52 +0100
- From: Charles Goyard <cgoyard@xxxxxxx>
- Subject: Re: [HTCondor-users] Hibernate and dynamic slots
Todd Tannenbaum wrote:
> On 12/15/2022 1:04 PM, Charles Goyard wrote:
> > Hi,
> >
> > we are a VFX shop, and we have a render farm with dedicated nodes and
> > workstations.
> >
> > We have hibernation enabled and it works very well. The idea is to power
> > off computers after one hour of inactivity.
> >
> > [snip]
> >
> > But now I also enabled dynamic slots on a test setup. It was easy and
> > works well.
> >
> >
> > My problem is that the StateTimer of the parent slot does not get reset
> > when a slot gets created or deleted. So computers get powered off as
> > soon as all the dynamic are removed. So the ShouldHibernate expression
> > is not accurate anymore.
> >
> > [snip]
> >
> > What would be a nice way to detect if a machine did nothing on any slot
> > for one hour ?
> >
> >
> > I'm running htcondor v10.0 on Debian Linux.
> >
>
> Hi Charles,
>
> Ugh, you are correct. This should be easy, but today I cannot think of a
> simple way to detect this from the partitionable/parent slot (slot1). This
> is something we will address in an upcoming release!
>
> However, in the meantime you can get what you want today by using a "startd
> cron" daemon hook, which allows you to add custom attributes into your slot
> ads via the stdout from a script. Details in the Manual are here:
> https://htcondor.readthedocs.io/en/latest/admin-manual/hooks.html#daemon-classad-hooks
>
> Below I wrote an example that you can cut and paste into your HTCondor
> configuration - i.e., just drop the below into a file in
> /etc/condor/config.d. It will configure the startd to use partitionable /
> dynamic slots, and then to also publish an attribute "SecondsMachineIdle"
> into the partitionable slot (slot1) every 20 seconds that contains an
> integer value with how many total seconds the machine has been sitting
> idle. You can then use SecondsMachineIdle in your HIBERNATE expression.
> Hope this helps.
>
> Note the below config assumes that condor_config_val and condor_status are
> in the system path (i.e. /usr/bin), which will be the case assuming you used
> the standard system packaging.
>
> If you are curious to decipher how the below works, note it makes use of
> Configuration Templates, which are covered in the manual here:
> https://htcondor.readthedocs.io/en/latest/admin-manual/configuration-templates.html#available-configuration-templates
> The quoting rules of the arguments being passed to bash are the same as the
> quoting rules for passing arguments to a job in a job submit file; see the
> documentation for the submit file "arguments" command in the condor_submit
> man page at:
> https://htcondor.readthedocs.io/en/latest/man-pages/condor_submit.html
>
> Feel free to ask questions.
>
> regards,
> Todd
>
> #
> # Setup to use partitionable slots
> #
> use feature:PartitionableSlot
>
> #
> # Every 20 seconds update a custom attribute SecondsMachineIdle in slot1 that contains
> # an integer value with how many total seconds the machine has been sitting idle.
> #
> use feature:StartdCronContinuous(SecondsMachineIdle,/bin/bash,"-c '\
>   sleeptime=20; \
>   secondsidle=0; \
>   read -r addr<`condor_config_val startd_address_file`; \
>   while true; \
>     do sleep $sleeptime; \
>     secondsidle=`condor_status -limit 1 -direct ""$addr"" -af ""TotalSlots==1 ? $sleeptime + $secondsidle : 0""`; \
>     echo -e ""SlotID=1\nSecondsMachineIdle=$secondsidle\n-s1""; \
>   done; \
>   '")
Hi Todd,
thanks for the extensive answer. I think I'm almost there. In a
nutshell, the SecondsMachineIdle variable gets updated, but the value in
the ShouldHibernate/HIBERNATE macros does not.
In more detail:
==========
What I did
==========
First, I put the bash script in a separate file to avoid quoting
problems, and passed just the hostname instead of the contents of the
startd_address_file (bash does not like '&' and '<').
This gives the following condor_idletime script:
#!/bin/bash
sleeptime=20
secondsidle=0
addr=`condor_config_val FULL_HOSTNAME`
while true; do
    sleep $sleeptime
    secondsidle=`condor_status -limit 1 -direct $addr -af "TotalSlots==1 ? $sleeptime + $secondsidle : 0"`
    echo -e "SlotID=1\nSecondsMachineIdle=${secondsidle}\n-s1"
done
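As a side note on the '&' and '<' problem: a minimal sketch, assuming the address file holds a single "<host:port?...>" style line, suggests that the characters survive if the value is read with read -r and always expanded inside double quotes. The sample address below is hypothetical and only stands in for the real file contents.

```shell
#!/bin/bash
# Hypothetical address line (an assumption for illustration); it contains
# the '<', '>' and '&' characters that bash treats as operators when
# they appear unquoted.
sample='<192.0.2.10:9618?addrs=192.0.2.10-9618&alias=render0412.example>'

# Write the sample to a temporary file standing in for the address file.
addrfile=$(mktemp)
printf '%s\n' "$sample" > "$addrfile"

# read -r keeps backslashes literal; the redirect reads the first line.
read -r addr < "$addrfile"
rm -f "$addrfile"

# Double quotes keep the shell from interpreting '<' and '&'.
printf 'addr=%s\n' "$addr"
```

With the variable quoted the same way on the condor_status command line ("$addr"), the shell passes the string through as one argument.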
On the condor side, I put:
use feature:PartitionableSlot
WOL_SUPPORTED = TRUE
TimeToWait = 3600
HibernateState = "S5"
SecondsMachineIdle = 0
ShouldHibernate = ( (State == "Unclaimed") \
&& ($(SecondsMachineIdle) > $(TimeToWait)) \
&& ($(WOL_SUPPORTED)))
HIBERNATE = ifThenElse ( debug($(ShouldHibernate)), $(HibernateState), "NONE" )
HIBERNATE_CHECK_INTERVAL = 20
# Hack to detect activity from the number of active slots.
# It increments SecondsMachineIdle as long as the number of slots is 1.
use feature:StartdCronContinuous(SecondsMachineIdleUpdater, /usr/local/bin/condor_idletime)
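The counting rule described in the comment above (add the sleep interval while only one slot exists, reset to zero otherwise) can be checked without a running startd. This is only a sketch: next_idle is a hypothetical helper that stands in for the condor_status query, and the slot counts fed to it are made up.

```shell
#!/bin/bash
# Hypothetical stand-in for the condor_status expression
# "TotalSlots==1 ? sleeptime + secondsidle : 0".
next_idle() {
    local total_slots=$1 sleeptime=$2 secondsidle=$3
    if [ "$total_slots" -eq 1 ]; then
        # Only the partitionable slot exists: keep counting.
        echo $((sleeptime + secondsidle))
    else
        # At least one dynamic slot exists: the machine is busy.
        echo 0
    fi
}

secondsidle=0
# Simulated TotalSlots readings: idle, idle, one dynamic slot, idle.
for slots in 1 1 2 1; do
    secondsidle=$(next_idle "$slots" 20 "$secondsidle")
    # Same output format as the cron script.
    printf 'SlotID=1\nSecondsMachineIdle=%s\n-s1\n' "$secondsidle"
done
```

The counter climbs while the simulated slot count is 1 and drops back to zero as soon as a second slot appears, which is the behaviour the cron script relies on.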
==============
Where I am now
==============
So far, so good, the cron script runs and when I query the value of
SecondsMachineIdle with condor_status, I see it increasing over time.
Thanks a lot for this workaround !
# condor_status -startd -direct render0412 -af SecondsMachineIdle
1220
But it looks like the value of SecondsMachineIdle stays at 0 in ShouldHibernate.
Here is the debug output from HIBERNATE:
Classad debug: [0.00215ms] State --> Unclaimed
Classad debug: [0.00596ms] eval(0) --> 0
Classad debug: [0.10920ms] ((State == "Unclaimed") && (eval(0) > 3600) && (true)) --> FALSE
allHibernating: resource #1: 'NONE' (0x0)
I tried something, and it somehow works. If I do:
condor_config_val -startd -rset "SecondsMachineIdle = 200"
condor_reconfig
The value gets updated in the HIBERNATE macro :
Classad debug: [0.00215ms] State --> Unclaimed
Classad debug: [0.00596ms] eval(200) --> 200
Classad debug: [0.10920ms] ((State == "Unclaimed") && (eval(200) > 3600) && (true)) --> FALSE
allHibernating: resource #1: 'NONE' (0x0)
So maybe there is just a scope/syntax problem in this part:
"SlotID=1\nSecondsMachineIdle=${secondsidle}\n-s1"
or in the ShouldHibernate macro ?
What I don't understand now is that condor_status returns the correct
value, but the macro does not use this value.
What did I miss ? It feels like I'm really close to a working setup !
Thanks a lot,
--
Charles