
Re: [HTCondor-users] Hibernate and dynamic slots



Todd Tannenbaum wrote:
> On 12/15/2022 1:04 PM, Charles Goyard wrote:
> > Hi,
> > 
> > we are a VFX shop, and we have a render farm with dedicated nodes and
> > workstations.
> > 
> > We have hibernation enabled and it works very well. The idea is to power
> > off computers after one hour of inactivity.
> > 
> > [snip]
> > 
> > But now I also enabled dynamic slots on a test setup. It was easy and
> > works well.
> > 
> > 
> > My problem is that the StateTimer of the parent slot does not get reset
> > when a slot gets created or deleted. So computers get powered off as
> > soon as all the dynamic slots are removed. So the ShouldHibernate expression
> > is not accurate anymore.
> > 
> > [snip]
> > 
> > What would be a nice way to detect if a machine did nothing on any slot
> > for one hour ?
> > 
> > 
> > I'm running htcondor v10.0 on Debian Linux.
> > 
> 
> Hi Charles,
> 
> Ugh, you are correct. This should be easy, but today I cannot think of a
> simple way to detect this from the partitionable/parent slot (slot1). This
> is something we will address in an upcoming release!
> 
> However, in the meantime you can get what you want today by using a "startd
> cron" daemon hook, which allows you to add custom attributes into your slot
> ads via the stdout from a script. Details in the Manual are here:
> https://htcondor.readthedocs.io/en/latest/admin-manual/hooks.html#daemon-classad-hooks
> 
> Below I wrote an example that you can cut and paste into your HTCondor
> configuration - i.e., just drop the below into a file in
> /etc/condor/config.d. It will configure the startd to use partitionable /
> dynamic slots, and then to also publish an attribute "SecondsMachineIdle"
> into the partitionable slot (slot1) every 20 seconds that contains an
> integer value with how many total seconds the machine has been sitting
> idle. You can then use SecondsMachineIdle in your HIBERNATE expression.
> Hope this helps.
> 
> Note the below config assumes that condor_config_val and condor_status are
> in the system path (i.e. /usr/bin), which will be the case assuming you used
> the standard system packaging.
> 
> If you are curious to decipher how the below works, note it makes use of
> Configuration Templates, which are covered in the manual here:
> https://htcondor.readthedocs.io/en/latest/admin-manual/configuration-templates.html#available-configuration-templates
> The quoting rules of the arguments being passed to bash are the same as the
> quoting rules for passing arguments to a job in a job submit file; see the
> documentation for the submit file "arguments" command in the condor_submit
> man page at:
> https://htcondor.readthedocs.io/en/latest/man-pages/condor_submit.html
> 
> Feel free to ask questions.
> 
> regards,
> Todd
> 
> #
> # Setup to use partitionable slots
> #
> use feature:PartitionableSlot
> 
> #
> # Every 20 seconds update a custom attribute SecondsMachineIdle in slot1 that contains
> # an integer value with how many total seconds the machine has been sitting idle.
> #
> use feature:StartdCronContinuous(SecondsMachineIdle,/bin/bash,"-c '\
>   sleeptime=20; \
>   secondsidle=0; \
>   read -r addr<`condor_config_val startd_address_file`; \
>   while true; \
>     do sleep $sleeptime; \
>     secondsidle=`condor_status -limit 1 -direct ""$addr"" -af ""TotalSlots==1 ? $sleeptime + $secondsidle : 0""`; \
>     echo -e ""SlotID=1\nSecondsMachineIdle=$secondsidle\n-s1""; \
>   done; \
>   '")


Hi Todd,

thanks for the extensive answer. I think I'm almost there. In a
nutshell, the SecondsMachineIdle attribute gets updated, but the value
seen by the ShouldHibernate/HIBERNATE macros does not.


In more detail:


==========
What I did
==========

First, I put the bash script in a separate file to prevent
interpolation problems, and used just the hostname instead of the
contents of the startd_address_file (bash does not like '&' and '<').

This gives the following condor_idletime script:

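#!/bin/bash
# Startd cron helper: prints SecondsMachineIdle for slot1 every $sleeptime seconds.

sleeptime=20
secondsidle=0
addr=$(condor_config_val FULL_HOSTNAME)

while true; do
    sleep "$sleeptime"
    # TotalSlots==1 means only the partitionable slot exists (no dynamic slots):
    # keep counting in that case, otherwise reset the counter to 0.
    secondsidle=$(condor_status -limit 1 -direct "$addr" -af "TotalSlots==1 ? $sleeptime + $secondsidle : 0")
    # Emit the update in the same format as Todd's example.
    echo -e "SlotID=1\nSecondsMachineIdle=${secondsidle}\n-s1"
done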
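
For readers following along, the counting rule inside the condor_status
call can be sketched in plain bash without any HTCondor tools. This is
only an illustration: next_idle is a hypothetical helper name, not part
of the script above or of HTCondor, and it assumes its three arguments
are integers.

```shell
#!/bin/bash
# next_idle TOTAL_SLOTS PREVIOUS_IDLE STEP
# Hypothetical helper mirroring the ClassAd expression
#   TotalSlots==1 ? $sleeptime + $secondsidle : 0
# If only the partitionable slot exists, add STEP to the counter;
# if any dynamic slot exists, reset the counter to 0.
next_idle() {
    local total_slots=$1 previous_idle=$2 step=$3
    if [ "$total_slots" -eq 1 ]; then
        echo $(( previous_idle + step ))
    else
        echo 0
    fi
}
```

With a 20 second step, a machine that has only slot1 goes from 40 to 60,
while a machine with any dynamic slot goes back to 0.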


On the condor side, I put:

use feature:PartitionableSlot

WOL_SUPPORTED = TRUE
TimeToWait = 3600
HibernateState = "S5"

SecondsMachineIdle = 0

ShouldHibernate = (    (State == "Unclaimed") \
                    && ($(SecondsMachineIdle) > $(TimeToWait)) \
                    && ($(WOL_SUPPORTED)))

HIBERNATE = ifThenElse ( debug($(ShouldHibernate)), $(HibernateState), "NONE" )
HIBERNATE_CHECK_INTERVAL = 20

# Hack to detect activity from the number of active slots.
# It increments SecondsMachineIdle as long as the number of slots is 1.

use feature:StartdCronContinuous(SecondsMachineIdleUpdater, /usr/local/bin/condor_idletime)


==============
Where I am now
==============

So far, so good: the cron script runs, and when I query the value of
SecondsMachineIdle with condor_status, I see it increasing over time.
Thanks a lot for this workaround!

# condor_status -startd -direct render0412 -af SecondsMachineIdle
1220


But it looks like the value of SecondsMachineIdle stays at 0 in ShouldHibernate.

Here is the debug output from HIBERNATE:
Classad debug: [0.00215ms] State --> Unclaimed
Classad debug: [0.00596ms] eval(0) --> 0
Classad debug: [0.10920ms] ((State == "Unclaimed") && (eval(0) > 3600) && (true)) --> FALSE
allHibernating: resource #1: 'NONE' (0x0)


I tried something, and it somehow works. If I do:

condor_config_val -startd -rset "SecondsMachineIdle = 200"
condor_reconfig

The value gets updated in the HIBERNATE macro:

Classad debug: [0.00215ms] State --> Unclaimed
Classad debug: [0.00596ms] eval(200) --> 200
Classad debug: [0.10920ms] ((State == "Unclaimed") && (eval(200) > 3600) && (true)) --> FALSE
allHibernating: resource #1: 'NONE' (0x0)


So maybe there is just a scope/syntax problem in this part:

"SlotID=1\nSecondsMachineIdle=${secondsidle}\n-s1"

or in the ShouldHibernate macro?


What I don't understand now is that condor_status returns the correct
value, but the macro does not use it.
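
One possible reading of the debug output, offered as an assumption and
not a confirmed diagnosis: $(SecondsMachineIdle) is a configuration
macro, substituted when the configuration is read, so it picks up the
"SecondsMachineIdle = 0" config line (or the value set with -rset),
whereas condor_status shows the ClassAd attribute published on slot1.
A sketch that drops the config line and refers to the attribute by its
bare name, untested on this setup:

```conf
# Assumption: the startd cron attribute is visible in the slot ad that
# HIBERNATE is evaluated against, so it is referenced without $( ).
# The "SecondsMachineIdle = 0" config line is removed in this variant.
ShouldHibernate = (    (State == "Unclaimed") \
                    && (SecondsMachineIdle =!= undefined) \
                    && (SecondsMachineIdle > $(TimeToWait)) \
                    && ($(WOL_SUPPORTED)))
```

If this reading is right, the debug output should show
SecondsMachineIdle being looked up as an attribute, not a literal
eval(0).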


What did I miss? It feels like I'm really close to a working setup!


Thanks a lot,


-- 
Charles