
Re: [HTCondor-users] Hibernate and dynamic slots



On 12/15/2022 1:04 PM, Charles Goyard wrote:
Hi,

we are a VFX shop, and we have a render farm with dedicated nodes and
workstations.

We have hibernation enabled and it works very well. The idea is to power
off computers after one hour of inactivity.

[snip]

But now I also enabled dynamic slots on a test setup. It was easy and
works well.


My problem is that the StateTimer of the parent slot does not get reset
when a slot gets created or deleted. So computers get powered off as
soon as all the dynamic slots are removed, and the ShouldHibernate
expression is not accurate anymore.

[snip]

What would be a nice way to detect whether a machine did nothing on any
slot for one hour?


I'm running htcondor v10.0 on Debian Linux.


Hi Charles,

Ugh, you are correct.  This should be easy, but today I cannot think of a simple way to detect this from the partitionable/parent slot (slot1).  This is something we will address in an upcoming release! 

However, in the meantime you can get what you want today by using a "startd cron" daemon hook, which allows you to add custom attributes into your slot ads via the stdout from a script.   Details in the Manual are here:
   https://htcondor.readthedocs.io/en/latest/admin-manual/hooks.html#daemon-classad-hooks

Below I wrote an example that you can cut and paste into your HTCondor configuration - i.e., just drop the below into a file in /etc/condor/config.d. It will configure the startd to use partitionable / dynamic slots, and also to publish an attribute "SecondsMachineIdle" into the partitionable slot (slot1) every 20 seconds, containing an integer value with how many total seconds the machine has been sitting idle. You can then use SecondsMachineIdle in your HIBERNATE expression. Hope this helps.

Note the below config assumes that condor_config_val and condor_status are in the system path (i.e. /usr/bin), which will be the case assuming you used the standard system packaging. 

If you are curious to decipher how the below works, note it makes use of Configuration Templates, which are covered in the manual here:
  https://htcondor.readthedocs.io/en/latest/admin-manual/configuration-templates.html#available-configuration-templates
The quoting rules of the arguments being passed to bash are the same as the quoting rules for passing arguments to a job in a job submit file; see the documentation for the submit file "arguments" command in the condor_submit man page at:
  https://htcondor.readthedocs.io/en/latest/man-pages/condor_submit.html

Feel free to ask questions.

regards,
Todd

#
# Setup to use partitionable slots
#
use feature:PartitionableSlot

#
# Every 20 seconds update a custom attribute SecondsMachineIdle in slot1 that contains
# an integer value with how many total seconds the machine has been sitting idle.
#
use feature:StartdCronContinuous(SecondsMachineIdle,/bin/bash,"-c '\
  sleeptime=20; \
  secondsidle=0; \
  read -r addr<`condor_config_val startd_address_file`; \
  while true; \
    do sleep $sleeptime; \
    secondsidle=`condor_status -limit 1 -direct ""$addr"" -af ""TotalSlots==1 ? $sleeptime + $secondsidle : 0""`; \
    echo -e ""SlotID=1\nSecondsMachineIdle=$secondsidle\n-s1""; \
  done; \
  '")
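
A minimal sketch of how SecondsMachineIdle might then be used in a HIBERNATE expression, following the ShouldHibernate pattern mentioned in the question. The one-hour threshold comes from the original request; the 300-second check interval and the "SHUTDOWN" power state are assumptions to adjust for your site, and ShouldHibernate here is only a helper macro name, not a built-in.

```
# Sketch only: threshold, check interval, and power state are assumptions.

# How often (in seconds) the startd evaluates HIBERNATE; 0 disables the check.
HIBERNATE_CHECK_INTERVAL = 300

# True once the startd cron script above has reported at least one hour idle.
# The =!= UNDEFINED guard covers slot ads that do not carry the attribute.
ShouldHibernate = ( SecondsMachineIdle =!= UNDEFINED && SecondsMachineIdle >= 3600 )

# Evaluate to a power state name, or "NONE" to stay up.
HIBERNATE = ifThenElse( $(ShouldHibernate), "SHUTDOWN", "NONE" )
```

Because the script resets the counter to 0 whenever TotalSlots is not 1 (that is, whenever any dynamic slot exists) and otherwise advances it by 20 seconds per update, the threshold is only reached after a full hour with no dynamic slots on the machine.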