
Re: [HTCondor-users] Hibernate and dynamic slots



Todd,

you totally nailed it. Thank you very much, it works just like you said.

Thanks for the extra information about the -direct parameter; that totally makes sense.

Also, I have always been a bit fuzzy about why it was sometimes
"$(Something)" and sometimes only "Something". Everything makes sense now, and I
finally understand an old problem I had with a variable that did not refresh as expected.


Thanks again for your time and excellent guidance, this is very much appreciated.


--
Charles


(For reference, here is the consolidated setup from the conversation, which works with dynamic slots + hibernation.)


The hibernation setup:
======================

WOL_SUPPORTED = TRUE
TimeToWait = 3600
HibernateState = "S5"

SecondsMachineIdle = 0

ShouldHibernate =   (SecondsMachineIdle > $(TimeToWait)) \
                    && ($(WOL_SUPPORTED))

HIBERNATE = ifThenElse ( $(ShouldHibernate), $(HibernateState), "NONE" )
HIBERNATE_CHECK_INTERVAL = 60

# Hack to detect activity from the number of active slots.
# It increments SecondsMachineIdle as long as the number of slots is 1.

use feature:StartdCronContinuous(SecondsMachineIdleUpdater,/usr/local/htcondor/update_secondsmachineidle.sh)


The dynamic partitioning setup:
===============================
use feature:PartitionableSlot

# These set a minimum value for the slots

MODIFY_REQUEST_EXPR_REQUESTCPUS = quantize(RequestCpus, {1})
MODIFY_REQUEST_EXPR_REQUESTMEMORY = quantize(RequestMemory, {4096})
MODIFY_REQUEST_EXPR_REQUESTDISK = quantize(RequestDisk, {1024})
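
To illustrate what these expressions do to a request, here is a rough bash emulation of quantize() for the one-element lists used above. As far as I understand it, quantize(x, {n}) rounds a positive x up to the next multiple of n; the quantize shell function below is my own hypothetical helper, not part of HTCondor, and it assumes positive integer requests.

```shell
#!/bin/bash
# Hypothetical helper: emulate quantize(value, {step}) for a one-element
# list by rounding a positive integer up to the next multiple of step.
quantize() {
    local value=$1 step=$2
    echo $(( (value + step - 1) / step * step ))
}

quantize 1 1          # RequestCpus = 1       -> 1
quantize 1000 4096    # RequestMemory = 1000  -> 4096
quantize 5000 4096    # RequestMemory = 5000  -> 8192
```

So a job asking for 1000 MB of memory ends up with a 4096 MB dynamic slot, and one asking for 5000 MB gets 8192 MB.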


The update_secondsmachineidle.sh script:
========================================

#!/bin/bash
#
# This updates SecondsMachineIdle, which represents the time a machine has
# been seen as having only one slot. The idea is that if a machine has only one
# slot for a long time, it means it is unused and can be powered off.
#
# See https://www-auth.cs.wisc.edu/lists/htcondor-users/2022-December/msg00048.shtml

sleeptime=20
secondsidle=0

read -r addr<`condor_config_val startd_address_file`

while true; do
    sleep $sleeptime
    secondsidle=`condor_status -limit 1 -direct "$addr" -af "TotalSlots==1 ? $sleeptime + $secondsidle : 0"`
    echo -e "SlotID=1\nSecondsMachineIdle=${secondsidle}\n-s1\n"
done
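
The counter logic in the loop above can be tried without a running startd. The sketch below is only an illustration: next_idle is a hypothetical helper that mirrors the ClassAd expression passed to condor_status, and the TotalSlots values are made up for the example.

```shell
#!/bin/bash
# Hypothetical helper mirroring the expression
#   TotalSlots==1 ? $sleeptime + $secondsidle : 0
# Arguments: total_slots, current secondsidle, sleeptime.
next_idle() {
    local total_slots=$1 secondsidle=$2 sleeptime=$3
    if [ "$total_slots" -eq 1 ]; then
        # Only the partitionable slot exists: keep counting.
        echo $(( sleeptime + secondsidle ))
    else
        # At least one dynamic slot exists: the machine is busy, reset.
        echo 0
    fi
}

secondsidle=0
# Made-up sequence of TotalSlots readings, one per 20-second interval.
for total_slots in 1 1 3 1; do
    secondsidle=$(next_idle "$total_slots" "$secondsidle" 20)
    echo "TotalSlots=$total_slots SecondsMachineIdle=$secondsidle"
done
# Prints:
# TotalSlots=1 SecondsMachineIdle=20
# TotalSlots=1 SecondsMachineIdle=40
# TotalSlots=3 SecondsMachineIdle=0
# TotalSlots=1 SecondsMachineIdle=20
```

The counter only grows while the partitionable slot is alone, and any dynamic slot resets it to 0, which is what keeps a busy machine from hibernating.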


Sample output of StartLog (I lowered the idle time threshold), just for fun:
============================================================================

Classad debug: [0.00095ms] 320 --> 320
Classad debug: [0.06819ms] SecondsMachineIdle --> 320
Classad debug: [0.11706ms] (SecondsMachineIdle > 300) && (true) --> TRUE
allHibernating: resource #1: 'S5' (0x10)
ResMgr: This machine is about to enter hibernation
In ResMgr::disableResources ()
Publishing ClassAd 'kflops' to slot1 [InSlotList matches]
Publishing ClassAd 'mips' to slot1 [InSlotList matches]
Publishing ClassAd 'SecondsMachineIdleUpdater.s1' to slot1 [SlotID matches]
All resources disabled: yes.
All resources disabled: yes.
Hibernator: Entering sleep state 'S5'.

Connection to render0415 closed by remote host.