
Re: [HTCondor-users] Hibernate and dynamic slots


You totally nailed it. Thank you very much; it works just as you described.

Thanks for the extra information about the -direct parameter; that makes total sense.

Also, I have always been a bit unclear about why sometimes there was
"$(Something)" and sometimes only "Something". It all makes sense now, and I
finally understand an old problem I had with a variable that did not refresh as expected.

Thanks again for your time and excellent guidance; it is very much appreciated.


(For reference, here is the consolidated setup from this thread, which works with dynamic slots and hibernation.)

The hibernation setup:

TimeToWait = 3600
HibernateState = "S5"

SecondsMachineIdle = 0

ShouldHibernate =   (SecondsMachineIdle > $(TimeToWait)) \
                    && ($(WOL_SUPPORTED))

HIBERNATE = ifThenElse ( $(ShouldHibernate), $(HibernateState), "NONE" )

# Hack to detect activity from the number of slots.
# It increments SecondsMachineIdle as long as the machine has only one slot
# (the partitionable slot, with no dynamic slots carved out of it).

use feature:StartdCronContinuous(SecondsMachineIdleUpdater,/usr/local/htcondor/update_secondsmachineidle.sh)
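For illustration, the decision that ShouldHibernate and HIBERNATE encode can be mirrored in a few lines of shell. This is a minimal sketch, not how the startd evaluates the expression: hibernate_state is a hypothetical helper, and WOL_SUPPORTED is passed in as a literal true/false. Because $(TimeToWait) and $(WOL_SUPPORTED) are config macros, they are substituted once when the configuration is read (the StartLog sample below shows the result, "(SecondsMachineIdle > 300) && (true)"), whereas the bare SecondsMachineIdle is an attribute that is re-evaluated every time.

```shell
#!/bin/bash
# Hypothetical helper mirroring ShouldHibernate and HIBERNATE from the config above.
# $1 = SecondsMachineIdle, $2 = TimeToWait (seconds), $3 = WOL_SUPPORTED ("true"/"false")
hibernate_state() {
    local idle=$1 threshold=$2 wol=$3
    # ShouldHibernate: (SecondsMachineIdle > TimeToWait) && WOL_SUPPORTED
    if (( idle > threshold )) && [ "$wol" = "true" ]; then
        echo "S5"    # HibernateState
    else
        echo "NONE"  # stay awake
    fi
}

hibernate_state 320 300 true    # prints S5, like the StartLog sample
hibernate_state 120 3600 true   # prints NONE
```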

The dynamic partitioning setup:

use feature:PartitionableSlot

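# Round each request up to a multiple of the given value, which also sets a
# minimum slot size: 1 CPU, 4096 MiB of memory and 1024 KiB of disk.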

MODIFY_REQUEST_EXPR_REQUESTCPUS = quantize(RequestCpus, {1})
MODIFY_REQUEST_EXPR_REQUESTMEMORY = quantize(RequestMemory, {4096})
MODIFY_REQUEST_EXPR_REQUESTDISK = quantize(RequestDisk, {1024})
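To see what these quantize() calls do to a request, here is a small shell sketch of quantize with a single-element list, as I read the ClassAd function documentation: a value at or below the list element is raised to it, and anything larger is rounded up to the next multiple. quantize_one is a hypothetical name and only handles positive integers. Since RequestMemory is expressed in MiB, the {4096} entry means no dynamic slot gets less than 4 GiB, and larger requests grow in 4 GiB steps.

```shell
#!/bin/bash
# Hypothetical re-implementation of ClassAd quantize(a, {b}) for a single
# positive integer list element b, as used by the MODIFY_REQUEST_EXPR_* knobs.
quantize_one() {
    local a=$1 b=$2
    if (( a <= b )); then
        echo "$b"                        # small requests are raised to b
    else
        echo $(( (a + b - 1) / b * b ))  # round up to the next multiple of b
    fi
}

quantize_one 1000 4096   # RequestMemory=1000 -> 4096
quantize_one 5000 4096   # RequestMemory=5000 -> 8192
quantize_one 8192 4096   # exact multiples are unchanged -> 8192
```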

The update_secondsmachineidle.sh script:

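#!/bin/bash
# This updates SecondsMachineIdle, which represents how long the machine has
# been seen with only one slot. The idea is that if a machine has only one
# slot for a long time, it is unused and can be powered off.
# See https://www-auth.cs.wisc.edu/lists/htcondor-users/2022-December/msg00048.shtml

sleeptime=60    # polling interval in seconds (example value)
secondsidle=0   # start counting from zero

# The first line of the startd address file is the startd's contact address
read -r addr < "$(condor_config_val startd_address_file)"

while true; do
    sleep "$sleeptime"
    # Add the interval if only the partitionable slot exists, otherwise reset to 0
    secondsidle=$(condor_status -limit 1 -direct "$addr" -af "TotalSlots==1 ? $sleeptime + $secondsidle : 0")
    # Fall back to 0 if condor_status failed or printed something non-numeric
    [[ "$secondsidle" =~ ^[0-9]+$ ]] || secondsidle=0
    # Emit one ad for slot 1; the "-s1" line ends the ad and tags it "s1"
    printf 'SlotID=1\nSecondsMachineIdle=%s\n-s1\n' "$secondsidle"
done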
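The core of the script is the ternary that condor_status evaluates against the startd ad. Here is a minimal sketch of the same counter in plain shell, so the reset behavior can be checked without a running startd; next_idle is a hypothetical helper, and the slot counts fed to it are made-up inputs rather than real TotalSlots readings.

```shell
#!/bin/bash
# Hypothetical pure-shell version of "TotalSlots==1 ? sleeptime + secondsidle : 0".
# $1 = TotalSlots, $2 = sleeptime, $3 = previous SecondsMachineIdle
next_idle() {
    local slots=$1 sleeptime=$2 prev=$3
    if (( slots == 1 )); then
        echo $(( sleeptime + prev ))  # only the partitionable slot: keep counting
    else
        echo 0                        # a dynamic slot exists: machine is busy, reset
    fi
}

# Simulate four polling intervals with made-up slot counts.
# SecondsMachineIdle goes 60, 120, 0, 60.
idle=0
for slots in 1 1 2 1; do
    idle=$(next_idle "$slots" 60 "$idle")
    echo "TotalSlots=$slots SecondsMachineIdle=$idle"
done
```

Keeping the arithmetic inside the condor_status expression, as the script does, means the slot count and the addition are evaluated in one step against the same ad.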

Sample StartLog output (with the idle threshold lowered to 300 seconds), just for fun:

Classad debug: [0.00095ms] 320 --> 320
Classad debug: [0.06819ms] SecondsMachineIdle --> 320
Classad debug: [0.11706ms] (SecondsMachineIdle > 300) && (true) --> TRUE
allHibernating: resource #1: 'S5' (0x10)
ResMgr: This machine is about to enter hibernation
In ResMgr::disableResources ()
Publishing ClassAd 'kflops' to slot1 [InSlotList matches]
Publishing ClassAd 'mips' to slot1 [InSlotList matches]
Publishing ClassAd 'SecondsMachineIdleUpdater.s1' to slot1 [SlotID matches]
All resources disabled: yes.
All resources disabled: yes.
Hibernator: Entering sleep state 'S5'.

Connection to render0415 closed by remote host.
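The 'SecondsMachineIdleUpdater.s1' ad published to slot1 in the log comes from the script's output format. As I understand the startd cron protocol, each block of Attribute=value lines is closed by a line starting with "-", and any text after the dash becomes a tag appended to the job name, while SlotID selects the slot. Below is a hypothetical offline checker (show_cron_ads is not an HTCondor tool) for eyeballing a script's output before wiring it into the startd; it only groups lines into blocks and does no ClassAd parsing.

```shell
#!/bin/bash
# Hypothetical offline checker for startd cron output (not an HTCondor tool).
# Reads Attribute=value lines from stdin; a line starting with "-" closes the
# current block, and any text after the dash is printed as the block's tag.
show_cron_ads() {
    local line attrs=""
    while IFS= read -r line; do
        case "$line" in
            -*)
                # End of one ad: print its tag (may be empty) and attributes
                printf 'ad[%s]:%s\n' "${line#-}" "$attrs"
                attrs=""
                ;;
            *=*)
                # Attribute assignment: collect it for the current block
                attrs="$attrs $line"
                ;;
            *)
                # Blank or unrecognized lines are ignored by this sketch
                ;;
        esac
    done
}

printf 'SlotID=1\nSecondsMachineIdle=60\n-s1\n' | show_cron_ads
# prints: ad[s1]: SlotID=1 SecondsMachineIdle=60
```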