Mailing List Archives
Re: [HTCondor-users] Hibernate and dynamic slots
- Date: Fri, 16 Dec 2022 23:26:48 +0100
- From: Charles Goyard <cgoyard@xxxxxxx>
- Subject: Re: [HTCondor-users] Hibernate and dynamic slots
Todd,
you totally nailed it. Thank you very much, it works just like you said.
Thanks for the extra information about the -direct parameter; that totally makes sense.
Also, I have always been a bit fuzzy about why it was sometimes
"$(Something)" and sometimes only "Something". Everything makes sense now, and I
finally understand an old problem I had with a variable that did not refresh as expected.
Thanks again for your time and excellent guidance, this is very much appreciated.
--
Charles
(For reference, the following is the consolidated setup from the conversation, which works with dynamic slots + hibernation.)
The hibernation setup:
======================
WOL_SUPPORTED = TRUE
TimeToWait = 3600
HibernateState = "S5"
SecondsMachineIdle = 0
ShouldHibernate = (SecondsMachineIdle > $(TimeToWait)) \
&& ($(WOL_SUPPORTED))
HIBERNATE = ifThenElse ( $(ShouldHibernate), $(HibernateState), "NONE" )
HIBERNATE_CHECK_INTERVAL = 60
# Hack to detect activity from the number of active slots.
# It increments SecondsMachineIdle as long as the number of slots is 1.
use feature:StartdCronContinuous(SecondsMachineIdleUpdater,/usr/local/htcondor/update_secondsmachineidle.sh)
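To check the threshold logic without waiting for a machine to power off, the HIBERNATE expression above can be mirrored in a few lines of shell. This is only a sketch: hibernate_decision is a hypothetical helper, not an HTCondor command, and it assumes WOL_SUPPORTED is passed as the literal string TRUE or FALSE.

```shell
#!/bin/bash
# Hypothetical helper (not part of HTCondor): mirrors
#   ShouldHibernate = (SecondsMachineIdle > TimeToWait) && WOL_SUPPORTED
#   HIBERNATE       = ifThenElse(ShouldHibernate, HibernateState, "NONE")
# Arguments: seconds idle, TimeToWait, WOL_SUPPORTED (TRUE/FALSE), sleep state.
hibernate_decision() {
    idle=$1
    wait_time=$2
    wol=$3
    state=${4:-S5}
    if [ "$idle" -gt "$wait_time" ] && [ "$wol" = "TRUE" ]; then
        echo "$state"
    else
        echo "NONE"
    fi
}

hibernate_decision 320 300 TRUE     # S5: 320 > 300, as in the StartLog sample
hibernate_decision 3600 3600 TRUE   # NONE: the comparison is strictly "greater than"
hibernate_decision 7200 3600 FALSE  # NONE: no wake-on-LAN support
```

With HIBERNATE_CHECK_INTERVAL = 60, the real expression is re-evaluated by the startd roughly once a minute, so the machine goes down within about a minute of crossing the threshold.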
The dynamic partitioning setup:
===============================
use feature:PartitionableSlot
# These set a minimum value for the slots
MODIFY_REQUEST_EXPR_REQUESTCPUS = quantize(RequestCpus, {1})
MODIFY_REQUEST_EXPR_REQUESTMEMORY = quantize(RequestMemory, {4096})
MODIFY_REQUEST_EXPR_REQUESTDISK = quantize(RequestDisk, {1024})
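As far as I understand the documented behaviour of quantize() with a one-element list, a request at or below the listed value is raised to that value, and a larger request is rounded up to the next multiple of it. The sketch below emulates that with integer arithmetic; quantize_single is a hypothetical helper written for illustration and is not how HTCondor implements the function.

```shell
#!/bin/bash
# Hypothetical helper: emulates quantize(value, {step}) for a one-element
# list, integers only (an assumption made for this illustration).
quantize_single() {
    value=$1
    step=$2
    if [ "$value" -le "$step" ]; then
        # Requests at or below the minimum are raised to the minimum.
        echo "$step"
    else
        # Larger requests are rounded up to the next multiple of step.
        echo $(( (value + step - 1) / step * step ))
    fi
}

quantize_single 100 4096    # 4096
quantize_single 5000 4096   # 8192
quantize_single 3 1         # 3
```

Since RequestMemory is expressed in MiB and RequestDisk in KiB, the settings above amount to a minimum of 1 CPU, 4 GiB of memory and 1 MiB of disk per dynamic slot.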
The update_secondsmachineidle.sh script:
========================================
#!/bin/bash
#
# This updates SecondsMachineIdle, which represents how long a machine has
# been seen with only one slot. The idea is that if a machine has only one
# slot for a long time, it is unused and can be powered off.
#
# See https://www-auth.cs.wisc.edu/lists/htcondor-users/2022-December/msg00048.shtml
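sleeptime=20    # seconds between two checks
secondsidle=0   # accumulated idle time, in seconds
# Address of the local startd, so that condor_status can query it directly.
read -r addr < "$(condor_config_val startd_address_file)"
while true; do
    sleep "$sleeptime"
    # Add $sleeptime while only one slot exists; reset to 0 otherwise.
    secondsidle=$(condor_status -limit 1 -direct "$addr" -af "TotalSlots==1 ? $sleeptime + $secondsidle : 0")
    # Publish the attribute to slot 1; the line starting with "-" ends the record.
    echo -e "SlotID=1\nSecondsMachineIdle=${secondsidle}\n-s1\n"
done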
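The counter logic of the script can be tried without a running startd by replacing the condor_status call with plain shell arithmetic. This is a sketch under that assumption: next_idle is a hypothetical stand-in for the "TotalSlots==1 ? ... : 0" expression, and the list of slot counts is made-up input.

```shell
#!/bin/bash
# Hypothetical stand-in for the condor_status expression
#   "TotalSlots==1 ? $sleeptime + $secondsidle : 0"
# Arguments: total slots, seconds per interval, idle seconds so far.
next_idle() {
    slots=$1
    step=$2
    idle=$3
    if [ "$slots" -eq 1 ]; then
        # Only the partitionable slot exists: keep counting.
        echo $(( step + idle ))
    else
        # At least one dynamic slot exists: the machine is busy, so reset.
        echo 0
    fi
}

secondsidle=0
# Made-up sequence of TotalSlots values seen on successive checks.
for total_slots in 1 1 1 3 1; do
    secondsidle=$(next_idle "$total_slots" 20 "$secondsidle")
    # Same record format as the real script emits.
    printf 'SlotID=1\nSecondsMachineIdle=%s\n-s1\n' "$secondsidle"
done
```

The counter climbs 20, 40, 60 while the machine has a single slot, drops to 0 as soon as a dynamic slot appears, and starts again at 20 on the next idle check.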
Sample output of StartLog (I lowered the idle time threshold), just for fun:
============================================================================
Classad debug: [0.00095ms] 320 --> 320
Classad debug: [0.06819ms] SecondsMachineIdle --> 320
Classad debug: [0.11706ms] (SecondsMachineIdle > 300) && (true) --> TRUE
allHibernating: resource #1: 'S5' (0x10)
ResMgr: This machine is about to enter hibernation
In ResMgr::disableResources ()
Publishing ClassAd 'kflops' to slot1 [InSlotList matches]
Publishing ClassAd 'mips' to slot1 [InSlotList matches]
Publishing ClassAd 'SecondsMachineIdleUpdater.s1' to slot1 [SlotID matches]
All resources disabled: yes.
All resources disabled: yes.
Hibernator: Entering sleep state 'S5'.
Connection to render0415 closed by remote host.