Hi Christoph,
Thanks for the great summary of the issues and improvements, and for corroborating Charles' great findings. As for your question: sadly, at the moment there is no fix that avoids having to use SIGKILL to keep the host ClassAd from disappearing. We have
opened a couple of tickets on our end to hopefully fix this issue, which is #2 of your suggested improvements.
As for your suggested improvement #1: as of HTCondor V10.4.0, when partitionable slots are in use, the partitionable slot ad (not the dynamic slot ads) contains two attributes called NumDynamicSlots and NumDynamicSlotsTime. NumDynamicSlots is pretty straightforward:
it is the number of dynamic slots that currently exist. NumDynamicSlotsTime is the last time the value of NumDynamicSlots changed, whether through creation or destruction of a dynamic slot. So, in theory, one could use these attributes to determine whether a partitionable
slot has been idle for a while, rather than using the startd cron. I believe some config like the line below should do the trick:
ShouldHibernate = isUndefined(NumDynamicSlots) ? False : (NumDynamicSlots == 0 && (isUndefined(NumDynamicSlotsTime) ? time() - EnteredCurrentActivity : time() - NumDynamicSlotsTime) > $(TimeToWait))
These attributes come from a new feature in V10.4.0 called Startd latches, which are pretty cool on their own. Feel free to let me know if I should elaborate on what is happening in the configuration line.
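As a rough, untested sketch, the expression could slot into the hibernation settings from Charles's execution point configuration (quoted below) like this. TimeToWait, HibernateState, HIBERNATE and HIBERNATE_CHECK_INTERVAL are copied from that configuration; dropping the StartdCronContinuous job and the SecondsMachineIdle counter is an assumption about how the two pieces would be combined, not something that has been verified:

```
# Untested sketch. These two values are taken from Charles's configuration.
TimeToWait = 3600
HibernateState = "S5"

# If NumDynamicSlots is undefined (for example in a dynamic slot ad),
# never hibernate. Otherwise hibernate once there are zero dynamic slots
# and that count has not changed for more than TimeToWait seconds.
# If NumDynamicSlotsTime is undefined, fall back to the time elapsed
# since EnteredCurrentActivity.
ShouldHibernate = isUndefined(NumDynamicSlots) ? False : (NumDynamicSlots == 0 && (isUndefined(NumDynamicSlotsTime) ? time() - EnteredCurrentActivity : time() - NumDynamicSlotsTime) > $(TimeToWait))

# Unchanged from Charles's configuration.
HIBERNATE = ifThenElse ( $(ShouldHibernate), $(HibernateState), "NONE" )
HIBERNATE_CHECK_INTERVAL = 60
```

With this variant the idle time comes straight from the partitionable slot ad, so nothing has to poll condor_status every 20 seconds.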
Hope my ramblings make sense,
Cole Bollig
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Beyer, Christoph <christoph.beyer@xxxxxxx>
Sent: Wednesday, May 24, 2023 7:04 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Hibernate and Cron interference (solved)

Hi,

I would like to second what Charles found out and documented very precisely below :) The 2nd point below is especially crucial to me, as I run into exactly the same problem here. I need to change the KillSignal in condor.service, as otherwise the host ClassAd disappears forever.

@Condorteam: Is there a quick fix for it other than changing to SIGKILL and thus killing the condor daemons rather rudely every time?

Here is the short summary of Charles' proposal:

Room for improvement
====================

In my opinion, there are two improvements possible to be made :

1. Provide an easier way to detect if a machine has been sitting idle for more than some time, removing the need for the CronTask that counts slots.

2. Improve the switch to hibernation where a full, clean system shutdown is requested. That way, no need to kill -9 condor ! I think it works with the current code when you suspend instead of powerdown, or if you have different start/stop scripts. However, with the provided systemd unit, it does not work that well :-/.

Best
christoph

--
Christoph Beyer
DESY Hamburg
IT-Department
Notkestr. 85
Building 02b, Room 009
22607 Hamburg
phone:+49-(0)40-8998-2317
mail: christoph.beyer@xxxxxxx

----- Original Message -----
From: "Charles Goyard" <cgoyard@xxxxxxx>
To: "HTCondor-Users Mail List" <htcondor-users@xxxxxxxxxxx>
Sent: Friday, February 10, 2023 14:48:57
Subject: Re: [HTCondor-users] Hibernate and Cron interference (solved)

Hi all,

thanks to hints and advice from this list, I was able to set up a working hibernation setup. This message is a summary of the final setup and a discussion on how things could be easier in the future ;).

Note: the topic of the discussion is misleading, since the problems we experienced have nothing to do with Condor Cron.

The context
===========

We have a VFX renderfarm, with compute-only machines and workstations with cycle scavenging.
The changes we wanted to implement are to be able to completely power down computers (clean shutdown), and to be able to run several jobs on a single machine, to take advantage of the various IO waits caused by massive threading.

On Execution Points, we have :
==============================

# Do dynamic partitioning
MAX_SLOTS = <%= @max_slots %>
# This is set from 1 to 4 depending on the CPUs.
use feature:PartitionableSlot
MODIFY_REQUEST_EXPR_REQUESTCPUS = quantize(RequestCpus, {1})
MODIFY_REQUEST_EXPR_REQUESTMEMORY = quantize(RequestMemory, {4096})
MODIFY_REQUEST_EXPR_REQUESTDISK = quantize(RequestDisk, {1024})
START = ( TotalSlots <= $(MAX_SLOTS) + 1 )

# Wake-On-Lan and hibernation
# We figure out the WOL capability from the output of ethtool
TimeToWait = 3600
HibernateState = "S5"
SecondsMachineIdle = 0
ShouldHibernate = ( ( SecondsMachineIdle > $(TimeToWait) ) )
HIBERNATE = ifThenElse ( $(ShouldHibernate), $(HibernateState), "NONE" )
HIBERNATE_CHECK_INTERVAL = 60

# Hack to detect activity from the number of active slots.
# It increments SecondsMachineIdle as long as the number of slots is exactly 1.
use feature:StartdCronContinuous(SecondsMachineIdleUpdater,/usr/local/htcondor/update_secondsmachineidle.sh)

with update_secondsmachineidle.sh being :

#!/bin/bash
#
# This updates the SecondsMachineIdle, which represents the time a machine has
# been seen as having only one slot. The idea is that if a machine has only one
# slot for a long time, it means it is unused and can be powered off.
#
# See https://www-auth.cs.wisc.edu/lists/htcondor-users/2022-December/msg00048.shtml

sleeptime=20
secondsidle=0

read -r addr<`condor_config_val startd_address_file`

while true; do
    sleep $sleeptime
    secondsidle=`condor_status -limit 1 -direct "$addr" -af "TotalSlots==1 ? $sleeptime + $secondsidle : 0"`
    echo -e "SlotID=1\nSecondsMachineIdle=${secondsidle}\n-s1\n"
done

Finally, we have to kill HTCondor somewhat violently to prevent a stray ClassAd that does not include the hibernation information.

The condor.service unit file :

[Unit]
Description=Condor Distributed High-Throughput-Computing
After=network.target nslcd.service openntpd.service
Wants=network.target

[Service]
EnvironmentFile=-/etc/default/condor
ExecStart=/usr/sbin/condor_master -f
Delegate=true
# In the future, we will use ExecStop with a synchronous condor_off
KillMode=mixed
ExecReload=/bin/kill -HUP $MAINPID
Restart=on-failure
RestartSec=1minute
WatchdogSec=20minutes
TimeoutStopSec=150seconds
StandardOutput=journal
NotifyAccess=main
KillSignal=SIGKILL ## <-- KILL instead of QUIT fixes hibernation
# Matches values in Linux Kernel Tuning script
LimitNOFILE=32768
TasksMax=4194303

[Install]
WantedBy=multi-user.target

On the Central Manager, we have
===============================

# Rooster wakes nodes up
DAEMON_LIST = COLLECTOR, MASTER, NEGOTIATOR, SCHEDD, ROOSTER, SHARED_PORT
COLLECTOR_PERSISTENT_AD_LOG = /vol/condor/offline_ads/PersistentAdLog
ABSENT_REQUIREMENTS = ( (HibernationLevel?:0) == 0 )
EXPIRE_INVALIDATED_ADS = True
CLASSAD_LIFETIME = 900
# 604800s is 7 days
ABSENT_EXPIRE_ADS_AFTER = 604800
OFFLINE_EXPIRE_ADS_AFTER = 604800
ROOSTER_INTERVAL = 180
ROOSTER_UNHIBERNATE = ( Offline && Unhibernate )
ROOSTER_UNHIBERNATE_RANK = buf_cpuindex_avg

Things seem to have been working well for a few days; we were able to remove the system cron that removed the Absent flag from ClassAds. So far so good !

Side changes
============

We changed from UDP to TCP for communication between EPs and the CM.

Room for improvement
====================

In my opinion, there are two improvements possible to be made :

1. Provide an easier way to detect if a machine has been sitting idle for more than some time, removing the need for the CronTask that counts slots.

2. Improve the switch to hibernation where a full, clean system shutdown is requested. That way, no need to kill -9 condor ! I think it works with the current code when you suspend instead of powerdown, or if you have different start/stop scripts. However, with the provided systemd unit, it does not work that well :-/.

Thank you :)
============

Thanks to Todd, Todd and Christoph for their help. Kudos to the whole Condor team for this wonderful software !

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/