Re: [HTCondor-users] Hibernate and Cron interference (solved)



Hi all,

thanks to hints and advice from this list, I was able to get a
working hibernation setup.

This message is a summary of the final setup and a discussion on how
things could be easier in the future ;).

Note: the subject of this thread is misleading, since the problems we
experienced have nothing to do with Condor Cron.


The context
===========

We have a VFX render farm made up of compute-only nodes and
workstations used for cycle scavenging.

The changes we wanted to implement were the ability to completely power
down computers (clean shutdown) and the ability to run several jobs on a
single machine, to take advantage of the various I/O waits caused by
massive threading.


On Execution Points, we have :
==============================

# Do dynamic partitioning

MAX_SLOTS = <%= @max_slots %> # This is set from 1 to 4 depending on the CPUs.

use feature:PartitionableSlot

MODIFY_REQUEST_EXPR_REQUESTCPUS   = quantize(RequestCpus, {1})
MODIFY_REQUEST_EXPR_REQUESTMEMORY = quantize(RequestMemory, {4096})
MODIFY_REQUEST_EXPR_REQUESTDISK   = quantize(RequestDisk, {1024})

START = ( TotalSlots <= $(MAX_SLOTS) + 1 )


# Wake-On-Lan and hibernation
# We figure out the WOL capability from the output of ethtool

TimeToWait = 3600
HibernateState = "S5"

SecondsMachineIdle = 0

ShouldHibernate = ( ( SecondsMachineIdle > $(TimeToWait) ) )

HIBERNATE = ifThenElse ( $(ShouldHibernate), $(HibernateState), "NONE" )
HIBERNATE_CHECK_INTERVAL = 60
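
To make the comparison in ShouldHibernate concrete, here is a minimal
sketch of the same decision in plain shell. The function name
hibernate_decision is hypothetical and only for illustration; it is not
part of HTCondor, and the values are copied from the configuration above.

```shell
#!/bin/sh
# Hypothetical helper mirroring the HIBERNATE expression above.
# It is an illustration only, not something HTCondor provides.
time_to_wait=3600      # same value as TimeToWait
hibernate_state="S5"   # same value as HibernateState

hibernate_decision() {
    # $1 is the current value of SecondsMachineIdle
    if [ "$1" -gt "$time_to_wait" ]; then
        echo "$hibernate_state"
    else
        echo "NONE"
    fi
}

hibernate_decision 3600   # prints NONE (the comparison is strict)
hibernate_decision 3620   # prints S5
```

Since the comparison is strict, a counter that grows in steps of 20
first satisfies it at 3620, not at 3600.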

# Hack to detect activity from the number of active slots.
# It increments SecondsMachineIdle as long as the number of slots is exactly 1.

use feature:StartdCronContinuous(SecondsMachineIdleUpdater,/usr/local/htcondor/update_secondsmachineidle.sh)

with update_secondsmachineidle.sh being :

#!/bin/bash
#
# This updates SecondsMachineIdle, which represents the time a machine has
# been seen as having only one slot. The idea is that if a machine has only one
# slot for a long time, it means it is unused and can be powered off.
#
# See https://www-auth.cs.wisc.edu/lists/htcondor-users/2022-December/msg00048.shtml

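sleeptime=20
secondsidle=0

# Read the startd address from the file named by startd_address_file.
read -r addr < "$(condor_config_val startd_address_file)"

while true; do
    sleep "$sleeptime"
    # Query the local startd directly: add to the counter while only one
    # slot exists, reset it to 0 otherwise.
    secondsidle=$(condor_status -limit 1 -direct "$addr" -af "TotalSlots==1 ? $sleeptime + $secondsidle : 0")
    echo -e "SlotID=1\nSecondsMachineIdle=${secondsidle}\n-s1\n"
done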
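
The counter logic can be tried without a running startd. The sketch
below replaces the condor_status query with a list of made-up TotalSlots
samples; next_idle is a hypothetical helper name, and the sample values
are assumptions chosen only to show the reset behaviour.

```shell
#!/bin/sh
# next_idle is a hypothetical helper, not an HTCondor command.
# Arguments: current slot count, sampling interval, previous counter value.
next_idle() {
    total_slots=$1
    interval=$2
    previous=$3
    if [ "$total_slots" -eq 1 ]; then
        # Only the partitionable slot exists: keep counting.
        echo $((previous + interval))
    else
        # At least one dynamic slot exists: the machine is in use.
        echo 0
    fi
}

# Made-up samples: idle, idle, one job running, idle again.
secondsidle=0
for slots in 1 1 2 1; do
    secondsidle=$(next_idle "$slots" 20 "$secondsidle")
    echo "TotalSlots=$slots SecondsMachineIdle=$secondsidle"
done
```

With these samples the counter goes 20, 40, 0, 20: a single busy sample
is enough to restart the idle timer.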


Finally, we have to kill HTCondor somewhat violently to prevent a stray
ClassAd that does not include the hibernation information:

The condor.service unit file :

[Unit]
Description=Condor Distributed High-Throughput-Computing
After=network.target nslcd.service openntpd.service
Wants=network.target

[Service]
EnvironmentFile=-/etc/default/condor
ExecStart=/usr/sbin/condor_master -f
Delegate=true
# In the future, we will use ExecStop with a synchronous condor_off
KillMode=mixed
ExecReload=/bin/kill -HUP $MAINPID
Restart=on-failure
RestartSec=1minute
WatchdogSec=20minutes
TimeoutStopSec=150seconds
StandardOutput=journal
NotifyAccess=main
# KILL instead of QUIT fixes hibernation
# (kept on its own line: systemd does not support trailing comments)
KillSignal=SIGKILL
# Matches values in Linux Kernel Tuning script
LimitNOFILE=32768
TasksMax=4194303

[Install]
WantedBy=multi-user.target


On the Central Manager, we have
===============================

# Rooster wakes nodes up
DAEMON_LIST = COLLECTOR, MASTER, NEGOTIATOR, SCHEDD, ROOSTER, SHARED_PORT

COLLECTOR_PERSISTENT_AD_LOG = /vol/condor/offline_ads/PersistentAdLog

ABSENT_REQUIREMENTS = ( (HibernationLevel?:0) == 0 )
EXPIRE_INVALIDATED_ADS = True
CLASSAD_LIFETIME = 900
# 604800s is 7 days
ABSENT_EXPIRE_ADS_AFTER = 604800
OFFLINE_EXPIRE_ADS_AFTER = 604800

ROOSTER_INTERVAL = 180
ROOSTER_UNHIBERNATE = ( Offline && Unhibernate )
ROOSTER_UNHIBERNATE_RANK = buf_cpuindex_avg



Things have been working well for a few days, and we were able to remove
the system cron job that removed the Absent flag from ClassAds. So far so
good!


Side changes
============

We changed from UDP to TCP for communication between EPs and the CM.


Room for improvement
====================

In my opinion, there are two possible improvements:

1. Provide an easier way to detect whether a machine has been sitting idle
for more than some time, removing the need for the cron task that counts
slots.

2. Improve the transition to hibernation when a full, clean system
shutdown is requested. That way, there is no need to kill -9 condor! I
think it works with the current code when you suspend instead of powering
down, or if you have different start/stop scripts. However, with the
provided systemd unit, it does not work that well :-/.


Thank you :)
============

Thanks to Todd, Todd and Christoph for their help. Kudos to the whole
Condor team for this wonderful software !