This is interesting. I like startd crons, and I did not know yet that I could address a specific slot ad using them :)
I guess the SlotID part is supposed to address the specific slot, but what does the '-s1' part do?
On 12/16/2022 10:35 AM, Charles Goyard wrote:
Hi Todd,
thanks for the extensive answer.
Sure thing, your extensive information/troubleshooting makes it much
easier...
More comments inline below....
First, I did put the bash script in a separate file to prevent
interpolation problems, and replaced the contents of the
startd_address_file with just the hostname (bash does not like '&' and '<').
The reason I wrote the script to use the first line of the
startd_address_file is that way the condor_status -direct command
will work properly even if your central manager is down
(specifically, if the condor_collector is down). If you just pass a
hostname to condor_status, it will need to contact the
condor_collector to get the full address to contact the startd
(because it needs the port, shared_port id, maybe ccb info, etc).
One nice thing about HTCondor is the central manager can be rebooted
or updated, and everything just keeps on running.... the last thing
you want is for your config to cause all your execution points
(worker nodes) to hibernate if you reboot your central manager or if
DNS temporarily dies. I suggest you either edit your script below
to check the return code of condor_status and/or the sanity of
secondsidle (as you wrote it below, I think it will end up being a
blank [null string] ... not certain what the startd will do with
that, probably ignore that update, but you may want to test), or put
it back to using the startd_address_file. More below...
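The suggestion above (check the return code of condor_status and the sanity of secondsidle) could be sketched roughly as follows. This is only a sketch under stated assumptions: the helper names is_sane_seconds, emit_update, poll_once and main are hypothetical, the address is assumed to come from the first line of the file named by STARTD_ADDRESS_FILE, and skipping the update on failure is one possible policy, not a tested recommendation.

```shell
#!/bin/bash
# Sketch only: helper names are hypothetical, not part of HTCondor.
sleeptime=20
secondsidle=0

# Succeeds only if $1 is a non-empty string of digits.
is_sane_seconds() {
    [[ $1 =~ ^[0-9]+$ ]]
}

# Print one update in the same format as the script in this thread.
emit_update() {
    printf 'SlotID=1\nSecondsMachineIdle=%s\n-s1\n' "$1"
}

# Query the startd once. On a non-zero exit code or a non-numeric answer,
# keep the previous value and print nothing (assumed policy).
poll_once() {
    local addr=$1 result
    if result=$(condor_status -limit 1 -direct "$addr" \
                  -af "TotalSlots==1 ? $sleeptime + $secondsidle : 0") \
       && is_sane_seconds "$result"; then
        secondsidle=$result
        emit_update "$secondsidle"
    fi
}

main() {
    local addr
    # Assumption: the first line of the startd address file is the address.
    # It is always passed quoted, so '&' and '<' do not upset bash.
    addr=$(head -n 1 "$(condor_config_val STARTD_ADDRESS_FILE)") || exit 1
    while true; do
        sleep "$sleeptime"
        poll_once "$addr"
    done
}

# Call main at the end of the file when deploying this as the cron script.
```

With this shape, a failed or empty condor_status query leaves the last published value in place instead of publishing a blank SecondsMachineIdle.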
This gives the following condor_idletime script:
#!/bin/bash
sleeptime=20
secondsidle=0
addr=`condor_config_val FULL_HOSTNAME`
while true; do
    sleep $sleeptime
    secondsidle=`condor_status -limit 1 -direct $addr -af "TotalSlots==1 ? $sleeptime + $secondsidle : 0"`
    echo -e "SlotID=1\nSecondsMachineIdle=${secondsidle}\n-s1"
done
On the condor side, I put:
use feature:PartitionableSlot
WOL_SUPPORTED = TRUE
TimeToWait = 3600
HibernateState = "S5"
SecondsMachineIdle = 0
ShouldHibernate = ( (State == "Unclaimed") \
&& ($(SecondsMachineIdle) > $(TimeToWait)) \
&& ($(WOL_SUPPORTED)))
HIBERNATE = ifThenElse ( debug($(ShouldHibernate)), $(HibernateState), "NONE" )
HIBERNATE_CHECK_INTERVAL = 20
[snip]
So far, so good, the cron script runs and when I query the value of
SecondsMachineIdle with condor_status, I see it increasing over time.
Thanks a lot for this workaround !
Excellent!
# condor_status -startd -direct render0412 -af SecondsMachineIdle
1220
But it looks like the value of SecondsMachineIdle stays at 0 in ShouldHibernate.
Here is the debug output from HIBERNATE:
Classad debug: [0.00215ms] State --> Unclaimed
Classad debug: [0.00596ms] eval(0) --> 0
Classad debug: [0.10920ms] ((State == "Unclaimed") && (eval(0) > 3600) && (true)) --> FALSE
allHibernating: resource #1: 'NONE' (0x0)
I tried something, and it somehow works. If I do:
condor_config_val -startd -rset "SecondsMachineIdle = 200"
condor_reconfig
The value gets updated in the HIBERNATE macro :
Classad debug: [0.00215ms] State --> Unclaimed
Classad debug: [0.00596ms] eval(200) --> 200
Classad debug: [0.10920ms] ((State == "Unclaimed") && (eval(200) > 3600) && (true)) --> FALSE
allHibernating: resource #1: 'NONE' (0x0)
So maybe there is just a scope/syntax problem in this part:
"SlotID=1\nSecondsMachineIdle=${secondsidle}\n-s1"
or in the ShouldHibernate macro ?
The problem is in your ShouldHibernate macro.
In the condor config file, the $(X) syntax will do a macro expansion
of the value of X in the config file. ClassAd references, however,
do not use/want the $() syntax. So what is happening is your
ShouldHibernate macro, which is configuring a ClassAd _expression_
that the startd will evaluate, is using the literal value of
'SecondsMachineIdle' from the config file, not from the classad! In
the config file, you set it to zero, so that is what you get. The
"condor_config_val -rset" command changes _configuration_ remotely,
not the startd classad, so that is why you see the value change
there. I guess the point is the config file is not a classad... and
quoting rules (in any context!) are always a pain...
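To make the difference concrete, here is a toy bash illustration of the textual substitution described above. It is not how HTCondor implements macro expansion; expand_macro is a hypothetical helper that only mimics replacing $(NAME) with the value from the config file before the startd ever sees the expression.

```shell
#!/bin/bash
# Toy illustration only: mimics config-file $(NAME) substitution as plain
# text replacement. expand_macro is a hypothetical helper, not HTCondor code.
expand_macro() {
    # usage: expand_macro TEXT NAME VALUE
    local text=$1 pattern="\$($2)" value=$3 result
    result=${text//"$pattern"/$value}
    printf '%s\n' "$result"
}

# The expression as written in the config file in this thread.
raw='(State == "Unclaimed") && ($(SecondsMachineIdle) > $(TimeToWait)) && ($(WOL_SUPPORTED))'

# Substitute the values set in the config file.
expanded=$(expand_macro "$raw" SecondsMachineIdle 0)
expanded=$(expand_macro "$expanded" TimeToWait 3600)
expanded=$(expand_macro "$expanded" WOL_SUPPORTED TRUE)

echo "$expanded"
# prints: (State == "Unclaimed") && (0 > 3600) && (TRUE)
```

Only State survives as a name the startd can look up in the slot ad. SecondsMachineIdle has already been replaced by the constant 0, which matches the eval(0) lines in the debug output.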
To fix your config, in ShouldHibernate, change
"$(SecondsMachineIdle)" to be just "SecondsMachineIdle".
Suggest you change from this above:
SecondsMachineIdle = 0
ShouldHibernate = ( (State == "Unclaimed") \
&& ($(SecondsMachineIdle) > $(TimeToWait)) \
&& ($(WOL_SUPPORTED)))
To this instead :
ShouldHibernate = (SecondsMachineIdle > $(TimeToWait)) \
&& ($(WOL_SUPPORTED))
Hope the above makes sense. Let us know how it goes.
regards,
Todd