
Re: [HTCondor-users] Hibernate and dynamic slots



Hi,

got it - what a nice start to the year, answering my own rubbish questions :D

SlotMergeConstraint:
 Each ad can contain one of four possible attributes to control what slot ads the ad is merged into when the
condor_startd sends updates to the collector. These attributes are, in order of highest to lowest priority (in other
words, if SlotMergeConstraint matches, the other attributes are not considered, and so on):
- SlotMergeConstraint expression: the current ad is merged into all slot ads for which this expression is
true. The expression is evaluated with the slot ad as the TARGET ad.
- SlotName|Name string: the current ad is merged into all slots whose Name attributes match the value of
SlotName up to the length of SlotName.
- SlotTypeId integer: the current ad is merged into all ads that have the same value for their SlotTypeId
attribute.
- SlotId integer: the current ad is merged into all ads that have the same value for their SlotId attribute.
For example, if the Startd Cron job returns:
Value=1
SlotId=1
-s1
Value=2
SlotId=2
-s2
Value=10
- update:true
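
A minimal sketch of a Startd Cron script that prints records in the layout quoted above. The helper name emit_slot_record is hypothetical, and the sketch only reproduces the format of the example (attribute lines, then a line that starts with a dash followed by a tag such as s1); it does not say anything about how the startd treats the tag.

```shell
#!/bin/bash
# Hypothetical helper: print one per-slot record in the quoted layout.
# $1 = slot id, $2 = attribute name, $3 = attribute value
emit_slot_record() {
    printf '%s=%s\nSlotId=%s\n-s%s\n' "$2" "$3" "$1" "$1"
}

# Reproduce the example from the quoted manual text.
emit_slot_record 1 Value 1
emit_slot_record 2 Value 2
printf 'Value=10\n- update:true\n'
```

Run as-is, this prints the same nine lines as the example, so it can serve as a starting point for a script that computes the values instead of hard-coding them.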



--
Christoph Beyer
DESY Hamburg
IT-Department

Notkestr. 85
Building 02b, Room 009
22607 Hamburg

phone:+49-(0)40-8998-2317
mail: christoph.beyer@xxxxxxx


From: "Christoph Beyer" <christoph.beyer@xxxxxxx>
To: "HTCondor-Users Mail List" <htcondor-users@xxxxxxxxxxx>
Sent: Monday, 2 January 2023 13:45:30
Subject: Re: [HTCondor-users] Hibernate and dynamic slots

Hi Todd et al,

this is interesting - I like startd crons and did not know yet that I could address a specific slot ad using them :)

The output of your script would be something like:

SlotID=1
SecondsMachineIdle=20
-s1


I guess the SlotID part addresses the specific slot, but what does the '-s1' part do?

best
christoph



From: "Todd Tannenbaum via HTCondor-users" <htcondor-users@xxxxxxxxxxx>
To: "HTCondor-Users Mail List" <htcondor-users@xxxxxxxxxxx>, "Charles Goyard" <cgoyard@xxxxxxx>
CC: "Todd Tannenbaum" <tannenba@xxxxxxxxxxx>
Sent: Friday, 16 December 2022 20:04:57
Subject: Re: [HTCondor-users] Hibernate and dynamic slots

On 12/16/2022 10:35 AM, Charles Goyard wrote:
Hi Todd,

thanks for the extensive answer. 

Sure thing, your extensive information/troubleshooting makes it much easier...

More comments inline below....

First, I put the bash script in a separate file to prevent
interpolation problems, and replaced the contents of the
startd_address_file with just the hostname (bash does not like '&' and '<').

The reason I wrote the script to use the first line of the startd_address_file is that this way the condor_status -direct command will work properly even if your central manager is down (specifically, if the condor_collector is down).  If you just pass a hostname to condor_status, it will need to contact the condor_collector to get the full address of the startd (because it needs the port, shared_port id, maybe ccb info, etc).  One nice thing about HTCondor is that the central manager can be rebooted or updated, and everything just keeps on running... the last thing you want is for your config to cause all your execution points (worker nodes) to hibernate if you reboot your central manager or if DNS temporarily dies.  I suggest you either edit your script below to check the return code of condor_status and/or the sanity of secondsidle (as you wrote it below, I think it will end up being blank [a null string]... I am not certain what the startd will do with that, probably ignore that update, but you may want to test), or put it back to using the startd_address_file.  More below...

This gives this condor_idletime script :

#!/bin/bash

sleeptime=20
secondsidle=0
addr=`condor_config_val FULL_HOSTNAME`

while true; do
    sleep $sleeptime
    secondsidle=`condor_status -limit 1 -direct $addr -af "TotalSlots==1 ? $sleeptime + $secondsidle : 0"`
    echo -e "SlotID=1\nSecondsMachineIdle=${secondsidle}\n-s1"
done
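
The suggestion above to check the return code of condor_status and the sanity of secondsidle could be sketched roughly as follows. The helper name sane_idle is hypothetical, and falling back to 0 on any failure is an assumption made here (it restarts the idle count, so a failed query never counts toward hibernation); the thread does not prescribe a particular fallback.

```shell
#!/bin/bash
# Hypothetical helper (not part of HTCondor): validate one condor_status result.
# $1 = exit status of condor_status, $2 = the text it printed
# Prints the value to publish as SecondsMachineIdle.
sane_idle() {
    local rc=$1 out=$2
    # Accept only a successful query that printed a plain non-negative integer.
    if [ "$rc" -eq 0 ] && [[ "$out" =~ ^[0-9]+$ ]]; then
        echo "$out"
    else
        echo 0   # assumption: restart the idle count instead of publishing a blank
    fi
}

# Possible use inside the loop of the script above (left as comments so that
# this sketch runs without HTCondor installed):
#   out=$(condor_status -limit 1 -direct "$addr" -af "TotalSlots==1 ? $sleeptime + $secondsidle : 0")
#   secondsidle=$(sane_idle $? "$out")
```

Because secondsidle is always a number after this check, the next iteration never builds an expression with an empty operand.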


On the condor side, I put:

use feature:PartitionableSlot

WOL_SUPPORTED = TRUE
TimeToWait = 3600
HibernateState = "S5"

SecondsMachineIdle = 0

ShouldHibernate = (    (State == "Unclaimed") \
                    && ($(SecondsMachineIdle) > $(TimeToWait)) \
                    && ($(WOL_SUPPORTED)))

HIBERNATE = ifThenElse ( debug($(ShouldHibernate)), $(HibernateState), "NONE" )
HIBERNATE_CHECK_INTERVAL = 20
[snip]
So far, so good, the cron script runs and when I query the value of
SecondsMachineIdle with condor_status, I see it increasing over time.
Thanks a lot for this workaround !

Excellent!
# condor_status -startd -direct render0412 -af SecondsMachineIdle
1220


But it looks like the value of SecondsMachineIdle stays at 0 in ShouldHibernate.

Here is the debug output from HIBERNATE:
Classad debug: [0.00215ms] State --> Unclaimed
Classad debug: [0.00596ms] eval(0) --> 0
Classad debug: [0.10920ms] ((State == "Unclaimed") && (eval(0) > 3600) && (true)) --> FALSE
allHibernating: resource #1: 'NONE' (0x0)


I tried something, and it somehow works. If I do:

condor_config_val -startd -rset "SecondsMachineIdle = 200"
condor_reconfig

The value gets updated in the HIBERNATE macro :
Classad debug: [0.00215ms] State --> Unclaimed
Classad debug: [0.00596ms] eval(200) --> 200
Classad debug: [0.10920ms] ((State == "Unclaimed") && (eval(200) > 3600) && (true)) --> FALSE
allHibernating: resource #1: 'NONE' (0x0)


So maybe there is just a scope/syntax problem in this part:

"SlotID=1\nSecondsMachineIdle=${secondsidle}\n-s1"

or in the ShouldHibernate macro ?

The problem is in your ShouldHibernate macro.

In the condor config file, the $(X) syntax will do a macro expansion of the value of X in the config file.  ClassAd references, however, do not use/want the $() syntax.  So what is happening is that your ShouldHibernate macro, which is configuring a ClassAd expression that the startd will evaluate, is using the literal value of 'SecondsMachineIdle' from the config file, not from the classad!  In the config file, you set it to zero, so that is what you get.  The "condor_config_val -rset" command changes _configuration_ remotely, not the startd classad, so that is why you see the value change there.  I guess the point is that the config file is not a classad... and quoting rules (in any context!) are always a pain...

To fix your config, in ShouldHibernate, change "$(SecondsMachineIdle)" to be just "SecondsMachineIdle".

I suggest you change this from above:

  SecondsMachineIdle = 0

  ShouldHibernate = (    (State == "Unclaimed") \
                    && ($(SecondsMachineIdle) > $(TimeToWait)) \
                    && ($(WOL_SUPPORTED)))

To this instead :


  ShouldHibernate = (SecondsMachineIdle > $(TimeToWait)) \
                    && ($(WOL_SUPPORTED))
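
The difference between the two forms can be illustrated with a toy model of config-time substitution. This is only a sketch: expand_macro is a hypothetical helper that does a plain text replacement in bash, and it is not how HTCondor's configuration parser is implemented.

```shell
#!/bin/bash
# Toy model of config-time macro substitution (NOT HTCondor's real parser).
# $1 = expression text, $2 = macro name, $3 = macro value from the config file
expand_macro() {
    local expr=$1 pat="\$($2)"
    # Quoting "$pat" makes bash treat it as a literal string, not a glob.
    printf '%s\n' "${expr//"$pat"/$3}"
}

# Original setting: the config value 0 is pasted into the text up front.
expand_macro '($(SecondsMachineIdle) > 3600)' SecondsMachineIdle 0
# Corrected setting: nothing to substitute, so the attribute name survives
# and can be looked up in the slot ad when the expression is evaluated.
expand_macro '(SecondsMachineIdle > 3600)' SecondsMachineIdle 0
```

The first call prints (0 > 3600), which matches the constant seen in the debug output earlier in the thread; the second prints (SecondsMachineIdle > 3600) unchanged.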

Hope the above makes sense.  Let us know how it goes.

regards,
Todd


_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/
