
[HTCondor-users] Debugging HIBERNATE/ROOSTER, halfway done



Good morning,

I think I owe you (and myself - as this will be searchable in the list archives)
a summary of what I got working, and which (currently minor) details remain.

As a reminder, I was looking for a solution that gets us closer to Green Computing
with HTCondor, with hardware that refuses to power back up on WakeOnLAN packets.

Mainly following Zach McGrew's suggestions, I 

- set 
    ShouldHibernate             = (  (State == "Unclaimed") \
                                 && ( \
                                     ((time() - EnteredCurrentState) > $(HibernateWait)) \
                                  || ((time() - NumDynamicSlotsTime) > $(HibernateWait)) \
                                    ) \
                                 &&  ((time() - LastBenchmark)       > $(ServiceWait)) \
                                  )
  to follow ToddT's suggestions at the 2023 Paris workshop, and extend the grace time
  after system start to something longer (with HibernateWait and ServiceWait defined
  somewhere else)

- changed the signal to terminate the "condor" service to SIGKILL
  - of course a proper solution to issues 1806 and 1807 would be nicer (as it would
    make the next few steps obsolete)

- sent a "reconfig" signal to all nodes and waited for ServiceWait to run out

- as a result got many MachineClassAds that have "Absent=True" set in addition
  to "Offline=True", and therefore would not be considered by the Rooster

- prepared a script, to be run via cron, that takes such ClassAds and removes the
  Absent attribute
  - it would have been perfect if there was such a thing as COLLECTOR_CRON_* ...
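
A minimal sketch of the attribute-stripping step from the last bullet, in Python
(this is not the script actually used; how the ads are fetched from and pushed
back to the collector is left out, and the sample ad is made up for illustration):

```python
import re

# Matches a ClassAd "Name = value" line and captures the attribute name.
_ATTR_LINE = re.compile(r"^\s*([A-Za-z_][A-Za-z0-9_]*)\s*=")


def strip_attributes(ad_text, names=("Absent",)):
    """Return ad_text without the lines that assign any attribute in names.

    ClassAd attribute names are case-insensitive, so compare in lower case.
    """
    drop = {n.lower() for n in names}
    kept = []
    for line in ad_text.splitlines():
        m = _ATTR_LINE.match(line)
        if m and m.group(1).lower() in drop:
            continue  # skip the unwanted attribute
        kept.append(line)
    return "\n".join(kept)


# Hypothetical long-format machine ad, for illustration only.
sample = '''MyType = "Machine"
Name = "slot1@node01.example.org"
Offline = true
Absent = true
LastHeardFrom = 1700000000'''

print(strip_attributes(sample))
```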

So far so good. But Rooster, with
    ROOSTER_INTERVAL                = 600
    ROOSTER_MAX_UNHIBERNATE         = 20
    ROOSTER_UNHIBERNATE_RANK        = (time() - MY.LastHeardFrom)
    ROOSTER_WAKEUP_CMD              = "/usr/local/sbin/condor-mywake.sh"
    ABSENT_REQUIREMENTS             = True
    EXPIRE_INVALIDATED_ADS          = True
    COLLECTOR_PERSISTENT_AD_LOG     = /var/log/condor/CollectorAdLog
    DAEMON_LIST                     = $(DAEMON_LIST) ROOSTER
with "condor-mywake.sh" extracting the machine name and using "ipmitool" to power
it on, wakes up nodes, but does not obviously follow ROOSTER_UNHIBERNATE_RANK.
The expression above should rank machines that have been switched off for a longer
time *higher* than recently offlined ones; instead I saw the same machine come up
multiple times in a row, with others still off.
It is also unclear what would happen if a particular machine selected for waking
did not come up properly. During my tests I saw the same machine addressed by
the Rooster over and over (because the script had a bug).
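
To make the expected behaviour concrete, here is a small Python sketch of the
ordering the rank expression is meant to produce. It only illustrates the intent
(largest "time() - LastHeardFrom" first, capped like ROOSTER_MAX_UNHIBERNATE); it
is not how the Rooster is implemented, and the node names and timestamps are
made up:

```python
def pick_to_wake(ads, now, max_unhibernate=20):
    """Order offline ads by (now - LastHeardFrom), largest first, and cap the list."""
    ranked = sorted(ads, key=lambda ad: now - ad["LastHeardFrom"], reverse=True)
    return [ad["Machine"] for ad in ranked[:max_unhibernate]]


now = 1_700_010_000
offline = [
    {"Machine": "node01", "LastHeardFrom": now - 600},   # off for 10 minutes
    {"Machine": "node02", "LastHeardFrom": now - 7200},  # off for 2 hours
    {"Machine": "node03", "LastHeardFrom": now - 3600},  # off for 1 hour
]

print(pick_to_wake(offline, now, max_unhibernate=2))  # ['node02', 'node03']
```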

Since machines to be woken are selected by the Unhibernate attribute, which is
derived from MachineLastMatchTime, it seems it is the Matchmaker that needs to be
reconfigured, likely by modifying NEGOTIATOR_PRE_JOB_RANK?

Should I add some randomness (to the same macro, or somewhere else?) to
possibly overcome failures of matched nodes? The hardware isn't that new, and
we've seen losses before, even without the extra stress of powering down/up and
the related temperature changes.
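
As an illustration of what such randomness would do to the ordering, a Python
sketch that adds a random offset to each machine's "time switched off" before
sorting. The function name, the jitter range and the sample data are assumptions
made for this example; it says nothing about which macro the randomness would
have to go into:

```python
import random


def jittered_order(ads, now, max_jitter, rng):
    """Sort by (now - LastHeardFrom) plus a random offset in [0, max_jitter]."""
    def rank(ad):
        return (now - ad["LastHeardFrom"]) + rng.uniform(0, max_jitter)
    # sorted() evaluates the key once per element, so each ad gets one draw.
    return [ad["Machine"] for ad in sorted(ads, key=rank, reverse=True)]


now = 1_700_010_000
offline = [
    {"Machine": "node01", "LastHeardFrom": now - 3600},
    {"Machine": "node02", "LastHeardFrom": now - 3700},
    {"Machine": "node03", "LastHeardFrom": now - 3800},
]

# With a jitter larger than the age differences, a node that failed to come up
# is no longer guaranteed to be picked first on every pass.
print(jittered_order(offline, now, max_jitter=600, rng=random.Random()))
```

With max_jitter set to 0 this degenerates to the plain age-based ordering.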

To answer some of my other questions,

> This is question #3:
> Without any job pressure, is there a means to find out what ROOSTER would do
> *if* there was need for more resources? 
> Any means to run the ROOSTER_WAKEUP_CMD for a certain machine?

A feasible (but complicated-looking, as it needs a job per machine) way would be
to add a Requirements expression to the job submit file that requests a certain
host. I haven't found any other means to make the Rooster wake up a node while
there are still idle (matching) ones left.
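
Since this needs one job per machine, the submit descriptions could be generated.
A rough Python sketch, assuming the machine ads carry the usual Machine attribute
with the full hostname; the executable and the hostnames are placeholders, and
whether such a job reliably triggers the Rooster has not been verified here:

```python
def submit_description(machine, executable="/bin/true"):
    """Render a submit description for a job that can only match one host."""
    lines = [
        f"executable   = {executable}",
        f'requirements = (Machine == "{machine}")',
        "queue",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical hostnames, one submit description each.
for host in ("node01.example.org", "node02.example.org"):
    print(submit_description(host))
```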

> (Related: which UID is used to run that?)

"ipmitool -I lanplus" doesn't require special permissions :)

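For the archives, a Python sketch of the two things the wake-up script has to do:
find the machine name in the ClassAd and build the "ipmitool" call. The actual
script is a shell script and is not reproduced here. The "-ipmi" naming scheme for
the BMC, the user name, and taking the password from the environment ("-E") are
assumptions made for this example:

```python
import re


def machine_from_ad(ad_text):
    """Return the value of the Machine attribute, or None if it is missing."""
    m = re.search(r'^\s*Machine\s*=\s*"([^"]+)"', ad_text,
                  re.MULTILINE | re.IGNORECASE)
    return m.group(1) if m else None


def bmc_host(machine, suffix="-ipmi"):
    """Hypothetical naming scheme: node01.example.org -> node01-ipmi.example.org."""
    short, dot, domain = machine.partition(".")
    return f"{short}{suffix}{dot}{domain}"


def power_on_command(machine, user="admin"):
    """Build the ipmitool argument list; -E reads the password from IPMI_PASSWORD."""
    return ["ipmitool", "-I", "lanplus", "-H", bmc_host(machine),
            "-U", user, "-E", "chassis", "power", "on"]


ad = 'Name = "slot1@node01.example.org"\nMachine = "node01.example.org"\nOffline = true'
print(power_on_command(machine_from_ad(ad)))
```

In a real ROOSTER_WAKEUP_CMD the ad text would be read from standard input (as far
as I understand the documentation) and the list handed to subprocess.run().
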
> Question #4:
> With the "DaemonStartTime" debacle in mind, how would I ROOSTER_UNHIBERNATE_RANK
> machines higher that have been switched off for longer (so the wear gets more
> balanced between all hardware)? Would e.g. "LastHeardFrom" work? Anything better?

Yes and no, see above. I'd like to have this fixed, but it's not mission-critical.

> Question #5a:
> Currently there are multiple sets of entries in the COLLECTOR_PERSISTENT_AD_LOG
> for a single machine, is this normal?
> Will the file be compactified (how and when)?

It gets rewritten, and versioned in some way (I now have .3, .4 and the "main"
one), but I'm not sure about the lifetime. I haven't set OFFLINE_EXPIRE_ADS_AFTER,
so this should keep the Offline nodes forever, I suppose.

Thanks so far (in particular to Zach and Christoph),
 Steffen