
Re: [HTCondor-users] Debugging HIBERNATE/ROOSTER, halfway done



Hey Steffen,

> Also it's unclear what would happen if a particular machine selected for waking
> would not come up properly - during my tests I saw the same machine addressed by
> the Rooster over and over (because the script had a bug).

I've had this happen with machines where jobs get stuck scheduling to the same offline machine instead of trying another host. Rooster attempts to wake the machine, the wake-up fails for any number of reasons, and the loop repeats with the job matching the same offline machine again. I remember seeing an "Avoiding Blackholes" page on the old Condor wiki somewhere, but I haven't tried it to see whether it would solve this issue. I seem to recall it being something done on the submission side, but that could probably also be done server-side through job transforms. It's on my to-do list, but in my experience it doesn't happen very often; I usually just condor_hold && condor_release the job, which is enough to reschedule it somewhere else.
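
For example, assuming the stuck job is 123.0 (the job id is just for illustration):

    condor_hold 123.0 && condor_release 123.0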

> A feasible (but complicated-looking, as it needs a job per machine) way would be
> to add a Requirement to the job submit file that requests a certain host.
> I haven't found another means to tell the Rooster to wake up *any* node unless
> there are no idle ones left (matching).

That shouldn't be too bad to test. Something like this, which reads from a list of hosts to attempt to wake up:

# Very important task here
executable = /usr/bin/sleep
arguments = 60
request_cpus = 1
request_memory = 1GB
Requirements = (Machine == "$(host)")
queue host from hosts.txt


And then hosts.txt would just be one host per line:
cf405-07.cs.wwu.edu
cf165-07.cs.wwu.edu
kb311-07.cs.wwu.edu

Submitting that would queue one job per host specified, and if a host is offline it should trigger the Rooster to attempt to wake it up.
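
For example (the submit file name is arbitrary):

    condor_submit wake-offline-hosts.sub
    condor_q -nobatch

You should then see one idle job per listed host in condor_q until the Rooster has woken the corresponding machine.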

> It gets rewritten, and versioned in some way (I now have .3, .4 and the "main"
> one), but I'm not sure about the lifetime, I haven't set OFFLINE_EXPIRE_ADS_AFTER
> so this should keep the Offline nodes forever I suppose.

I'm hoping to upgrade my hardware sometime in the next ~68 years (INT_MAX seconds), so losing offline ads to expiry won't be a problem for me. =)


And jumping back to your previous questions, "but not obviously following the UNHIBERNATE_RANK?" and "likely by modifying the NEGOTIATOR_PRE_JOB_RANK?":

Those solve two different things. The NEGOTIATOR_PRE_JOB_RANK would help determine which machines get matched (example: do you prefer to schedule to a particular type of system, or to a system that's online rather than off, etc.). I have mine set to fill nodes depth first, keeping other machines idle or off when possible. The ROOSTER_UNHIBERNATE_RANK is used once a list of offline hosts has been matched and needs to be powered on, and it determines the order to wake them up in (example: prefer to turn on a GPU node before a CPU node because maybe it takes longer to boot?). Everything in the list will eventually be woken, but this lets you adjust the order.
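
Just to illustrate the split, something along these lines (the numbers are arbitrary and this is only the shape of it, not my actual config):

    # Matchmaking side: prefer machines that are already awake, so the
    # negotiator only falls back to offline machines when nothing online fits.
    NEGOTIATOR_PRE_JOB_RANK = 1000000 * (Offline =!= True)

    # Rooster side: among the offline machines that did get matched,
    # wake the ones that have been off the longest first.
    ROOSTER_UNHIBERNATE_RANK = (time() - MY.LastHeardFrom)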

Good luck with your green computing! I got to see some of the building power usage graphs before and after we set this up in our computer labs. There was a very noticeable drop in power usage, and we still get to run research jobs when they land in the queue. Win-win situation.

-Zach

________________________________________
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Steffen Grunewald <steffen.grunewald@xxxxxxxxxx>
Sent: Thursday, November 20, 2025 3:17 AM
To: HTCondor-Users Mail List
Subject: [HTCondor-users] Debugging HIBERNATE/ROOSTER, halfway done

Good morning,

I think I owe you (and myself - as this will be searchable in the list archives)
a summary of what I got working, and which (currently minor) details remain.

As a reminder, I was looking for a solution that gets us closer to Green Computing
with HTCondor, with hardware that refuses to power back up on WakeOnLAN packets.

Mainly following Zach McGrew's suggestions, I

- set
    ShouldHibernate             = (  (State == "Unclaimed") \
                                 && ( \
                                     ((time() - EnteredCurrentState) > $(HibernateWait)) \
                                  || ((time() - NumDynamicSlotsTime) > $(HibernateWait)) \
                                    ) \
                                 &&  ((time() - LastBenchmark)       > $(ServiceWait)) \
                                  )
  to follow ToddT's suggestions at the 2023 Paris workshop, and extend the grace time
  after system start to something longer (with HibernateWait and ServiceWait defined
  somewhere else)

- changed the signal to terminate the "condor" service to SIGKILL
  - of course a proper solution to issues 1806 and 1807 would be nicer (as it would
    make the next few steps obsolete)

- send a "reconfig" signal to all nodes - and waited for ServiceWait to run out

- as a result got many MachineClassAds that have "Absent=True" set in addition
  to "Offline=True", and therefore would not be considered by the Rooster

- I prepared a script, to be run via cron, that takes such ClassAds and removes
  the Absent attribute
  - it would have been perfect if there were such a thing as COLLECTOR_CRON_* ...
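
Roughly, that script looks like this (a simplified, untested sketch - the sed-based
attribute flip and the temp file handling are only illustrative, not the exact script):

    #!/bin/bash
    # Dump the absent offline ads, flip the Absent attribute, and push the
    # ads back to the collector so the Rooster will consider them again.
    tmp=$(mktemp)
    condor_status -absent -long > "$tmp"
    sed -i 's/^Absent = true/Absent = false/' "$tmp"
    condor_advertise -multiple UPDATE_STARTD_AD "$tmp"
    rm -f "$tmp"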

So far so good. But Rooster, with
    ROOSTER_INTERVAL                = 600
    ROOSTER_MAX_UNHIBERNATE         = 20
    ROOSTER_UNHIBERNATE_RANK        = (time() - MY.LastHeardFrom)
    ROOSTER_WAKEUP_CMD              = "/usr/local/sbin/condor-mywake.sh"
    ABSENT_REQUIREMENTS             = True
    EXPIRE_INVALIDATED_ADS          = True
    COLLECTOR_PERSISTENT_AD_LOG = /var/log/condor/CollectorAdLog
    DAEMON_LIST = $(DAEMON_LIST) ROOSTER
with "condor_mywake.sh" extracting the machine and using "ipmitool" to power it on,
wakes up nodes - but not obviously following the UNHIBERNATE_RANK?
The expression above should rank machines switched off for a longer time *higher*
than recently offlined ones, instead I saw the same machine come up multiple times
in a row, with others still off.
Also it's unclear what would happen if a particular machine selected for waking
would not come up properly - during my tests I saw the same machine addressed by
the Rooster over and over (because the script had a bug).
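
For reference, a minimal version of such a wakeup script could look like the
following - the BMC naming scheme and credentials are placeholders, and it assumes
the machine ClassAd arrives on the script's stdin (as with the stock
"condor_power -d -i" default), so it may not match the real script:

    #!/bin/bash
    # Pull the machine name out of the ClassAd handed over by the Rooster,
    # map it to its BMC address, and power the box on via IPMI.
    machine=$(awk -F'"' '/^Machine = /{print $2; exit}')
    bmc="${machine%%.*}-ipmi"        # placeholder naming scheme
    ipmitool -I lanplus -H "$bmc" -U admin -f /etc/condor/ipmi.pass chassis power on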

Since machines to be woken are selected by the Unhibernate attribute, which is
derived from MachineLastMatchTime, it seems it's the Matchmaker that needs to be
reconfigured, likely by modifying the NEGOTIATOR_PRE_JOB_RANK?

Should I add some randomness (to the same macro, or somewhere else?) to
possibly overcome failures of matched nodes? The hardware isn't that new, and
we've seen losses before, even without the extra stress of powering down/up and
the related temperature changes.
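
For the randomness, I'm picturing something like folding a bit of jitter into the
rank expression (just guessing at the syntax; the 600 is an arbitrary number):

    ROOSTER_UNHIBERNATE_RANK = (time() - MY.LastHeardFrom) + random(600)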

To answer some of my other questions,

> This is question #3:
> Without any job pressure, is there a means to find out what ROOSTER would do
> *if* there was need for more resources?
> Any means to run the ROOSTER_WAKEUP_CMD for a certain machine?

A feasible (but complicated-looking, as it needs a job per machine) way would be
to add a Requirement to the job submit file that requests a certain host.
I haven't found another means to tell the Rooster to wake up *any* node unless
there are no idle ones left (matching).

> (Related: which UID is used to run that?)

"ipmitool -I lanplus" doesn't require special permissions :)

> Question #4:
> With the "DaemonStartTime" debacle in mind, how would I ROOSTER_UNHIBERNATE_RANK
> machines higher that have been switched off for longer (so the wear gets more
> balanced between all hardware)? Would e.g. "LastHeardFrom" work? Anything better?

Yes and no, see above. I'd like to have this fixed, but it's not mission-critical.

> Question #5a:
> Currently there are multiple sets of entries in the COLLECTOR_PERSISTENT_AD_LOG
> for a single machine, is this normal?
> Will the file be compactified (how and when)?

It gets rewritten, and versioned in some way (I now have .3, .4 and the "main"
one), but I'm not sure about the lifetime, I haven't set OFFLINE_EXPIRE_ADS_AFTER
so this should keep the Offline nodes forever I suppose.

Thanks so far (in particular to Zach and Christoph),
 Steffen
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe

The archives can be found at: https://www-auth.cs.wisc.edu/lists/htcondor-users/