
[HTCondor-users] How to handle hibernated machines failing to wakeup?



Good morning,


A while ago I asked for details and advice about hibernation (Shutdown)
and wakeup (Rooster) for a 250-node pool.

While some (reasonable-looking) rules for going Offline (after n hours of
being Idle/Unclaimed) have proven to work nicely, this was only the easy
part of the whole story.
(I could have stopped here since the request was only to save energy and
switch off the nodes... but I've also got to serve the users.)

For powering machines back on, we couldn't use WakeOnLAN (or the default
condor_power utility) but had to go for "ipmitool chassis power on" over
the management network instead (so there's code that could be adjusted).
This also works ... in most cases.

There are rare occasions where nodes don't come up though: over the past
three days, we "lost" five machines.

Since they have been matched against a job, they will enter the Rooster
cycle again and again, and even with more Offline nodes available, the
corresponding jobs will stay Idle until some other machine becomes free.
This is undesirable, to say the least.

I'm now looking for a way to

(a) let the matchmaker forget about the assignment of the job to that
    machine (there must be a timeout somewhere?) and

(b) modify the NEGOTIATOR_PRE_JOB_RANK (I suppose this is the right one)
    to reorder Offline machines so this particular one gets ranked down
    /excluded in the next cycle (as long as there are other machines...)

and would appreciate suggestions

(Could MachineLastMatchTime be used for (b)? How to balance it against
LastHeardFrom which is already used to get even "wear"? What else comes
to mind?)
(Should the wakeup script rather check whether the machine is already
powered on, and possibly set it to Absent if it looks dead?)

and hints to documentation chapters that have escaped my ageing eyes
so far!
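
To make the second idea concrete (checking the power state before and
after the power-on attempt), here is a minimal sketch of what such a wakeup
helper might look like. It is not the script currently in use: the BMC user
name, the password file path, the retry count and the wait time are
placeholders, and the three result strings are made up for illustration.
What to do with a "dead" result (e.g. somehow marking the node Absent) is
deliberately left open.

```python
import subprocess
import time


def run_ipmitool(bmc_host, *args):
    """Call ipmitool over the management network and return its stdout.

    The user name and password file below are placeholders (assumptions).
    """
    cmd = ["ipmitool", "-I", "lanplus", "-H", bmc_host,
           "-U", "admin", "-f", "/etc/ipmi.pass", *args]
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
    result.check_returncode()
    return result.stdout


def power_is_on(bmc_host, run_ipmi):
    """Return True if 'chassis power status' reports 'Chassis Power is on'."""
    output = run_ipmi(bmc_host, "chassis", "power", "status")
    return output.strip().lower().endswith("is on")


def wake_node(bmc_host, run_ipmi=run_ipmitool, attempts=3, wait=10.0,
              sleep=time.sleep):
    """Try to power a node on and report what happened.

    Returns one of three illustrative strings:
      'already-on'  power was on before we did anything
      'powered-on'  power came on after one of our attempts
      'dead'        power stayed off after all attempts
    """
    if power_is_on(bmc_host, run_ipmi):
        return "already-on"
    for _ in range(attempts):
        run_ipmi(bmc_host, "chassis", "power", "on")
        sleep(wait)  # give the BMC some time before asking again
        if power_is_on(bmc_host, run_ipmi):
            return "powered-on"
    return "dead"
```

A node that reports "already-on" but never shows up in the pool would
presumably be hung rather than powered off, so it may deserve the same
treatment as a "dead" one. The run_ipmi and sleep parameters only exist
so that the logic can be tried without touching real hardware.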


Thanks,
 Steffen


-- 
Steffen Grunewald, Cluster Administrator
Max Planck Institute for Gravitational Physics (Albert Einstein Institute)
Am Mühlenberg 1 * D-14476 Potsdam-Golm * Germany
~~~
Fon: +49-331-567 7274
Mail: steffen.grunewald(at)aei.mpg.de
~~~