Re: [Condor-users] offline compute nodes and Rooster
- Date: Fri, 15 Oct 2010 10:57:40 -0500
- From: Dan Bradley <dan@xxxxxxxxxxxx>
- Subject: Re: [Condor-users] offline compute nodes and Rooster
On 10/15/10 8:12 AM, Paul Haldane wrote:
Me again - I just need to check my understanding of how power management and Rooster should work.
This is 7.4.3 on a Linux central collector and 7.4.2 on Windows 7 compute nodes.
The behaviour I'm seeing is that compute nodes aren't being powered up to service queued jobs. I submitted a batch yesterday evening (after all the machines had gone to sleep). There was nothing in RoosterLog to indicate that the system requested WoL of any workers. condor_status didn't show any nodes (though entries were written to offline.log for the hibernating machines). Some more experimentation shows that compute nodes appear in condor_status for a while (10-20 minutes) after the machines hibernate.
Can I just check my understanding of what should be happening ...
1. Condor on the compute nodes sends ADs to collector when Offline becomes true (idle and not claimed for at least a minute). This information is stored in offline.log. This bit is working as I expect.
Are you sure this part is working? I'm worried that if Condor tries and
fails to hibernate the machine, it may send an ad that is not an
"Offline" ad, and this will remove the Offline ad from the persistent store.
When ads are removed from the persistent store, you should see a line in
offline.log beginning with 102, which is the DestroyClassAd command.
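One way to check for this is to scan offline.log for lines whose first field is 102. The following is a minimal sketch in Python; the function names are made up for illustration, and it assumes only that each log entry is a single line starting with a numeric operation code, as described above. It makes no assumption about the rest of the line and simply returns matching lines unchanged.

```python
from typing import Iterable, List

# Operation code that marks a DestroyClassAd entry, per the message above.
DESTROY_CLASSAD_OP = "102"


def find_destroy_entries(lines: Iterable[str]) -> List[str]:
    """Return every log line whose first whitespace-separated field is 102."""
    matches = []
    for line in lines:
        fields = line.split(maxsplit=1)
        # Compare the whole first field so that e.g. "1020" does not match.
        if fields and fields[0] == DESTROY_CLASSAD_OP:
            matches.append(line.rstrip("\n"))
    return matches


def scan_offline_log(path: str) -> List[str]:
    """Read the log at `path` (hypothetical helper) and return its 102 entries."""
    with open(path, "r", errors="replace") as handle:
        return find_destroy_entries(handle)
```

Calling scan_offline_log() on the collector's offline.log shortly after a machine hibernates should show whether its Offline ad was removed again.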
2. If Condor were in control of power, Condor on the compute node would then put itself to sleep. We don't use that functionality; instead some other process does the hibernation (with logic to not hibernate if Condor is running a job).
3. Offline slots _should_ (I think they should, but would like confirmation) continue to appear in the output of condor_status (using -constraint Offline to just see offline slots). In our environment they only appear for 10-20 minutes after powering off. This isn't what I expect because OFFLINE_EXPIRE_ADS_AFTER defaults to maxint.
Yes, the offline ads should remain visible in condor_status. They
should not expire in 30 minutes if you are using the default
OFFLINE_EXPIRE_ADS_AFTER.
4. Hibernating compute nodes should be woken up by Rooster on the collector - it will only wake nodes which are visible in condor_status (again that's what I think - am I right?). This doesn't work for us because the offline nodes are only visible for a short time after the node hibernates.
Yes, you are right. If you are using the default unhibernate
expression, what should happen is that some job will get matched to one
of the offline ads and this will result in MachineLastMatchTime getting
updated. Once that happens, the machine's Unhibernate expression should
become true, which should cause Rooster to try to wake it up.
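The wake-up decision described here can be modelled roughly as follows. This is only a simplified sketch in Python, not Rooster's actual code: it represents an ad as a plain dictionary and assumes that the default Unhibernate expression amounts to "MachineLastMatchTime is defined", which is worth verifying against the manual for your Condor version. The function names are hypothetical.

```python
from typing import Any, Dict


def unhibernate(ad: Dict[str, Any]) -> bool:
    """Assumed default: true once the ad carries a MachineLastMatchTime."""
    return ad.get("MachineLastMatchTime") is not None


def rooster_should_wake(ad: Dict[str, Any]) -> bool:
    """Model of the decision: only offline ads that have been matched qualify."""
    return ad.get("Offline") is True and unhibernate(ad)


# An offline ad that no job has matched yet is left alone.
unmatched = {"Name": "slot1@node1", "Offline": True}
# After a match, MachineLastMatchTime is set and the ad qualifies for wake-up.
matched = {"Name": "slot1@node1", "Offline": True, "MachineLastMatchTime": 1287158260}
```

In this model, a machine whose Offline ad has already disappeared from the collector can never be matched, so it never becomes a wake-up candidate.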
What I've observed here is that if Condor decides that it needs a node's resource in the short time between it becoming Offline and disappearing from condor_status then it will try to wake it (often it's already awake).
(a) is my mental model right? If not please point me at the right docs (I might be just missing something obvious - just like the last problem I was having).
(b) Is the step that's missing in our environment the hibernation under Condor's control? Do the Condor daemons at that point send a message to the collector saying "please remember me while I'm asleep"?
Yes. Directly before hibernation, condor_startd sends an Offline ad to
the collector, which is basically "please remember me while I'm
asleep". Any ClassAd sent by the startd that is not an offline ad will
remove the persistent Offline ad, on the assumption that the machine has
now woken back up. This is unfortunate, because it does not support
external hibernation very well.
It is possible to generate offline ads with condor_advertise (by setting
Offline=True), so you could generate the offline ads that way, once you
observe that the machine has gone away. This is what Ian Smith has done
using an external script that he wrote.
--Dan
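
As an illustration of the condor_advertise approach mentioned above, here is a minimal sketch in Python of how an external script might build an offline ad and the command to publish it. This is not Ian Smith's script. The helper names are hypothetical, and the attribute set shown (MyType, TargetType, Name, Machine, Offline) is an assumed minimum: for jobs to match, a real offline ad would most likely need the rest of the machine's attributes too, for example copied from an ad saved while the machine was still up. UPDATE_STARTD_AD is the condor_advertise update command for startd ads.

```python
from typing import Dict, List


def format_classad(attrs: Dict[str, object]) -> str:
    """Render attributes as 'Name = value' lines, quoting string values."""
    lines = []
    for name, value in attrs.items():
        if isinstance(value, bool):  # check bool first: bool is a subclass of int
            rendered = "True" if value else "False"
        elif isinstance(value, (int, float)):
            rendered = str(value)
        else:
            escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
            rendered = f'"{escaped}"'
        lines.append(f"{name} = {rendered}")
    return "\n".join(lines) + "\n"


def offline_ad(slot_name: str, machine: str) -> Dict[str, object]:
    """Assumed minimal attribute set; extend with the machine's real attributes."""
    return {
        "MyType": "Machine",
        "TargetType": "Job",
        "Name": slot_name,
        "Machine": machine,
        "Offline": True,
    }


def advertise_command(ad_path: str) -> List[str]:
    """Command line that would publish the ad file to the collector."""
    return ["condor_advertise", "UPDATE_STARTD_AD", ad_path]
```

A script would write the output of format_classad() to a file once it sees that the machine has gone away, then run the list returned by advertise_command() (for example with subprocess.run) on a host that is allowed to advertise to the collector.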