Hi Steve, Steven Timm wrote:
How can I put a single node in a condor pool into a 'drainoff' state, that is, let any jobs currently running on the node finish, but don't accept new jobs?

It should be:

  condor_off -peaceful

In theory that will shut down the machines once all the running jobs leave. In practice I find that if one job takes an incredibly long time to run, new jobs keep getting assigned to the machine and a peaceful point to shut down is never reached. That's with 6.8.6 (yeah, Condor guys, I know: why don't I tell you about these things? Sometimes it just slips my mind... :) ).

In practice I've found two gotchas with this approach:

(1) You have to execute condor_off -peaceful individually for each startd in the pool. If you just do a global condor_off -peaceful it will kill the schedds and negotiators well before the startds go off, and you won't have the desired result (the jobs will all finish but Condor will never know about it). They need a feature added to automatically do the startds first and then the schedds and collector/negotiators.
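For concreteness, draining a single execute node looks something like this (the hostname is a placeholder, and depending on your version the flag may be spelled -startd or -subsystem startd):

  condor_off -peaceful -startd node01.example.com

That tells only that machine's startd to stop accepting new jobs and to shut down once the jobs it is already running have finished.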
I think the subject was a bit misleading. I really meant 'node drainoff', not 'pool drainoff'. I'm confusing the condor and dCache concepts of a 'pool'.
(2) If you execute condor_off -peaceful for a lot of nodes in rapid succession it will send the collector into a dance of death from which it can take hours to extract itself, and condor_status will time out in the meantime. Supposedly that will be fixed in Condor 7.0.2.

The other two features I've wanted for a long time are (1) an instruction to tell a schedd to start all its existing jobs but not accept any more new ones, and (2) an instruction to let existing jobs on a schedd complete but not start any more new ones. (Yes, I know the latter could be accomplished with condor_hold -constraint ...)
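For what it's worth, a sketch of that condor_hold workaround (the constraint is just one way to write it; JobStatus == 1 is the Idle state, so this holds everything that is queued but not yet running):

  condor_hold -constraint 'JobStatus == 1'

Running jobs are left alone and finish normally; the held jobs simply never start until someone frees them again with condor_release.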
(2) is precisely the feature I was trying to use, except on a startd, not a schedd. If it's not currently possible with 7.0.0, then I'll just have to continue the tedious practice of watching for specific nodes to become idle, then shutting condor off. Otherwise I could just shut condor off while jobs are running, but I don't like to kill jobs that have been running for several hours.
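(For reference, my current manual routine amounts to roughly the following, with the hostname as a placeholder:

  condor_status node01.example.com    # wait until all slots show Unclaimed/Idle
  condor_off node01.example.com       # then shut Condor down on that node

which works, but doesn't scale and needs babysitting.)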
--Mike
I thought I could do this by setting 'START=False' in the node-specific condor_config.local, followed by 'condor_reconfig -subsystem startd' on the node, but that doesn't seem to have worked. The node is still starting new jobs.

Hmm... try:

  condor_reconfig -startd -full

But my gut feeling is that START = False is going to immediately vacate the running jobs.

- Ian