On 7/22/19 10:53 AM, Shawn A Kwang wrote:
I have a couple of "best-practices questions" for condor cluster administration. Is it safe to run 'condor_restart' (-graceful) on the components of a running condor pool? Of course you may ask: what do I mean by 'safe'? Let me ask this question another way. What happens if I run condor_restart on a 1) Central manager, 2) Submit node (running schedd), or 3) Compute node, all while users are actively running jobs?
Shawn:
This is a great question. Assuming everything comes back after a restart, a restart of:

o) The central manager. All running jobs stay running. No new matches can be made. Schedds can start new jobs running only by using existing matches for the same user. condor_status doesn't work while the collector is down.

o) Submit node. All running jobs stay running for up to the lease duration. If the schedd comes back before the job lease expires, it reconnects to the running jobs and the jobs stay running. If the schedd is down for too long, the jobs get preempted and go back to idle. The default job lease duration is 20 minutes.

o) Execute machines. All running jobs on that execute machine are preempted and killed. The schedd will notice the jobs have been preempted, mark them as Idle, and try to restart them again from scratch.
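The submit-node case boils down to comparing how long the schedd is down against the job lease. Here is a rough Python sketch of that rule. It is only a model and not HTCondor code: the function name and the return strings are made up for illustration, and it assumes the lease had just been renewed when the schedd went down (in practice the remaining lease could be shorter).

```python
# Hypothetical model of the lease rule for a schedd restart; not HTCondor code.
DEFAULT_LEASE_SECONDS = 20 * 60  # the 20-minute default quoted in this thread


def outcome_after_schedd_restart(downtime_seconds, lease_seconds=DEFAULT_LEASE_SECONDS):
    """Say what happens to jobs that were running when the schedd went down.

    Assumes the lease was freshly renewed at the moment the schedd stopped.
    """
    if downtime_seconds < lease_seconds:
        # The schedd came back before the lease ran out, so it can reconnect.
        return "reconnected"
    # The lease expired while the schedd was away.
    return "preempted, back to idle"


print(outcome_after_schedd_restart(5 * 60))   # a 5-minute outage
print(outcome_after_schedd_restart(45 * 60))  # a 45-minute outage
```

With the 20-minute lease, the 5-minute outage falls in the "reconnected" branch and the 45-minute outage in the "preempted, back to idle" branch.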
-greg
Thanks in advance. Sincerely, Shawn