
RE: [Condor-users] Preventing executing jobs from terminating when a schedd machine reboots



I've been playing around with this. If I have a Windows XP machine
running jobs on a remote Windows XP machine, I can reboot my machine
provided I first kill the Condor processes (using the kill command from
the MKS Toolkit) in the following order:

kill -9 <all condor_shadow processes> <condor_master process>
<condor_schedd process>

This leaves the jobs running on the remote startds and I can reboot my
machine. However, when the machine comes back online the schedd never
reconnects and re-establishes shadows for the running jobs. Indeed, the
StarterLog.vm1 on my startd reads:

3/9 15:59:32 Create_Process succeeded, pid=1284
3/9 16:09:32 Process exited, pid=1284, status=0
3/9 16:09:32 condor_write(): send() returned -1, timeout=300,
errno=10054.  Assuming failure.
3/9 16:09:32 Buf::write(): condor_write() failed
3/9 16:09:32 Failed to send job exit status to shadow
3/9 16:09:32 JobExit() failed, waiting for job lease to expire or for a
reconnect attempt
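
For anyone checking their own starter logs for the same symptom, here is
a minimal sketch that scans a log excerpt for jobs that exited but could
not report back to a shadow. The function names (unwrap, orphaned_exits)
are hypothetical helpers, not part of Condor, and the matching relies
only on the message text shown in the excerpt above. Note that errno
10054 is Winsock's WSAECONNRESET (connection reset by peer), which fits
a shadow that was killed.

```python
import re

# Excerpt copied from the StarterLog above (mail line-wrapping preserved).
STARTER_LOG = """\
3/9 15:59:32 Create_Process succeeded, pid=1284
3/9 16:09:32 Process exited, pid=1284, status=0
3/9 16:09:32 condor_write(): send() returned -1, timeout=300,
errno=10054.  Assuming failure.
3/9 16:09:32 Buf::write(): condor_write() failed
3/9 16:09:32 Failed to send job exit status to shadow
3/9 16:09:32 JobExit() failed, waiting for job lease to expire or for a
reconnect attempt
"""

# A log entry starts with "M/D HH:MM:SS "; anything else is a wrapped continuation.
STAMP = re.compile(r"^(\d+/\d+ \d+:\d+:\d+) (.*)$")


def unwrap(text):
    """Join wrapped continuation lines onto the timestamped line before them."""
    entries = []
    for line in text.splitlines():
        m = STAMP.match(line)
        if m:
            entries.append([m.group(1), m.group(2)])
        elif entries and line.strip():
            entries[-1][1] += " " + line.strip()
    return [(ts, msg) for ts, msg in entries]


def orphaned_exits(text):
    """Return (pid, status, timestamp) for jobs that exited but could not
    send their exit status to a shadow."""
    exited, orphaned = [], []
    for ts, msg in unwrap(text):
        m = re.match(r"Process exited, pid=(\d+), status=(-?\d+)", msg)
        if m:
            exited.append((int(m.group(1)), int(m.group(2)), ts))
        elif msg.startswith("Failed to send job exit status to shadow") and exited:
            # Pair the failure with the most recent exit seen in the log.
            orphaned.append(exited.pop())
    return orphaned


if __name__ == "__main__":
    for pid, status, ts in orphaned_exits(STARTER_LOG):
        print(f"pid {pid} exited with status {status} at {ts}; no shadow to report to")
```

Here the job ran for ten minutes and finished cleanly (status=0), so the
work itself was done; only the report back to the submit machine failed.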

And the ScheddLog on the machine that spawned the jobs says:

3/9 16:15:24 Negotiating for owner: ichesal@xxxxxxxxxx
3/9 16:15:24 Checking consistency running and runnable jobs
3/9 16:15:24 Tables are consistent
3/9 16:15:24 attempt to add pre-existing match
"<137.57.176.87:1743>#1107349436#1773" ignored
3/9 16:15:25 attempt to add pre-existing match
"<137.57.176.87:1743>#1107349436#1772" ignored
3/9 16:15:25 Out of servers - 2 jobs matched, 6 jobs idle, 1 jobs
rejected
3/9 16:15:25 Response problem from startd.
3/9 16:15:25 Sent RELEASE_CLAIM to startd on <137.57.176.87:1743>
3/9 16:15:25 Match record (<137.57.176.87:1743>, 1, 2) deleted
3/9 16:15:25 Response problem from startd.
3/9 16:15:25 Sent RELEASE_CLAIM to startd on <137.57.176.87:1743>
3/9 16:15:25 Match record (<137.57.176.87:1743>, 1, 3) deleted
3/9 16:15:39 Sent ad to central manager for ichesal@xxxxxxxxxx
3/9 16:15:39 Sent ad to 1 collectors for ichesal@xxxxxxxxxx
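
To make the schedd side easier to read, here is a small sketch that
pulls the released claims and deleted match records out of the excerpt
above. The function name (released_claims) is a hypothetical helper, and
the reading of the two trailing numbers in each match record as cluster
and proc IDs is my assumption; the log itself does not label them.

```python
import re

# Relevant lines copied from the ScheddLog above.
SCHEDD_LOG = """\
3/9 16:15:25 Response problem from startd.
3/9 16:15:25 Sent RELEASE_CLAIM to startd on <137.57.176.87:1743>
3/9 16:15:25 Match record (<137.57.176.87:1743>, 1, 2) deleted
3/9 16:15:25 Response problem from startd.
3/9 16:15:25 Sent RELEASE_CLAIM to startd on <137.57.176.87:1743>
3/9 16:15:25 Match record (<137.57.176.87:1743>, 1, 3) deleted
"""

RELEASE = re.compile(r"Sent RELEASE_CLAIM to startd on <([\d.]+:\d+)>")
DELETED = re.compile(r"Match record \(<([\d.]+:\d+)>, (\d+), (\d+)\) deleted")


def released_claims(text):
    """Return (release_count, deleted_records) found in a ScheddLog excerpt.

    Each deleted record is (startd_address, a, b), where a and b are the two
    numbers printed in the match record (assumed to be cluster and proc).
    """
    releases = RELEASE.findall(text)
    deleted = [(addr, int(a), int(b)) for addr, a, b in DELETED.findall(text)]
    return len(releases), deleted


if __name__ == "__main__":
    count, deleted = released_claims(SCHEDD_LOG)
    print(f"{count} claims released")
    for addr, a, b in deleted:
        print(f"match record for {addr} ({a}, {b}) deleted")
```

Both matches on the startd were released within the same second, rather
than being reconnected.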

It looks like the schedd just plain cannot reconnect to this machine. Is
this expected? Can the schedd not re-spawn the shadows after a reboot so
those jobs don't have to waste all that time?

- Ian



> We occasionally get forced patches pushed to our Windows 
> desktops by our IS department and because of this we are 
> subjected to forced reboots of our machines. This can be a 
> real pain if the forced reboot happens to coincide with a 
> time when you're trying to get a long-running vanilla job 
> through the system. If I have a job that's been running for 2 
> days and needs another day, losing it to a forced reboot can 
> be really frustrating.
> 
> I've added:
> 
> job_lease_duration = 720
> 
> to my submission ticket, submitted a series of jobs, waited 
> for a few of them to begin executing, and then rebooted the 
> Windows machine from which they were scheduled. All the jobs 
> were immediately vacated.
> 
> How can I stop this from happening? My only guess is that 
> there's a shutdown routine in the schedd daemon that gets 
> called when the service is terminated that's actually 
> vacating the jobs on the startd's. Is this correct?
> 
> - Ian
> 
> --
> Ian R. Chesal <ichesal@xxxxxxxxxx>
> Senior Software Engineer
> 
> Altera Corporation
> Toronto Technology Center
> Tel: (416) 926-8300
> 