Re: [Condor-users] CM Failover with submits from CM




Janzen Brewer wrote:
> Dan Bradley wrote:
>> Condor supports fail-over of the submit node.
>
> I understand that the submit node can be failed over, but I'm curious as
> to what happens to the output of a completed job if the submit node from
> which it was submitted failed during its execution. Does the execute
> node keep the output until the secondary submit node undergoes failback?
> Or does it attempt to write it to the same directory on the secondary
> submit node?
  
I don't know much about schedd failover.

I think the directories where output is to be stored would all need to be on a shared disk accessible to both submit nodes. Jobs that are running when the primary submit node fails will wait for up to the job lease duration (default 20 minutes) for the secondary submit node to take over. When the job finishes, whether it finishes during that time or after it, the output would get copied back to the functioning submit node onto the shared disk.
Of course, if you do all this only to make the shared filesystem into a 
single point of failure, you've probably only made things slightly worse.
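
As a rough illustration only, a submit description file along these lines would keep the output on the shared disk and set the lease explicitly. The paths and file names are hypothetical, and I am assuming /shared is mounted on both submit nodes; job_lease_duration is given in seconds, so 1200 corresponds to the 20 minutes mentioned above.

```
# Sketch of a submit description file; paths and names are hypothetical.
universe   = vanilla
executable = my_job.sh

# Assumed to live on the disk shared by both submit nodes.
initialdir = /shared/jobs/run1
output     = job.out
error      = job.err
log        = job.log

# How long a running job waits for a submit node to come back, in seconds.
job_lease_duration = 1200

queue
```

Raising job_lease_duration gives the secondary submit node more time to take over, at the cost of the execute node holding on longer if no submit node ever returns.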
--Dan