
Re: [HTCondor-users] ERROR "Can no longer talk to condor_starter <host:slot>" at line 209 in file src/condor_shadow.V6.1/NTreceivers.cpp



On 02/06/2017 02:40 PM, Harald van Pee wrote:
Hello,

we got MPI running in the parallel universe with HTCondor 8.4 using openmpiscript,
and it's working in general without any problem.

In general, the MPI jobs themselves cannot survive a network outage or 
partition, even a temporary one.  HTCondor will reconnect the shadow to 
the starters, if the problem is just between the submit machine and the 
execute machines, but if the network problem also impacts node-to-node 
communication, then the job has to be aborted and restarted from scratch 
because of the way MPI works.
If possible, we would recommend that long-running jobs that suffer from 
this problem self-checkpoint periodically, so that when they are 
restarted, they can resume from the last checkpoint instead of starting 
from scratch.
-greg
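
To illustrate the self-checkpointing suggestion above, here is a minimal sketch in Python. It is not an HTCondor API and not taken from the original message: the file name checkpoint.json, the function names, and the "work" done in the loop are all hypothetical placeholders. The idea is only that the job periodically writes its state to a file (atomically, so a crash mid-write cannot corrupt the previous checkpoint) and, on startup, resumes from that file if it exists. How the checkpoint file survives the restart (shared filesystem, file transfer, and so on) depends on the site setup and is not shown.

```python
import json
import os
import tempfile

CHECKPOINT = "checkpoint.json"  # hypothetical file name, an assumption


def load_checkpoint(path=CHECKPOINT):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"step": 0, "total": 0}


def save_checkpoint(state, path=CHECKPOINT):
    """Write the state atomically: temp file first, then rename over the old one."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # atomic replacement of the previous checkpoint


def run(n_steps, every=100, path=CHECKPOINT):
    """Run up to n_steps, resuming from the last checkpoint if one exists."""
    state = load_checkpoint(path)
    while state["step"] < n_steps:
        state["total"] += state["step"]  # stand-in for the real computation
        state["step"] += 1
        if state["step"] % every == 0:
            save_checkpoint(state, path)
    save_checkpoint(state, path)  # final state, so a rerun is a no-op
    return state
```

If the job is killed and started again, calling run() with the same arguments picks up at the last saved step rather than at step 0. In an actual MPI program each rank would probably need to save its own part of the state, or one rank would gather and write it; that coordination is left out of this sketch.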