On 02/06/2017 02:40 PM, Harald van Pee wrote:
Hello, we got MPI running in the parallel universe with HTCondor 8.4 using openmpiscript, and it's working in general without any problem.
In general, MPI jobs themselves cannot survive a network outage or partition, even a temporary one. If the problem is only between the submit machine and the execute machines, HTCondor will reconnect the shadow to the starters. But if the network problem also affects node-to-node communication, the job has to be aborted and restarted from scratch, because of the way MPI works.
If possible, we would recommend that long-running jobs that suffer from this problem checkpoint themselves, so that when they are restarted, they can resume from the last checkpoint instead of starting from scratch.
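
As a rough illustration of what self-checkpointing can look like, here is a minimal sketch in Python. It is not HTCondor- or MPI-specific: the file name checkpoint.json, the function names, and the "work" done in the loop are all hypothetical placeholders. The idea is only that the job periodically writes its state to disk atomically and, on startup, resumes from whatever state it finds. A real MPI job would also have to decide which rank writes the state and make sure all ranks restart from the same step.

```python
import json
import os
import tempfile

CHECKPOINT_PATH = "checkpoint.json"  # hypothetical file name


def load_checkpoint(path=CHECKPOINT_PATH):
    """Return the saved state, or a fresh initial state if none exists."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"step": 0, "total": 0}


def save_checkpoint(state, path=CHECKPOINT_PATH):
    """Write the state atomically: temp file in the same directory, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # os.replace is atomic, so a crash never leaves a half-written checkpoint.
    os.replace(tmp, path)


def run(n_steps, every=10, path=CHECKPOINT_PATH):
    """Run up to n_steps, resuming from the last checkpoint if one exists."""
    state = load_checkpoint(path)
    while state["step"] < n_steps:
        state["total"] += state["step"]  # stand-in for the real computation
        state["step"] += 1
        if state["step"] % every == 0:
            save_checkpoint(state, path)
    save_checkpoint(state, path)
    return state


if __name__ == "__main__":
    print(run(1000))
```

If the job is killed and started again, run() picks up at the last saved step, so at most the work done since the previous checkpoint is lost. The checkpoint interval (every=10 here) is a trade-off between I/O overhead and the amount of work that has to be redone.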
-greg