Re: [HTCondor-users] ERROR "Can no longer talk to condor_starter <host:slot>" at line 209 in file src/condor_shadow.V6.1/NTreceivers.cpp
- Date: Mon, 06 Feb 2017 23:43:47 +0100
- From: Harald van Pee <pee@xxxxxxxxxxxxxxxxx>
- Subject: Re: [HTCondor-users] ERROR "Can no longer talk to condor_starter <host:slot>" at line 209 in file src/condor_shadow.V6.1/NTreceivers.cpp
There is one important argument for why I think the problem is related to
Condor and not to MPI (of course I could be wrong).
The Condor communication goes via Ethernet, and the Ethernet connection had a
problem lasting several minutes.
The MPI communication goes via InfiniBand, and there was no InfiniBand problem
during that time.
Harald
On Monday 06 February 2017 23:04:01 Harald van Pee wrote:
> Hi Greg,
>
> thanks for your answer.
>
> On Monday 06 February 2017 22:18:08 Greg Thain wrote:
> > On 02/06/2017 02:40 PM, Harald van Pee wrote:
> > > Hello,
> > >
> > > we got MPI running in the parallel universe with HTCondor 8.4 using
> > > openmpiscript, and it's working in general without any problem.
> >
> > In general, the MPI jobs themselves cannot survive a network outage or
> > partition, even a temporary one. HTCondor will reconnect the shadow to
> > the starters, if the problem is just between the submit machine and the
> > execute machines, but if the network problem also impacts node-to-node
> > communication, then the job has to be aborted and restarted from scratch
> > because of the way MPI works.
>
> The problem seems to be between the submit machine and one running node (not
> the node where mpirun was started).
> If you are right, it should be possible to find an mpirun error, because it
> lost one node, right?
> But it seems Condor kills the job because of a shadow exception.
> Unfortunately we do not see the output of the stopped job, because it is
> overwritten by the newly started one.
> Any suggestions on how to find out whether it's really an MPI-related problem?
>
> > If possible, we would recommend that long-running jobs that suffer from
> > this problem try to self-checkpoint themselves, so that when they are
> > restarted, they don't need to be restarted from scratch.
> >
> > -greg
--
Harald van Pee
Helmholtz-Institut fuer Strahlen- und Kernphysik der Universitaet Bonn
Nussallee 14-16 - 53115 Bonn - Tel +49-228-732213 - Fax +49-228-732505
mail: pee@xxxxxxxxxxxxxxxxx
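
The self-checkpointing suggested in the quoted reply could look roughly like the sketch below. It is only a minimal illustration under assumptions not taken from this thread: the file name checkpoint.json, the functions load_checkpoint, save_checkpoint and run, and the toy summation loop are all hypothetical. In a real MPI job, typically a single rank would write the state, and how the checkpoint file gets back to the job after a restart is not covered here.

```python
import json
import os
import tempfile

CHECKPOINT_FILE = "checkpoint.json"  # hypothetical file name


def load_checkpoint(path=CHECKPOINT_FILE):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as fh:
            return json.load(fh)
    return {"step": 0, "total": 0}


def save_checkpoint(state, path=CHECKPOINT_FILE):
    """Write the state to a temp file, then rename it over the old checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as fh:
        json.dump(state, fh)
        fh.flush()
        os.fsync(fh.fileno())
    # os.replace swaps the file in one step, so a crash never leaves
    # a half-written checkpoint behind.
    os.replace(tmp_path, path)


def run(n_steps, checkpoint_every=100, path=CHECKPOINT_FILE):
    """Toy long-running loop that resumes from the last checkpoint."""
    state = load_checkpoint(path)
    while state["step"] < n_steps:
        state["total"] += state["step"]  # stand-in for the real work
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(state, path)
    save_checkpoint(state, path)
    return state["total"]
```

If the job is killed and started again, run() picks up at the last saved step, so at most checkpoint_every steps of work are repeated instead of the whole computation.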