
Re: [HTCondor-users] ERROR "Can no longer talk to condor_starter <host:slot>" at line 209 in file src/condor_shadow.V6.1/NTreceivers.cpp

Date: Mon, 06 Feb 2017 23:43:47 +0100
From: Harald van Pee <pee@xxxxxxxxxxxxxxxxx>
Subject: Re: [HTCondor-users] ERROR "Can no longer talk to condor_starter <host:slot>" at line 209 in file src/condor_shadow.V6.1/NTreceivers.cpp

There is one important argument why I think the problem is Condor-related and
not MPI-related (of course I could be wrong).
The Condor communication goes via Ethernet, and the Ethernet connection had a
problem for several minutes.
The MPI communication goes via InfiniBand, and there was no InfiniBand problem
during this time.

Harald


On Monday 06 February 2017 23:04:01 Harald van Pee wrote:
> Hi Greg,
> 
> thanks for your answer.
> 
> On Monday 06 February 2017 22:18:08 Greg Thain wrote:
> > On 02/06/2017 02:40 PM, Harald van Pee wrote:
> > > Hello,
> > > 
> > > we got MPI running in the parallel universe with HTCondor 8.4 using
> > > openmpiscript, and it's working in general without any problem.
> > 
> > In general, the MPI jobs themselves cannot survive a network outage or
> > partition, even a temporary one.  HTCondor will reconnect the shadow to
> > the starters, if the problem is just between the submit machine and the
> > execute machines, but if the network problem also impacts node-to-node
> > communication, then the job has to be aborted and restarted from scratch
> > because of the way MPI works.
> 
> The problem seems to be between the submit machine and one running node (not
> the node where mpirun was started).
> If you are right, it should be possible to find an error from mpirun
> because it lost one node, right?
> But it seems Condor kills the job because of a shadow exception.
> Unfortunately we do not see the output of the stopped job because it is
> overwritten by the newly started one.
> Any suggestion how to find out if it's really an MPI-related problem?
> 
> > If possible, we would recommend that long-running jobs that suffer from
> > this problem try to self-checkpoint themselves, so that when they are
> > restarted, they don't need to be restarted from scratch.
> > 
> > -greg
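
As an illustration of the self-checkpointing Greg recommends above, here is a minimal single-process sketch in Python. It is not HTCondor or MPI API: the file name `checkpoint.json`, the functions `load_checkpoint`, `save_checkpoint` and `run`, and the `stop_after` parameter (which only simulates the job being killed) are all hypothetical. It assumes the checkpoint file lives somewhere that survives a restart, and a real MPI job would additionally have to coordinate the checkpoint across its ranks.

```python
import json
import os
import tempfile

CHECKPOINT_FILE = "checkpoint.json"  # hypothetical name, not an HTCondor convention


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    try:
        with open(path) as fh:
            return json.load(fh)
    except FileNotFoundError:
        return {"step": 0, "total": 0}


def save_checkpoint(path, state):
    """Write the state to a temp file, then rename it over the old checkpoint.

    The rename is atomic on the same filesystem, so a job killed mid-write
    leaves the previous checkpoint intact.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as fh:
        json.dump(state, fh)
        fh.flush()
        os.fsync(fh.fileno())
    os.replace(tmp_path, path)


def run(path, n_steps, checkpoint_every=10, stop_after=None):
    """Sum the integers 0..n_steps-1, resuming from the last checkpoint.

    stop_after simulates the job being killed after that many steps
    in the current run; in that case None is returned.
    """
    state = load_checkpoint(path)
    done_this_run = 0
    while state["step"] < n_steps:
        if stop_after is not None and done_this_run >= stop_after:
            return None  # simulated interruption
        state["total"] += state["step"]  # stand-in for the real computation
        state["step"] += 1
        done_this_run += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]


if __name__ == "__main__":
    print(run(CHECKPOINT_FILE, 100))
```

With this pattern a restarted job repeats at most `checkpoint_every` steps of work instead of starting from scratch; how often to checkpoint is a trade-off between that lost work and the cost of writing the state.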

-- 
Harald van Pee

Helmholtz-Institut fuer Strahlen- und Kernphysik der Universitaet Bonn
Nussallee 14-16 - 53115 Bonn - Tel +49-228-732213 - Fax +49-228-732505
mail: pee@xxxxxxxxxxxxxxxxx
