Re: [HTCondor-users] ERROR "Can no longer talk to condor_starter <host:slot>" at line 209 in file src/condor_shadow.V6.1/NTreceivers.cpp
- Date: Tue, 07 Feb 2017 19:55:31 +0100
- From: Harald van Pee <pee@xxxxxxxxxxxxxxxxx>
- Subject: Re: [HTCondor-users] ERROR "Can no longer talk to condor_starter <host:slot>" at line 209 in file src/condor_shadow.V6.1/NTreceivers.cpp
Dear experts,
I have some questions for debugging:
Can I avoid a job being restarted in the vanilla and/or parallel universe if I use
Requirements = (NumJobStarts==0)
in the condor submit description file?
If it works, will the job stay idle or will it be removed?
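To make it concrete, this is roughly what I have in mind for the submit
description file (executable and file names are only placeholders):

  # sketch of a vanilla universe job that is intended to only match
  # as long as it has never been started before
  universe     = vanilla
  executable   = my_job.sh
  log          = job.log
  output       = job.out
  error        = job.err
  Requirements = (NumJobStarts == 0)
  queue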
I found a job in the vanilla universe that was started on 12/9, was restarted
shortly before Christmas, and is still running. I assume the reason was also
network problems, but unfortunately our condor and system log files only go
back to January.
Is there any way to make condor a little more robust against network problems
via configuration, e.g. by waiting a bit longer or making more reconnection
attempts?
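If I understand the manual correctly, the job lease controls how long the
starter keeps the job alive while the shadow is disconnected, so something
like the following in the submit description file might already help (the
value is only a guess):

  # give the shadow more time to reconnect after a network problem before
  # the starter gives up on the job (seconds; value is just an example)
  job_lease_duration = 7200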
We are working on automatic restarts of the MPI jobs and are trying to use
more frequent checkpoints, but it looks like a lot of work, so any ideas
would be welcome.
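Just to sketch the direction we are trying (untested, file and script names
are placeholders): the job would write its own checkpoint file regularly, and
condor would transfer it back on eviction so a restart can resume from it:

  # sketch: let condor bring the checkpoint file back on eviction, so the
  # restarted job can resume from the last checkpoint instead of from scratch
  should_transfer_files   = YES
  when_to_transfer_output = ON_EXIT_OR_EVICT
  transfer_output_files   = checkpoint.dat
  executable              = run_with_restart.sh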
Best
Harald
On Monday 06 February 2017 23:43:47 Harald van Pee wrote:
> There is one important argument for why I think the problem is condor related
> and not MPI (of course I could be wrong).
> The condor communication goes via Ethernet, and the Ethernet connection had
> a problem for several minutes.
> The MPI communication goes via InfiniBand, and there was no InfiniBand
> problem during this time.
>
> Harald
>
> On Monday 06 February 2017 23:04:01 Harald van Pee wrote:
> > Hi Greg,
> >
> > thanks for your answer.
> >
> > On Monday 06 February 2017 22:18:08 Greg Thain wrote:
> > > On 02/06/2017 02:40 PM, Harald van Pee wrote:
> > > > Hello,
> > > >
> > > > we got MPI running in the parallel universe with HTCondor 8.4 using
> > > > openmpiscript, and in general it is working without any problem.
> > >
> > > In general, the MPI jobs themselves cannot survive a network outage or
> > > partition, even a temporary one. HTCondor will reconnect the shadow to
> > > the starters, if the problem is just between the submit machine and the
> > > execute machines, but if the network problem also impacts node-to-node
> > > communication, then the job has to be aborted and restarted from
> > > scratch because of the way MPI works.
> >
> > The problem seems to be between the submit machine and one running node
> > (not the node where mpirun was started).
> > If you are right, it should be possible to find an error from mpirun
> > because it lost one node, right?
> > But it seems condor kills the job because of a shadow exception.
> > Unfortunately we do not see the output of the stopped job because it is
> > overwritten by the newly started one.
> > Any suggestion on how to find out whether it is really an MPI-related problem?
> >
> > > If possible, we would recommend that long-running jobs that suffer from
> > > this problem try to self-checkpoint themselves, so that when they are
> > > restarted, they don't need to be restarted from scratch.
> > >
> > > -greg
> > > _______________________________________________
> > > HTCondor-users mailing list
> > > To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx
> > > with a subject: Unsubscribe
> > > You can also unsubscribe by visiting
> > > https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
> > >
> > > The archives can be found at:
> > > https://lists.cs.wisc.edu/archive/htcondor-users/