
Re: [HTCondor-users] ERROR "Can no longer talk to condor_starter <host:slot>" at line 209 in file src/condor_shadow.V6.1/NTreceivers.cpp



Hi Todd,

thank you for your help.

Concerning the "No route to host": I do not see any ethernet port go down on
any machine during that time, but maybe changing to

/etc/host.conf:
order hosts,bind

/etc/nsswitch.conf:
hosts:      files dns

instead of the Debian default will help anyway; /etc/hosts contains the IP
addresses of all nodes.
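
To spot such resolution or routing dropouts earlier, something like the
following quick check could be run periodically from the submit machine. This
is only a rough sketch; the node name "node37", its address and the startd
port 30389 are example values taken from the logs and need to be adapted.

import errno
import socket

# Example values (hypothetical): name -> address as listed in /etc/hosts
NODES = {"node37": "192.168.123.37"}
STARTD_PORT = 30389   # startd port seen in the ShadowLog

for name, expected_ip in NODES.items():
    # 1) Does the resolver ("files" before "dns") return the /etc/hosts entry?
    resolved = socket.gethostbyname(name)
    if resolved != expected_ip:
        print("%s resolves to %s, expected %s" % (name, resolved, expected_ip))

    # 2) Is the startd port reachable, or do we get errno 113 (EHOSTUNREACH),
    #    i.e. the same "No route to host" the shadow reported?
    try:
        sock = socket.create_connection((expected_ip, STARTD_PORT), timeout=10)
        sock.close()
        print("%s:%d reachable" % (expected_ip, STARTD_PORT))
    except OSError as e:
        print("%s:%d NOT reachable: errno=%s %s"
              % (expected_ip, STARTD_PORT, e.errno, e.strerror))
        if e.errno == errno.EHOSTUNREACH:
            print("  -> no route to host, as in the ShadowLog")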

Regards
Harald

On Tuesday 21 February 2017 21:06:20 Todd Tannenbaum wrote:
> On 2/21/2017 1:33 PM, Harald van Pee wrote:
> > It seems that OpenMPI (or MPI in general) is not used very often with
> > HTCondor; the information is sparse, and I have been asked how I managed
> > to get it running at all. I will share all I know about this in a new
> > thread soon, or is there a wiki where I should put the information?
> 
> Off-list I put Harald in touch with the folks who can put his info
> into the Manual or the HTCondor Wiki (from the web homepage, look for
> the links "HOWTO recipes" and "HTCondor Wiki").
> 
> Also, we did some work for the upcoming HTCondor v8.6.1 release so that it
> works properly with the latest releases of OpenMPI - for details see
>    https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=6024
> 
> > Now back to our problem:
> > One hint that it is related to the network (ethernet or infiniband) is
> > that we had one job running for 11 days without problems while fewer jobs
> > were running, and we got problems within a few days after we started 200
> > more jobs.
> > I have now found 2 independent parallel MPI jobs which share one machine,
> > with one job each, and no ethernet problems are seen, neither on the
> > scheduler machine nor on the starter node. Unfortunately there is no error
> > output in the jobs' error files.
> > It is clear that condor kills the jobs, but it is unclear to me why,
> > because it seems both starter processes are still running, if I understand
> > the logfiles correctly.
> 
> At first blush, it looks to me like the condor_shadow on the submit node
> could no longer contact the execute node at IP address 192.168.123.37
> due to "No route to host".  The "No route to host" error comes from the
> operating system, not from HTCondor - you can google this error and see
> lots of opinions/ideas on how to troubleshoot and fix it, but basically
> there is no route for the execute node's IP address in the client's
> routing table... not sure why this would happen all of a sudden - maybe
> some interface on your submit machine is being disabled, or some switch
> port?
> 
> regards,
> Todd
> 
> > Maybe one of you can find a hint in the condor logs below and tell me
> > what happened, or what I can do to find out.
> > 
> > Best
> > Harald
> > 
> > ShadowLog:
> > 02/19/17 03:09:44 (1744.0) (1729179): condor_read() failed: recv(fd=12)
> > returned -1, errno = 110 Connection timed out, reading 5 bytes from
> > startd at <192.168.123.37:30389>.
> > 02/19/17 03:09:44 (1745.0) (1729180): condor_read() failed: recv(fd=9)
> > returned -1, errno = 110 Connection timed out, reading 5 bytes from
> > startd at <192.168.123.37:30389>.
> > 02/19/17 03:09:44 (1744.0) (1729179): condor_read(): UNEXPECTED read
> > timeout after 0s during non-blocking read from startd at
> > <192.168.123.37:30389> (desired timeout=300s)
> > 02/19/17 03:09:44 (1745.0) (1729180): condor_read(): UNEXPECTED read
> > timeout after 0s during non-blocking read from startd at
> > <192.168.123.37:30389> (desired timeout=300s)
> > 02/19/17 03:09:44 (1744.0) (1729179): IO: Failed to read packet header
> > 02/19/17 03:09:44 (1745.0) (1729180): IO: Failed to read packet header
> > 02/19/17 03:09:44 (1744.0) (1729179): Can no longer talk to
> > condor_starter <192.168.123.37:30389>
> > 02/19/17 03:09:44 (1745.0) (1729180): Can no longer talk to
> > condor_starter <192.168.123.37:30389>
> > 02/19/17 03:09:44 (1744.0) (1729179): This job cannot reconnect to
> > starter, so job exiting
> > 02/19/17 03:09:44 (1745.0) (1729180): This job cannot reconnect to
> > starter, so job exiting
> > 02/19/17 03:09:47 (1745.0) (1729180): attempt to connect to
> > <192.168.123.37:30389> failed: No route to host (connect errno = 113).
> > 02/19/17 03:09:47 (1744.0) (1729179): attempt to connect to
> > <192.168.123.37:30389> failed: No route to host (connect errno = 113).
> > 02/19/17 03:09:47 (1745.0) (1729180): RemoteResource::killStarter():
> > Could not send command to startd
> > 02/19/17 03:09:47 (1744.0) (1729179): RemoteResource::killStarter():
> > Could not send command to startd
> > 02/19/17 03:09:47 (1744.0) (1729179): ERROR "Can no longer talk to
> > condor_starter <192.168.123.37:30389>" at line 209 in file
> > /slots/02/dir_53434/userdir/src/condor_shadow.V6.1/NTreceivers.cpp
> > 02/19/17 03:09:47 (1745.0) (1729180): ERROR "Can no longer talk to
> > condor_starter <192.168.123.37:30389>" at line 209 in file
> > /slots/02/dir_53434/userdir/src/condor_shadow.V6.1/NTreceivers.cpp
> > 
> > 
> > StarterLog of job 1745.0 on node 192.168.123.37
> > 02/15/17 17:14:34 (pid:751398) Create_Process succeeded, pid=751405
> > 02/15/17 17:14:35 (pid:751398) condor_write() failed: send() 1 bytes to
> > <127.0.0.1:10238> returned -1, timeout=0, errno=32 Broken pipe.
> > 02/19/17 03:10:05 (pid:751398) condor_read() failed: recv(fd=8) returned
> > -1, errno = 104 Connection reset by peer, reading 5 bytes from
> > <192.168.123.100:25500>.
> > 02/19/17 03:10:05 (pid:751398) IO: Failed to read packet header
> > 02/19/17 03:10:05 (pid:751398) Lost connection to shadow, waiting 2400
> > secs for reconnect
> > 02/19/17 03:10:05 (pid:751398) Got SIGTERM. Performing graceful shutdown.
> > 02/19/17 03:10:05 (pid:751398) ShutdownGraceful all jobs.
> > 02/19/17 03:10:05 (pid:751398) Process exited, pid=751405, status=0
> > 02/19/17 03:10:05 (pid:751398) Failed to send job exit status to shadow
> > 02/19/17 03:10:05 (pid:751398) Last process exited, now Starter is exiting
> > 02/19/17 03:10:05 (pid:751398) **** condor_starter (condor_STARTER) pid
> > 751398 EXITING WITH STATUS 0
> > 
> > StarterLog of job 1744.0 on node 192.168.123.37
> > 02/15/17 17:14:34 (pid:751399) Create_Process succeeded, pid=751400
> > 02/15/17 17:14:34 (pid:751399) condor_write() failed: send() 1 bytes to
> > <127.0.0.1:48689> returned -1, timeout=0, errno=32 Broken pipe.
> > 02/19/17 03:10:03 (pid:751399) condor_read() failed: recv(fd=8) returned
> > -1, errno = 104 Connection reset by peer, reading 5 bytes from
> > <192.168.123.100:34337>.
> > 02/19/17 03:10:03 (pid:751399) IO: Failed to read packet header
> > 02/19/17 03:10:03 (pid:751399) Lost connection to shadow, waiting 2400
> > secs for reconnect
> > 02/19/17 03:10:03 (pid:751399) Got SIGTERM. Performing graceful shutdown.
> > 02/19/17 03:10:03 (pid:751399) ShutdownGraceful all jobs.
> > 02/19/17 03:10:03 (pid:751399) Process exited, pid=751400, status=0
> > 02/19/17 03:10:03 (pid:751399) Failed to send job exit status to shadow
> > 02/19/17 03:10:03 (pid:751399) Last process exited, now Starter is exiting
> > 02/19/17 03:10:03 (pid:751399) **** condor_starter (condor_STARTER) pid
> > 751399 EXITING WITH STATUS 0
> > 
> > StartLog:
> > 02/19/17 03:09:48 slot1_11: Called deactivate_claim()
> > 02/19/17 03:09:48 slot1_11: Changing state and activity: Claimed/Busy ->
> > Preempting/Vacating
> > 02/19/17 03:09:48 slot1_13: Called deactivate_claim()
> > 02/19/17 03:09:48 slot1_13: Changing state and activity: Claimed/Busy ->
> > Preempting/Vacating
> > 02/19/17 03:10:03 Starter pid 751399 exited with status 0
> > 02/19/17 03:10:03 slot1_11: State change: starter exited
> > 02/19/17 03:10:03 slot1_11: State change: No preempting claim, returning
> > to owner
> > 02/19/17 03:10:03 slot1_11: Changing state and activity:
> > Preempting/Vacating -> Owner/Idle
> > 02/19/17 03:10:03 slot1_11: State change: IS_OWNER is false
> > 02/19/17 03:10:03 slot1_11: Changing state: Owner -> Unclaimed
> > 02/19/17 03:10:03 slot1_11: Changing state: Unclaimed -> Delete
> > 02/19/17 03:10:03 slot1_11: Resource no longer needed, deleting
> > 02/19/17 03:10:05 Starter pid 751398 exited with status 0
> > 02/19/17 03:10:05 slot1_13: State change: starter exited
> > 02/19/17 03:10:05 slot1_13: State change: No preempting claim, returning
> > to owner
> > 02/19/17 03:10:05 slot1_13: Changing state and activity:
> > Preempting/Vacating -> Owner/Idle
> > 02/19/17 03:10:05 slot1_13: State change: IS_OWNER is false
> > 02/19/17 03:10:05 slot1_13: Changing state: Owner -> Unclaimed
> > 02/19/17 03:10:05 slot1_13: Changing state: Unclaimed -> Delete
> > 02/19/17 03:10:05 slot1_13: Resource no longer needed, deleting
> > 02/19/17 03:19:48 Error: can't find resource with ClaimId
> > (<192.168.123.37:30389>#1481221329#1484#...) for 443 (RELEASE_CLAIM);
> > perhaps this claim was removed already.
> > 02/19/17 03:19:48 condor_write(): Socket closed when trying to write 13
> > bytes to <192.168.123.100:20962>, fd is 8
> > 02/19/17 03:19:48 Buf::write(): condor_write() failed
> > 02/19/17 03:19:48 Error: can't find resource with ClaimId
> > (<192.168.123.37:30389>#1481221329#1487#...) for 443 (RELEASE_CLAIM);
> > perhaps this claim was removed already.
> > 02/19/17 03:19:48 condor_write(): Socket closed when trying to write 13
> > bytes to <192.168.123.100:34792>, fd is 8
> > 02/19/17 03:19:48 Buf::write(): condor_write() failed
> > 
> > On Tuesday 07 February 2017 19:55:31 Harald van Pee wrote:
> >> Dear experts,
> >> 
> >> I have some questions for debugging:
> >> Can I avoid the restart of a job in the vanilla and/or parallel universe
> >> if I use Requirements = (NumJobStarts==0)
> >> in the condor submit description file?
> >> If it works, will the job stay idle or will it be removed?
> >> 
> >> I found a job in the vanilla universe that started on 12/9, was restarted
> >> shortly before Christmas, and is still running. I assume the reason was
> >> also network problems, but unfortunately our condor and system log files
> >> only go back to January.
> >> Is there any possibility to make condor a little bit more robust against
> >> network problems via configuration? Just wait a little longer or make
> >> more reconnection attempts?
> >> 
> >> We are working on automatically restarting the MPI jobs and trying to use
> >> more frequent checkpoints, but it seems to be a lot of work, so any idea
> >> would be welcome.
> >> 
> >> Best
> >> Harald
> >> 
> >> On Monday 06 February 2017 23:43:47 Harald van Pee wrote:
> >>> There is one important argument why I think the problem is condor
> >>> related, not MPI (of course I could be wrong):
> >>> the condor communication goes via ethernet, and the ethernet connection
> >>> had a problem for several minutes, while
> >>> the MPI communication goes via infiniband, and there was no infiniband
> >>> problem during this time.
> >>> 
> >>> Harald
> >>> 
> >>> On Monday 06 February 2017 23:04:01 Harald van Pee wrote:
> >>>> Hi Greg,
> >>>> 
> >>>> thanks for your answer.
> >>>> 
> >>>> On Monday 06 February 2017 22:18:08 Greg Thain wrote:
> >>>>> On 02/06/2017 02:40 PM, Harald van Pee wrote:
> >>>>>> Hello,
> >>>>>> 
> >>>>>> we got MPI running in the parallel universe with HTCondor 8.4 using
> >>>>>> openmpiscript, and in general it is working without any problems.
> >>>>> 
> >>>>> In general, the MPI jobs themselves cannot survive a network outage
> >>>>> or partition, even a temporary one.  HTCondor will reconnect the
> >>>>> shadow to the starters, if the problem is just between the submit
> >>>>> machine and the execute machines, but if the network problem also
> >>>>> impacts node-to-node communication, then the job has to be aborted
> >>>>> and restarted from scratch because of the way MPI works.
> >>>> 
> >>>> The problem seems to be between the submit machine and one running node
> >>>> (not the node where mpirun was started).
> >>>> If you are right, it should be possible to see an error from mpirun
> >>>> because it lost one node, right?
> >>>> But it seems condor kills the job because of a shadow exception.
> >>>> Unfortunately we do not see the output of the stopped job because it is
> >>>> overwritten by the newly started one.
> >>>> Any suggestion how to find out whether it is really an MPI-related
> >>>> problem?
> >>>> 
> >>>>> If possible, we would recommend that long-running jobs that suffer
> >>>>> from this problem try to self-checkpoint themselves, so that when
> >>>>> they are restarted, they don't need to be restarted from scratch.
> >>>>> 
> >>>>> -greg
> >> 
> > 
> > _______________________________________________
> > HTCondor-users mailing list
> > To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with
> > a subject: Unsubscribe
> > You can also unsubscribe by visiting
> > https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
> > 
> > The archives can be found at:
> > https://lists.cs.wisc.edu/archive/htcondor-users/

-- 
Harald van Pee

Helmholtz-Institut fuer Strahlen- und Kernphysik der Universitaet Bonn
Nussallee 14-16 - 53115 Bonn - Tel +49-228-732213 - Fax +49-228-732505
mail: pee@xxxxxxxxxxxxxxxxx