[HTCondor-users] ERROR "Can no longer talk to condor_starter <host:slot>" at line 209 in file src/condor_shadow.V6.1/NTreceivers.cpp
- Date: Mon, 06 Feb 2017 21:40:30 +0100
- From: Harald van Pee <pee@xxxxxxxxxxxxxxxxx>
- Subject: [HTCondor-users] ERROR "Can no longer talk to condor_starter <host:slot>" at line 209 in file src/condor_shadow.V6.1/NTreceivers.cpp
Hello,
We have MPI running in the parallel universe with HTCondor 8.4 using openmpiscript,
and in general it works without any problem.
From time to time, however, we have temporary network interruptions. Jobs in
the vanilla universe (generally not MPI jobs) seem to have no problem, and
the scheduler reconnects to the startd.
But for the MPI job, and only for this kind of job, the error message in the
subject appears in the ShadowLog for one of the starters (50 in this case) of an
mpirun job.
Unfortunately, the complete mpirun job is then restarted from the beginning
because of a shadow exception!
Is this just a bug, or can I change some configuration so that condor waits
longer for the missing starter?
Because it is a temporary network problem affecting
a lot of hosts, I am pretty sure that the running starter (and its program) still
exists on the host whose connection was lost.
Best
Harald