
Re: [HTCondor-users] First experience with the parallel universe.



Thanks for the reply, Jason.

I'm not using partitionable slots, but just in case, I went ahead and
upgraded to 8.6.5 and had the user try again. Alas, still failing in the
same way.
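
(For what it's worth, here's roughly how I confirmed the static-slot setup,
with the actual execute hostname replaced by a placeholder:

    # on one of the dedicated execute nodes, after the upgrade
    condor_version

    # from the submit node, check the slot type advertised by that machine
    condor_status -long <execute-hostname> | grep -i '^SlotType'

Everything there comes back "Static", so partitionable slots shouldn't be a
factor here.)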

Job executing on host: MPI_job

Greetings from all the nodes

The job starts on all the nodes, and then:

007 (070.000.000) 09/12 15:55:58 Shadow exception!
        Assertion ERROR on (nextResourceToStart == numNodes)
        0  -  Run Bytes Sent By Job
        0  -  Run Bytes Received By Job

And the whole thing starts over again. And over and over and over.

And if it follows the same trend, it will eventually run, after something like 70 failures. That's one of the most mystifying parts.

If anyone has thoughts, things I might try, or log files I might look in to figure this out, I'm desperately in need of all of the above.
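
So far I've mainly been staring at the user's job log and the StarterLogs on
the execute hosts; I'm assuming the ShadowLog on the dedicated submit node is
the other obvious place to look, since it's the shadow that's asserting. If
more detail would help, I can also turn up the verbosity on both sides with
something along these lines, followed by a condor_reconfig:

    # on the dedicated submit node
    SHADOW_DEBUG = D_FULLDEBUG

    # on the dedicated execute nodes
    STARTER_DEBUG = D_FULLDEBUG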


On Tue, Sep 12, 2017 at 10:32:38AM -0500, Jason Patton wrote:
> Amy,
> 
> Are you using partitionable slots on these execute nodes? If so, if
> you're able to upgrade to 8.6.5, it's possible that a recent bugfix
> might have taken care of this.
> 
> https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=6308
> 
> Jason Patton
> 
> On Mon, Sep 11, 2017 at 4:13 PM, Amy Bush <amy@xxxxxxxxxxxxx> wrote:
> > We're running a fairly long-running instantiation of htcondor here, but
> > only just recently has one of my users decided to try out the parallel
> > universe. It isn't going flawlessly, and I thought maybe someone has
> > seen this before and might be able to help. Hopefully I'm just doing or
> > overlooking something dumb and obvious.
> >
> > condor 8.6.4, which normally runs flawlessly.
> >
> > A parallel job is submitted; here's an excerpt from the user's log file:
> >
> > 001 (066.000.000) 09/10 22:40:11 Job executing on host: MPI_job
> > ...
> > 008 (066.000.000) 09/10 22:40:11 Greetings and felicitations from node 6 of 13
> > ...
> > 008 (066.000.000) 09/10 22:40:11 Greetings and felicitations from node 7 of 13
> > ...
> > (snip)
> > ...
> > 008 (066.000.000) 09/10 22:40:11 Starting Orc follower node 6
> > ...
> > 008 (066.000.000) 09/10 22:40:11 Starting Orc follower node 7
> > ...
> > (snip)
> > ...
> > 008 (066.000.000) 09/10 22:40:11 All 12 followers found, Starting Orc leader
> > ...
> > 007 (066.000.000) 09/10 22:40:48 Shadow exception!
> >      Assertion ERROR on (nextResourceToStart == numNodes)
> >      0  -  Run Bytes Sent By Job
> >      0  -  Run Bytes Received By Job
> >
> > It will then retry. And retry. And retry. And then run successfully.
> > Evidently it retried 70 times last night before it was ultimately
> > successful, on the same machine it had been failing on up until then.
> >
> > Looking in the StarterLogs for that host:
> >
> > 09/11/17 15:30:44 (pid:388) condor_read() failed: recv(fd=9) returned -1, errno = 104 Connection reset by peer, reading 5 bytes from <(dedicated hostname redacted):46369>.
> > 09/11/17 15:30:44 (pid:388) IO: Failed to read packet header
> > 09/11/17 15:30:44 (pid:388) i/o error result is 0, errno is 104
> > 09/11/17 15:30:44 (pid:388) condor_write(): Socket closed when trying to write 21 bytes to <(dedicated hostname redacted):46369>, fd is 9
> > 09/11/17 15:30:44 (pid:388) Buf::write(): condor_write() failed
> > 09/11/17 15:30:44 (pid:388) i/o error result is 0, errno is 0
> > 09/11/17 15:30:44 (pid:388) condor_write(): Socket closed when trying to write 152 bytes to <(dedicated hostname redacted):46369>, fd is 9
> > 09/11/17 15:30:44 (pid:388) Buf::write(): condor_write() failed
> > 09/11/17 15:30:44 (pid:388) ERROR "Assertion ERROR on (result)" at line 902 in file /slots/03/dir_36001/sources/src/condor_starter.V6.1/NTsenders.cpp
> > 09/11/17 15:30:44 (pid:388) condor_write(): Socket closed when trying to write 182 bytes to <(dedicated hostname redacted):46369>, fd is 9
> > 09/11/17 15:30:44 (pid:388) Buf::write(): condor_write() failed
> > 09/11/17 15:30:44 (pid:388) ERROR "Assertion ERROR on (result)" at line 902 in file /slots/03/dir_36001/sources/src/condor_starter.V6.1/NTsenders.cpp
> > 09/11/17 15:30:48 (pid:588) ******************************************************
> > 09/11/17 15:30:48 (pid:588) ** condor_starter (CONDOR_STARTER) STARTING UP
> >
> >
> > As I said, we have no experience with the parallel universe, so I'm not sure what direction to explore. We have a dedicated submit node for it and several dedicated hosts to run the jobs, and it will successfully run tiny test cases. (And there are many, many available nodes.)
> >
> > Google hasn't been able to help me so far, so I turn to you. You guys. Yous. Y'all. Any hints you might be able to provide would be deeply appreciated.
> _______________________________________________
> HTCondor-users mailing list
> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
> 
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/htcondor-users/