[Condor-users] mpi jobs not dying properly



I finally got an MPI job to run on a couple of nodes in the cluster from a Condor job.

When I do a condor_rm on the job, it only dies on the node that is running the master process.
I've got these errors in my StartLog:

8/20 12:52:05 Can't read ClaimId
8/20 12:52:05 condor_write(): Socket closed when trying to write 13 bytes to <10.0.10.43:37598>, fd is 6
8/20 12:52:05 Buf::write(): condor_write() failed


10.0.10.43 is the node that still has the child process running, and I have to manually kill it.

Any thoughts?
Thanks.

--Peter