Well, I think the problem is becoming apparent. Condor launches a startd, which calls mp2script.sh; the script runs mpdboot to start the mvapich2 daemons on all the machines, and then runs mpiexec to launch the job. When you do a condor_rm, it kills the startd and it kills the mpd ring, but that leaves the actual mpiexec processes orphaned. Somehow I need Condor to keep track of what the script is running, and clean up after itself.
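One way a wrapper script can clean up after itself is to trap the SIGTERM that Condor delivers on condor_rm and explicitly kill the child it launched. This is only a sketch under assumptions about the script's layout; `run_with_cleanup` is an illustrative name, not part of the poster's actual mp2script.sh, and the mpd commands appear only in comments.

```shell
#!/bin/sh
# Hypothetical sketch of a cleanup wrapper for an MPI launch script.
# In the real mp2script.sh the command passed in would be mpiexec,
# and the trap would also tear down the mpd ring (e.g. mpdallexit).

run_with_cleanup() {
    # Launch the long-running child (mpiexec in the real script).
    "$@" &
    child=$!
    # On SIGTERM/SIGINT (what condor_rm ultimately delivers), forward
    # the signal to the child and reap it instead of leaving an orphan.
    trap 'kill -TERM "$child" 2>/dev/null; wait "$child"' TERM INT
    wait "$child"
    rc=$?
    trap - TERM INT
    return "$rc"
}

# Demo with a harmless stand-in for mpiexec:
run_with_cleanup sleep 1
echo "child exit status: $?"
```

Note that with mvapich2 the orphaned processes live on remote nodes, so killing the local mpiexec alone may not be enough; the trap would likely also need to shut down the remote side (mpdallexit, or whatever teardown the mpd ring supports) before exiting.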
On Aug 20, 2009, at 13:34, Peter Doherty wrote:
I turned on D_NETWORK debugging on the startd of the machine with the zombie processes:

8/20 13:31:24 ACCEPT from=<10.0.10.43:47668> newfd=6 to=<10.0.10.43:34330>
8/20 13:31:24 condor_read(fd=6 <10.0.10.43:47668>,,size=4,timeout=1,flags=2)
8/20 13:31:24 condor_read(fd=6 <10.0.10.43:47668>,,size=5,timeout=1,flags=0)
8/20 13:31:24 condor_read(fd=6 <10.0.10.43:47668>,,size=698,timeout=1,flags=0)
8/20 13:31:24 encrypting secret
8/20 13:31:24 condor_read(fd=6 <10.0.10.43:47668>,,size=5,timeout=20,flags=0)
8/20 13:31:24 condor_read(fd=6 <10.0.10.43:47668>,,size=74,timeout=20,flags=0)
8/20 13:31:24 Stream::get(int) incorrect pad received: 4b
8/20 13:31:24 Can't read ClaimId
8/20 13:31:24 condor_write(fd=6 <10.0.10.43:47668>,,size=13,timeout=20,flags=0)
8/20 13:31:24 condor_write(): socket 6 is readable
8/20 13:31:24 condor_write(): Socket closed when trying to write 13 bytes to <10.0.10.43:47668>, fd is 6
8/20 13:31:24 Buf::write(): condor_write() failed

and from the machine running the master MPI process:

8/20 13:31:00 ACCEPT from=<10.0.10.43:55702> newfd=6 to=<10.0.10.42:54879>
8/20 13:31:00 condor_read(fd=6 <10.0.10.43:55702>,,size=4,timeout=1,flags=2)
8/20 13:31:00 condor_read(fd=6 <10.0.10.43:55702>,,size=5,timeout=1,flags=0)
8/20 13:31:00 condor_read(fd=6 <10.0.10.43:55702>,,size=699,timeout=1,flags=0)
8/20 13:31:00 encrypting secret
8/20 13:31:00 condor_read(fd=6 <10.0.10.43:55702>,,size=5,timeout=20,flags=0)
8/20 13:31:00 condor_read(fd=6 <10.0.10.43:55702>,,size=74,timeout=20,flags=0)
8/20 13:31:00 Stream::get(int) incorrect pad received: ffffffc0
8/20 13:31:00 Can't read ClaimId
8/20 13:31:00 condor_write(fd=6 <10.0.10.43:55702>,,size=13,timeout=20,flags=0)
8/20 13:31:00 condor_write(): socket 6 is readable
8/20 13:31:00 condor_write(): Socket closed when trying to write 13 bytes to <10.0.10.43:55702>, fd is 6
8/20 13:31:00 Buf::write(): condor_write() failed
8/20 13:31:00 CLOSE <10.0.10.42:54879> fd=6

On Aug 20, 2009, at 12:54, Peter Doherty wrote:

I finally got an MPI job to run on a couple of nodes in the cluster from a Condor job. But when I do a condor_rm on the job, it only dies on the node that is running the master process. I've got these errors in my StartLog:

8/20 12:52:05 Can't read ClaimId
8/20 12:52:05 condor_write(): Socket closed when trying to write 13 bytes to <10.0.10.43:37598>, fd is 6
8/20 12:52:05 Buf::write(): condor_write() failed

10.0.10.43 is the node that still has the child process running, and I have to manually kill it. Any thoughts? Thanks.

--Peter

_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users
The archives can be found at:
https://lists.cs.wisc.edu/archive/condor-users/