HTCondor Project List Archives




[Condor-devel] sshd.sh shared filesystem bug report / fix



I have been going round and round trying to get Open MPI working under Condor
and was finally able to do so only after fixing a bug in sshd.sh.  We run
a cluster of eight 8-core systems with a shared filesystem.  I recently
upgraded to 7.5.4, but I do not believe the version plays a part in this
bug, since my copy of sshd.sh in 6.9 was virtually identical.

The application in question is VASP, which likes to have a shared
filesystem to do its work, so I had held back on transferring files as part
of the job.  Running without file transfer is specifically what triggers
the bug.

The problem is that the check to see whether sshd came up relies on output
written to sshd.out in the CWD rather than to a safe location.  On a shared
filesystem every node writes to that same file, which caused all sorts of
nasty race conditions and left the state of the various sshds in doubt.
Thankfully the fix was quite simple: give each sshd a safe location to
write its stdout that is not in contention.
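
To make the race concrete, here is a minimal sketch that reproduces one
possible interleaving without running sshd at all.  Everything in it is a
hypothetical stand-in: the temporary directories play the roles of the
shared CWD and of each node's $_CONDOR_SCRATCH_DIR/tmp, and the log line is
only an illustrative example of what sshd prints on success.

```shell
#!/bin/sh
# Hypothetical stand-ins, not part of sshd.sh:
shared=$(mktemp -d)      # the shared-filesystem CWD seen by every node
scratch_a=$(mktemp -d)   # node A's private $_CONDOR_SCRATCH_DIR/tmp
scratch_b=$(mktemp -d)   # node B's private $_CONDOR_SCRATCH_DIR/tmp

# --- Shared log file (the original behaviour) ---
# Node A's sshd has started and logged success.
echo "Server listening on 0.0.0.0 port 4444." > "$shared/sshd.out"
# Node B now launches its own sshd; the '>' redirection truncates the
# same file before B's sshd has logged anything.
: > "$shared/sshd.out"
# Node A wakes from its sleep and checks the log it thinks is its own.
if grep "Server listening" "$shared/sshd.out" > /dev/null 2>&1; then
    shared_a=up
else
    shared_a=down
fi

# --- Per-node log files (the fixed behaviour) ---
echo "Server listening on 0.0.0.0 port 4444." > "$scratch_a/sshd.out"
: > "$scratch_b/sshd.out"   # B's launch no longer touches A's file
if grep "Server listening" "$scratch_a/sshd.out" > /dev/null 2>&1; then
    private_a=up
else
    private_a=down
fi

echo "shared log:  node A sees sshd $shared_a"
echo "private log: node A sees sshd $private_a"

rm -rf "$shared" "$scratch_a" "$scratch_b"
```

With the shared file, node A concludes its sshd never came up even though
it did; with a private file per node the check reflects only that node's
own sshd.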

Once that fix was in place, the openmpiscript worked as expected even on a
shared filesystem.  I also believe this fix might benefit others who depend
on the sshd.sh script, since it makes the script behave consistently with
or without file transfer.

Cheers,
Eric Warnke

Research IT
State University of New York at Albany

--- /usr/libexec/condor/sshd.sh 2010-10-18 16:22:25.000000000 -0400
+++ /network/rit/misc/devel/condor/sshd.sh      2010-10-22 13:21:03.000000000 -0400
@@ -81,13 +81,13 @@
 do

 # Try to launch sshd on this port
-$SSHD -p$PORT -oAuthorizedKeysFile=${idkey}.pub -h$hostkey -De -f/dev/null -oStrictModes=no -oPidFile=/dev/null -oAcceptEnv=_CONDOR < /dev/null > sshd.out 2>&1 &
+$SSHD -p$PORT -oAuthorizedKeysFile=${idkey}.pub -h$hostkey -De -f/dev/null -oStrictModes=no -oPidFile=/dev/null -oAcceptEnv=_CONDOR < /dev/null > $_CONDOR_SCRATCH_DIR/tmp/sshd.out 2>&1 &

 pid=$!

 # Give sshd some time
 sleep 2
-if grep "Server listening" sshd.out > /dev/null 2>&1
+if grep "Server listening" $_CONDOR_SCRATCH_DIR/tmp/sshd.out > /dev/null 2>&1
 then
        done=1
 else
@@ -99,7 +99,7 @@
 done

 # Don't need this anymore
-rm sshd.out
+rm $_CONDOR_SCRATCH_DIR/tmp/sshd.out

 # create contact file
 hostname=`hostname`