Hello,
I have been using Condor (7.4.0, I think) on our cluster
running Fedora 10 (x86_64) for around 2 years now. I recently upgraded
our cluster to Fedora 14 and installed the Condor package using "yum
install condor", which installed Condor 7.4.2.
After setting up the condor_config and condor_config.local files as
before (but accounting for the changes between 7.4.0 and 7.4.2), the
cluster works fine for normal "vanilla" jobs.
All the machines have ports 9600-9700 open, as well as ports 4400-5000
(required by the sshd.sh script when running parallel jobs), and
machines within the cluster are given trusted full access using
specific firewall rules.
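When reading the logs, it can help to check whether a port mentioned
there falls inside the ranges listed above. A minimal sketch, assuming
only those two ranges matter (the function name and the sample ports
are made up for illustration):

```shell
#!/bin/sh
# Return success (0) if the given port lies in one of the two ranges
# opened on the firewall: 9600-9700 or 4400-5000.
# The function name is hypothetical, not part of Condor.
port_in_opened_range() {
    port="$1"
    if [ "$port" -ge 9600 ] && [ "$port" -le 9700 ]; then return 0; fi
    if [ "$port" -ge 4400 ] && [ "$port" -le 5000 ]; then return 0; fi
    return 1
}

# Sample ports chosen only to exercise both outcomes.
for p in 4444 9650 22; do
    if port_in_opened_range "$p"; then
        echo "$p: inside an opened range"
    else
        echo "$p: outside the opened ranges"
    fi
done
```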
Test programs that check whether "mpirun" runs across multiple machines
on the cluster all work fine, which indicates that passwordless ssh and
the firewall settings are OK.
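For reference, the kind of check meant here can be sketched roughly as
follows. This is only an illustration: the host names node1 and node2
are placeholders, the helper name is made up, and the script prints the
mpirun command rather than running it.

```shell
#!/bin/sh
# Build (but do not run) an mpirun command that starts one copy of
# `hostname` per listed machine. The helper name is hypothetical.
build_mpirun_cmd() {
    hosts="$1"    # comma-separated host list, e.g. "node1,node2"
    # Number of processes = number of comma-separated fields.
    np=$(printf '%s\n' "$hosts" | awk -F',' '{ print NF }')
    printf 'mpirun -np %s --host %s hostname\n' "$np" "$hosts"
}

# Placeholder host names; substitute real machines from the cluster.
build_mpirun_cmd "node1,node2"
```

If the printed command is run by hand and each machine's name comes
back, ssh between the nodes and the firewall rules are working for
plain mpirun outside Condor.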
When I try to run an OpenMPI
parallel job using the sample "openmpiscript" wrapper, the job fails
very early in its execution because of errors thrown by
"condor_chirp".
Basically, the "openmpiscript" wrapper calls
"/usr/libexec/condor/sshd.sh" in order to prepare the ssh environment
(key generation, passing keys between the machines, and starting the ssh
server daemon) before finally running "mpirun".
Within the "sshd.sh" script, after generation of the "hostkey" and
"idkey" on the respective machines, condor uses "condor_chirp put -perm
0700 $idkey _CONDOR_REMOTE_SPOOL_DIR/_