Hello,
We made a new installation of condor on our cluster at the
beginning of this week. In this new installation we upgraded from condor
6.8 to version 7.4 and we also changed our dedicated scheduler from a
machine running Fedora to one running Ubuntu 10.04.
We have condor installed in 2 shared directories (one that has binaries
for Fedora and another that has binaries for Ubuntu) and each
machine runs the release corresponding to its OS. Everything ran fine in
the first days (from Monday until today), but today the condor commands
started getting stuck. First condor_q stopped responding and after a few
minutes all the jobs just died (without our intervention). We then
restarted condor on all our machines, resubmitted the jobs and the same
thing happened again after a while (about 15 minutes). Next, we cleaned
all our condor log files, killed the daemons on all the machines,
restarted the system and submitted a small number of jobs to see how it
handled them. Everything was OK for a few hours, but now when I try to
submit more jobs, the condor_submit command gets stuck. The strangest
thing is that the jobs are submitted and start running, but the
condor_submit command does not terminate by itself.
Our entire system is based on NFS.
Can anyone help?