Recently our cluster (running Condor 6.7.18) experienced an extremely high load (~800) caused by a very large number of globus-job-manager scripts running at once. The cluster was fully utilized with ~200 running jobs, but there were ~500 or more globus-job-manager scripts running. At one point, when I was able to run Condor commands, Condor reported ~3000 jobs in the queue, most of them idle.

Unfortunately, for much of this time the high load kept me from running condor_q, condor_rm, or any other Condor command, so I could not remove the idle jobs from the queue or kill the running jobs.

Is there a backdoor way to manipulate the Condor queue and remove jobs without going through condor_rm? Or are there any suggestions on how to recover from an overloaded queue?

--Mike