Running Condor 8.2.8 here, and I am seeing a lack of responsiveness when submitting jobs (which is unusual for us) and when running "condor_q" or "condor_submit -debug". In some cases "condor_q" does return after several minutes; in others it throws an error:
-- failed to fetch ads from: <our_scheduler_node_IP_address:51430> : <fqdn_of_same_scheduler_node>
This issue presumably started over the weekend, when someone submitted a set of jobs roughly 10x larger than usual. When "condor_q" does finish, the summary at the end shows the following:
31823 jobs; 0 completed, 31786 removed, 19 idle, 12 running, 24 held, 0 suspended
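In case it helps anyone reading along, here is a minimal Python sketch that tallies the summary line above by job state. It is my own throwaway helper, not a Condor tool: the name parse_summary is hypothetical, and it assumes the summary always has the "N jobs; n state, n state, ..." shape shown here. Percentages are relative to the total that condor_q reported.

```python
import re

# Summary line copied verbatim from the condor_q output above.
SUMMARY = "31823 jobs; 0 completed, 31786 removed, 19 idle, 12 running, 24 held, 0 suspended"


def parse_summary(line):
    """Return (total, {state: count}) from a condor_q summary line.

    Hypothetical helper; assumes the 'N jobs; n state, ...' format shown above.
    """
    total_match = re.match(r"\s*(\d+) jobs;", line)
    if total_match is None:
        raise ValueError("not a condor_q summary line: %r" % line)
    total = int(total_match.group(1))
    # Everything after the first ';' is a comma-separated list of '<count> <state>'.
    per_state = line.split(";", 1)[1]
    counts = {state: int(n) for n, state in re.findall(r"(\d+) ([a-z]+)", per_state)}
    return total, counts


total, counts = parse_summary(SUMMARY)
# Print states from most to least populated.
for state, n in sorted(counts.items(), key=lambda kv: -kv[1]):
    print(f"{state:>10}: {n:6d} ({n / total:.1%} of reported total)")
```

The point of the tally is just to make visible that nearly every entry in the queue is in the removed state, with only a few dozen idle, running, or held.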
I'm posting to see if anyone has insight into how to diagnose why the jobs aren't running. I believe the amount (>33k jobs submitted over three days) isn't unprecedented. Obviously I'm not a Condor subject-matter expert here, but am trying to grow into something close, by hook or by crook.
Thanks for any and all insights!
Eric