Hello all. I'm running a script that detects the number of jobs running on my submitter. Basically, the script works by calling condor_q and parsing the totals line at the end of the condor_q output.
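A minimal sketch of this kind of totals-line parsing, in Python, might look like the following. It is only an illustration, not the actual script. It assumes the summary line has the form "5 jobs; 0 completed, 0 removed, 3 idle, 2 running, 0 held, 0 suspended", which can differ between HTCondor versions. The function name parse_totals is hypothetical.

```python
import re

# Assumed shape of the condor_q summary line, e.g.
#   "5 jobs; 0 completed, 0 removed, 3 idle, 2 running, 0 held, 0 suspended"
# The exact wording may vary between HTCondor versions.
TOTALS_RE = re.compile(r"(\d+) jobs;(.*)")
COUNT_RE = re.compile(r"(\d+) (\w+)")


def parse_totals(output):
    """Return a dict of job counts from the last totals line, or None."""
    # Scan from the bottom, since the totals line is at the end of the output.
    for line in reversed(output.splitlines()):
        match = TOTALS_RE.search(line)
        if match:
            counts = {"jobs": int(match.group(1))}
            # Each "<number> <status>" pair becomes one dict entry.
            for number, status in COUNT_RE.findall(match.group(2)):
                counts[status] = int(number)
            return counts
    # No totals line found, e.g. condor_q failed or printed nothing useful.
    return None
```

To use it against a live queue, the output text could come from something like subprocess.run(["condor_q"], capture_output=True, text=True).stdout. Returning None when no totals line is found lets the caller tell "condor_q gave no summary" apart from "zero jobs".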
I originally called condor_q every 3 seconds. After several days of running the script this way, suddenly, on the 3rd or 4th invocation, condor_q stopped displaying any jobs in any status whatsoever. When I tried to remove these ghost jobs, condor_rm exited with an error message saying that the user's jobs could not be found.
Once condor_q breaks, it does not report newly created jobs normally either: it will often drop them from its output after a single invocation. Restarting all the daemons sometimes restores condor_q. Sometimes simply waiting half a day works.
Has anyone run into this?