There have been discussions about putting some statistical monitors into
different daemons, including the Schedd, which would be direct indicators of
backup etc. However, that's just talk at the moment.
From a practical point of view, a few options are:
0) You can use the knowledge that the Negotiator puts a Slot into the
Matched state, and the Schedd then moves it into the Claimed/Idle state and
then into Claimed/Busy. A large number of Matched slots can indicate
backup. A large number of Claimed/Idle slots can also be an indicator.
1) You can look at the exit status of shadows in the Schedd's log and why
the exits happened in the ShadowLog. Seeing exit code 4 or "FAILED TO SEND
INITIAL KEEP ALIVE TO OUR PARENT" is an indication of a backed up Schedd.
2) You can look at the response time of a condor_q query that should return
no results and wouldn't initiate an O(n) scan of the queue.
3) Looking at the history file (with condor_history), you can compare the
CompletionDate with the EnteredCurrentStatus and JobFinishedHookDone
attributes. Any drift can indicate a backup in the Schedd.
4) Compare the count of running jobs reported by the Schedd's queue
(condor_q | tail -n1) to the count shown by condor_status -schedd. Some
small drift is acceptable.
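
As a rough sketch of options 0 and 4 (not a tested monitoring tool), something like the following Python could do the counting. It assumes you feed it one "State Activity" pair per line, for example from condor_status -format "%s " State -format "%s\n" Activity, plus the last line of condor_q and the running-job count you read off condor_status -schedd. The function names and the limits of 50 are made up for illustration; pick thresholds that fit your pool.

```python
import re
from collections import Counter


def count_slot_states(lines):
    """Count (State, Activity) pairs from lines such as 'Claimed Busy'."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:  # skip blank or malformed lines
            counts[(fields[0], fields[1])] += 1
    return counts


def backlog_hints(counts, matched_limit=50, claimed_idle_limit=50):
    """Return hints when Matched or Claimed/Idle slots pile up.

    The default limits are arbitrary placeholders, not recommended values.
    """
    hints = []
    matched = sum(n for (state, _activity), n in counts.items()
                  if state == "Matched")
    claimed_idle = counts[("Claimed", "Idle")]
    if matched > matched_limit:
        hints.append(f"{matched} Matched slots")
    if claimed_idle > claimed_idle_limit:
        hints.append(f"{claimed_idle} Claimed/Idle slots")
    return hints


def running_from_summary(summary_line):
    """Pull the 'N running' figure out of the condor_q summary line."""
    match = re.search(r"(\d+)\s+running", summary_line)
    return int(match.group(1)) if match else None


def running_drift(summary_line, schedd_running):
    """Absolute difference between the queue's and the Schedd ad's counts."""
    queued = running_from_summary(summary_line)
    if queued is None:
        return None
    return abs(queued - schedd_running)
```

For example, running_drift("10 jobs; 3 idle, 7 running, 0 held", 5) gives 2. How much drift counts as "small" is a judgment call for your pool.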
I'll just stop here. 8o)
If the jobs aren't running, condor_q -run should not pick them up and show
you ???s in the first place.
Best,
matt
Mag Gam wrote:
is there a way to see if the schedd is backed up? How can I see the
real status of it?
It seems when I submit many jobs (even not running), I get this problem.
On Wed, May 26, 2010 at 2:16 PM, Matthew Farrellee <matt@xxxxxxxxxx>
wrote:
On 05/25/2010 09:05 PM, Mag Gam wrote:
Previously, I had 1 scheduler and 1 collector for 2000 nodes (each
with 16 cores) giving me 32000 slots. Everything was functioning fine,
however I used to get a lot of '???????' when I did condor_q -run .
Recently, I added an extra scheduler and a collector to complement my
previous scheduler. I noticed the '?????' is completely gone! I was
wondering if there was a relation between this problem and having an
extra collector and scheduler in my pool.
It's entirely possible that the "???"s were because your Schedd was
backed up. It could have marked the jobs as running but info about the
Startd where the job was running had not made its way back yet.
Best,
matt