Re: [Condor-users] having multiple schedulers and collectors
- Date: Fri, 04 Jun 2010 07:13:02 -0400
- From: Matthew Farrellee <matt@xxxxxxxxxx>
- Subject: Re: [Condor-users] having multiple schedulers and collectors
On 06/03/2010 05:34 PM, Ian Chesal wrote:
> On Tue, Jun 1, 2010 at 1:24 PM, Matthew Farrellee <matt@xxxxxxxxxx> wrote:
>
> There have been discussions about putting some statistical monitors
> into different daemons, including the Schedd, which would be direct
> indicators of backup etc. However, that's just talk at the moment.
>
>
> That would be useful to have so long as the information can be gotten to
> when things are falling apart. :D
>
>
> From a practical point of view, a few options are:
> 0) You can use the knowledge that the Negotiator puts a Slot into
> the Matched state, the Schedd then puts it into the Claimed/Idle
> state and then into Claimed/Busy, and observe that a large number of
> Matched slots can indicate backup. A large number of Claimed/Idle
> slots can also be an indicator.
> 1) You can look at the exit status of shadows in the Schedd's log
> and why the exits happened in the ShadowLog. Seeing exit code 4 or
> "FAILED TO SEND INITIAL KEEP ALIVE TO OUR PARENT" is an indication
> of a backed up Schedd.
>
>
>
> 2) You can look at the response time of a condor_q query that should
> return no result and wouldn't initiate an O(n) scan of the queue.
>
>
> I generally advise against running condor_q if you think your scheduler
> is overloaded. It really only makes things worse. But what kind of
> condor_q query returns nothing and doesn't initiate a queue scan?
> Something like:
>
> condor_q -better-analyze <job>
>
> ?
>
> I can't think of any others that are safe in this case and even the
> above is a call that's not going to help the situation.
>
>
> 3) Looking at the history file (with condor_history), you can
> compare the CompletionDate with the EnteredCurrentStatus and
> JobFinishedHookDone attributes. Any drift can indicate a backup in
> the Schedd.
>
>
> This is assuming you have jobs completing of course. And history turned on.
History is on by default.
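
A rough sketch of the drift check in option 3, in Python. It assumes the history ads have already been dumped as long-form text, one "Attr = value" per line with a blank line between ads; the sample text and the helper names parse_ads and completion_drift are made up for illustration and are not part of Condor.

```python
def parse_ads(text):
    """Split 'Attr = value' text into one dict per blank-line-separated ad."""
    ads, current = [], {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the ad we were collecting.
            if current:
                ads.append(current)
                current = {}
            continue
        name, sep, value = line.partition(" = ")
        if sep:
            current[name] = value
    if current:
        ads.append(current)
    return ads


def completion_drift(ad):
    """Seconds from CompletionDate to each later bookkeeping timestamp."""
    done = int(ad["CompletionDate"])
    drift = {}
    for attr in ("EnteredCurrentStatus", "JobFinishedHookDone"):
        if attr in ad:
            drift[attr] = int(ad[attr]) - done
    return drift


# Hypothetical sample data, not real output.
sample = """\
ClusterId = 12
CompletionDate = 1275649982
EnteredCurrentStatus = 1275649982
JobFinishedHookDone = 1275649990

ClusterId = 13
CompletionDate = 1275650000
EnteredCurrentStatus = 1275650045
"""

for ad in parse_ads(sample):
    print(ad["ClusterId"], completion_drift(ad))
```

How much drift counts as a backup depends on the pool; the sketch only reports the differences.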
> 4) Compare the count of running jobs reported by the Schedd
> (condor_q | tail -n1) to the count shown by condor_status -schedd.
> Some small drift is acceptable.
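
The comparison in option 4 could be sketched in Python as below. It assumes the last line of condor_q output is a summary containing "N running", and that the advertised count (for example the Schedd ad's TotalRunningJobs) has already been fetched as an integer. The sample summary line, the function names, and the tolerance of 5 are all made up for illustration.

```python
import re


def running_from_summary(summary_line):
    """Pull the running-job count out of a condor_q style summary line."""
    match = re.search(r"(\d+)\s+running", summary_line)
    if match is None:
        raise ValueError("no running count in: %r" % summary_line)
    return int(match.group(1))


def counts_drifted(queue_running, advertised_running, tolerance=5):
    """True when the two views disagree by more than the allowed slack."""
    return abs(queue_running - advertised_running) > tolerance


# Hypothetical summary line, not real output.
summary = "120 jobs; 20 idle, 97 running, 3 held"
queue_running = running_from_summary(summary)

print(counts_drifted(queue_running, 95))  # small drift
print(counts_drifted(queue_running, 60))  # large drift
```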
>
>
> If you're trying to avoid using condor_q you can look at the
> job_queue.log file instead. If you grep on "JobStatus 2" you'll get a
> list of running jobs to compare to the condor_status view of the world.
job_queue.log is a transaction log and generally you can't just grep it. You'd have to replay it.
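
To illustrate why a plain grep misleads, here is a minimal replay sketch in Python. The record layout is an assumption about the log format and should be checked against your Condor version: 101 creates an ad, 102 destroys it, 103 sets an attribute, 104 deletes one, and 105/106 open and close a transaction. The sample log is invented.

```python
def replay(lines):
    """Rebuild job ads by applying log records in order.

    Records inside a transaction are buffered and applied only when the
    transaction is closed; an unfinished trailing transaction is dropped.
    """
    jobs = {}
    pending = None  # buffered records while a transaction is open

    def apply(op, args):
        if op == "101" and args:                 # assumed: new ad
            jobs[args[0]] = {}
        elif op == "102" and args:               # assumed: destroy ad
            jobs.pop(args[0], None)
        elif op == "103" and len(args) == 3:     # assumed: set attribute
            jobs.setdefault(args[0], {})[args[1]] = args[2]
        elif op == "104" and len(args) >= 2:     # assumed: delete attribute
            jobs.get(args[0], {}).pop(args[1], None)

    for raw in lines:
        # Split into at most 4 fields so attribute values keep their spaces.
        parts = raw.rstrip("\n").split(" ", 3)
        op, args = parts[0], parts[1:]
        if op == "105":                          # assumed: begin transaction
            pending = []
        elif op == "106":                        # assumed: end transaction
            for p_op, p_args in pending or []:
                apply(p_op, p_args)
            pending = None
        elif pending is not None:
            pending.append((op, args))
        else:
            apply(op, args)
    return jobs


# Invented sample log. "JobStatus 2" appears on two lines, but job 2.0
# later moves to status 4 and the final transaction is never closed.
sample_log = """\
105
101 1.0 Job Machine
103 1.0 JobStatus 1
106
103 1.0 JobStatus 2
105
101 2.0 Job Machine
103 2.0 JobStatus 2
106
103 2.0 JobStatus 4
105
103 1.0 JobStatus 4
"""

jobs = replay(sample_log.splitlines())
running = sorted(k for k, ad in jobs.items() if ad.get("JobStatus") == "2")
print(running)
```

A grep for "JobStatus 2" on this sample matches two lines, while the replayed state has only job 1.0 running.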
Best,
matt
> Hope that helps!
>
> - Ian