Re: [Condor-users] Condor monitoring alternatives



On Fri February 8 2008, Brent Strong wrote:
> Here at RIT, we've been working on building up a respectable Condor
> pool with some success.  We're now running into the issue of
> monitoring our clients.  We have enough client machines that it is now
> impossible to visually parse the condor_status output to find
> "stragglers", so I'm looking for an automated solution.
>
> I'm specifically looking for a lightweight alternative to hawkeye,
> possibly something we could integrate into or have as an addition to
> our quick stats look ( http://stats.rc.rit.edu/condor/ ).
>
> Has anyone written a simple script or similar that contains a master
> list of machines that should be up and compares the output of
> condor_status to it?  It seems to be something that would be very
> useful and I'm hoping I can reuse someone else's code.
>
> Ideally, we want a lightweight webpage that shows a list of machines
> (by hostname, IP, whatever) that Condor is installed on and their
> corresponding status (up and running condor, up but not running/
> responding to condor, down).  Combining the output of a ping test,
> condor_status and a master list of machines, these states should be
> easily determined.  My question is: has anyone done this?

Ah, yes.  You should look at the Condor Pool Tools 
( http://www.cs.wisc.edu/condor/tools/PoolTools/ ) and possibly Hawkeye.  The 
pool tools are a set of tools for doing just what you describe, or at least 
the chunk of it that does the heavy lifting: knowing what the list is, 
querying the collector, looking for differences, etc.  The current tarball 
that's out there is version 0.1.2 and is woefully out of date (I'm the 
developer of these tools).
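
To make the idea concrete, here is a rough Python sketch of the comparison 
described above (master list vs. collector vs. ping).  It is NOT the pool 
tools code and does not use their configuration syntax; the function names, 
the one-hostname-per-line master list file, and the Linux-style ping flags 
are all assumptions made for illustration.  It relies only on 
condor_status's -format option to print the Machine attribute of each ad.

```python
import subprocess


def condor_machines():
    """Return the set of hostnames the collector currently knows about."""
    out = subprocess.run(
        ["condor_status", "-format", "%s\n", "Machine"],
        capture_output=True, text=True, check=True,
    ).stdout
    # A machine with several slots is listed once per slot; the set dedupes.
    return {line.strip().lower() for line in out.splitlines() if line.strip()}


def responds_to_ping(host):
    """One ICMP echo with a 2 second timeout (assumes Linux-style ping flags)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", host],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def classify(master, in_condor, ping=responds_to_ping):
    """Map each host in the master list to one of three states."""
    status = {}
    for host in master:
        name = host.strip().lower()
        if name in in_condor:
            status[name] = "up, running condor"
        elif ping(name):
            status[name] = "up, not running condor"
        else:
            status[name] = "down"
    return status


def load_master(path):
    """Read a hypothetical master list: one hostname per line, '#' comments."""
    with open(path) as f:
        lines = [ln.strip() for ln in f]
    return [ln for ln in lines if ln and not ln.startswith("#")]


def report(path):
    """Print 'host<TAB>state' for every machine in the master list."""
    states = classify(load_master(path), condor_machines())
    for host, state in sorted(states.items()):
        print(f"{host}\t{state}")
```

Calling report("machines.txt") from cron and redirecting the output into a 
web-visible file would give a crude status page.  The ping test is passed 
into classify() as a parameter so that it can be swapped out or stubbed.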

Hawkeye can run the pool tools periodically.  Indeed, there's a Hawkeye 
"module" just for that and for other pool health operations, and we run it 
on our pool here at UW.

Back to the pool tools: you probably really want to start with the latest 
version (let's call it 0.2), which can't be downloaded at the moment because 
I haven't created a tarball of it.  The main reason to start with it is that 
I made some major changes to the syntax of its configuration files, and it'd 
seem foolish for you to write them now only to rewrite them later.  Of 
course, I haven't finished updating the documentation for it yet, either. :(

So, if you're interested, send me an email, and I'll work with you one-on-one 
to get you set up and running with them.  :)

-Nick

-- 
           <<< Why, oh, why, didn't I take the blue pill? >>>
 /`-_    Nicholas R. LeRoy               The Condor Project
{     }/ http://www.cs.wisc.edu/~nleroy  http://www.cs.wisc.edu/condor
 \    /  nleroy@xxxxxxxxxxx              The University of Wisconsin
 |_*_|   608-265-5761                    Department of Computer Sciences