HTCondor Project List Archives




Re: [Condor-devel] Replacing condor_status



inline 

----- Original Message ----- 

> From: "Brian Bockelman" <bbockelm@xxxxxxxxxxx>
> To: "condor-devel@xxxxxxxxxxx Developers" <condor-devel@xxxxxxxxxxx>
> Cc: "Carl Lundstedt" <clundst@xxxxxxxxxxxxxxxx>, "Garhan Attebury"
> <attebury@xxxxxxxxxxx>
> Sent: Thursday, October 11, 2012 8:51:58 AM
> Subject: [Condor-devel] Replacing condor_status

> Ok, maybe I just used the subject line to get your attention - but
> I've honestly found condor_status less useful over time (as our
> cluster grows and when we switched to p-slots). I wanted to kick off
> a few ideas on the mail list, and then will file a ticket.

> I want to use "condor_status" as a mechanism to get an overview of my
> pool. I often have the following questions about the basic
> dimensions of the pool:
> - How large is my pool?
> - How full is my pool?
> - How busy is the pool? (I.e., if a slot is claimed, is it also fully
> utilized by the job?)
> And questions about nodes in abnormal state:
> - Are there idle nodes (in the face of large queues)?
> - Are there poorly utilized nodes?

> I claim that condor_status does not provide meaningful answers for
> the above questions.

> Here are some deficiencies I've noted:
> 1) Entirely too verbose for large pools. Our site is 3k cores and
> growing. No human can process that many lines of output.
> 2) Does not take into account the significant differences between
> p-slots and "traditional" slots.
> - State, LoadAv, Mem, and Activity time have different meanings for
> p-slots.
> 3) Does not advertise number of cores in the output.
> 4) Uses outdated terminology - "Machines" in the summary means
> "Slots". Slots don't have a significant meaning when the slots are
> transient.
> 5) The "Name" column is truncated for every node I own. I have a
> 300-character wide terminal, yet the Name is limited to about 30
> characters!
> 6) The Arch and OS column is the same for every node I own; quite the
> waste of space.
> 7) I can't easily determine if the occupancy rate is reasonable for
> the current load. A 50% occupied pool is "no problem" if there are
> no jobs in queue. There might be a big problem if there are 10k jobs
> in queue.
> 8) No units on memory numbers!
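
Point 7 can be made concrete with a small check. This is only a sketch: the function name and its three inputs are hypothetical, and it assumes the tool already knows the pool's total and available CPU counts and the number of idle jobs across the schedds.

```python
def occupancy_warning(cpus_total, cpus_avail, idle_jobs):
    """Flag idle CPUs only when there is queued work that could use them.

    All three inputs are assumed to be gathered elsewhere; nothing here
    queries a collector or a schedd.
    """
    if idle_jobs > 0 and cpus_avail > 0:
        return f"{cpus_avail} of {cpus_total} CPUs idle with {idle_jobs} jobs queued"
    return None


# A half-empty pool is fine with an empty queue, suspicious with a full one.
print(occupancy_warning(3000, 1500, 0))
print(occupancy_warning(3000, 1500, 10000))
```

A real check would probably also need to ask whether the queued jobs can match the idle machines at all, which this sketch ignores.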


+1 on pretty much everything above.

> I propose we do not touch condor_status - changing output has too
> high a chance of breaking user scripts.  

> Instead, I propose we have a
> new tool - "condor_pool_summary" - which addresses the needs of
> sysadmins (condor_status is plenty powerful for machine-readable
> needs). No XML format, no stable output guarantees, maybe limited
> query semantics, not entirely grep-friendly. I'd start with Matt's
> work here:
> http://spinningmatt.wordpress.com/2012/10/01/partitionable-slot-utilization/
> and extend it in light of the comments above.

> Here's a draft output:

> MACHINE STATES
> Name              Cpus  Avail  Util%  Memory  Avail  Util%  Notes
> Linux / x86-64 machines at red.hcc.unl.edu
> slot1@node064        4      0   100%    9.5G    15M    99%
> slot1@node065        4      0   100%    9.5G    15M    99%
> slot1@node066        4      0   100%    9.5G    15M    99%
> slot1@node067        4      0   100%    9.5G    15M    99%
> slot1@node068        4      0   100%    9.5G    15M    99%
> Linux / x86-64 machines at sandhills.hcc.unl.edu
> slot1@red-d11n12    16      1    93%     38G    46M    99%
> slot1@red-d11n10    16      2    87%     38G    46M    99%
> slot1@red-d11n11    16      2    87%     38G    46M    99%  2 Retiring
> slot1@red-d11n13    16      2    87%     38G    46M    99%
> slot1@red-d11n14    16      2    87%     38G    46M    99%  Owner
> slot1@red-d11n15    16      2    87%     38G   640M    98%
> slot1@red-d11n1     16      3    81%     38G    46M    99%  Draining
> Windows / x86 machines at foo.hcc.unl.edu
> slot1@red-d9n3      16      2    87%     38G    48M    99%
> slot1@red-d9n4      16      2    87%     38G    48M    99%
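
For reference, the CPU Util% figures in the machine rows of the draft are consistent with (Cpus - Avail) / Cpus truncated to a whole percent. That reading is an assumption taken from the sample numbers, not something the draft states, and the helper name below is made up for illustration.

```python
def util_percent(total, avail):
    """Percent of a resource in use, truncated to a whole percent."""
    if total <= 0:
        return 0
    return (total - avail) * 100 // total


# (Cpus, Avail) pairs taken from the draft's machine rows.
for cpus, avail in [(4, 0), (16, 1), (16, 2), (16, 3)]:
    print(f"{cpus:>4} {avail:>5} {util_percent(cpus, avail):>4}%")
```

Integer arithmetic keeps the result exact; rounding to nearest instead of truncating would show 94% and 88% where the draft shows 93% and 87%.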

> QUEUE SUMMARY
> ScheddCount  RunningJobs  IdleJobs  HeldJobs
>           5         3000     10000         3

> POOL SUMMARY
>               Cpus  Avail  Util%  Memory  Avail  Util%  Owner  Matched  Preempting
> Linux/x86-64   100      4    91%    300G   700M    99%      2        0           0
> Windows/x86     32      4    87%     80G   100M    99%      0        0           0
> Total          132      8    89%    380G   800M    99%      2        0           0
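
Since the summary numbers in the draft are made up, here is a rough sketch of how the per-platform and Total rows could be rolled up from per-machine records. The record layout, the sample values, and the `summarize` name are all hypothetical; only the CPU columns are shown.

```python
from collections import defaultdict

# Hypothetical per-machine records: (platform, cpus, cpus_avail).
machines = [
    ("Linux/x86-64", 4, 0),
    ("Linux/x86-64", 16, 1),
    ("Linux/x86-64", 16, 2),
    ("Windows/x86", 16, 2),
    ("Windows/x86", 16, 2),
]


def summarize(rows):
    """Return {platform: (cpus, avail, util_percent)} plus a 'Total' entry."""
    sums = defaultdict(lambda: [0, 0])
    for platform, cpus, avail in rows:
        # Every machine counts toward its own platform and toward the total.
        for key in (platform, "Total"):
            sums[key][0] += cpus
            sums[key][1] += avail
    return {
        key: (cpus, avail, (cpus - avail) * 100 // cpus if cpus else 0)
        for key, (cpus, avail) in sums.items()
    }


summary = summarize(machines)
# Stable sort on a boolean key keeps platforms in order and puts Total last.
for key in sorted(summary, key=lambda k: k == "Total"):
    cpus, avail, util = summary[key]
    print(f"{key:<14} {cpus:>5} {avail:>5} {util:>4}%")
```

The Total utilization is computed from the summed counts rather than by averaging the per-platform percentages, so larger platforms weigh more.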

> Notes:
> - Copy/paste above into a fixed-width-font window or you will go
> crazy. I didn't bother to make the summary correct - it's a made-up
> output, don't worry about it.
> - I would really prefer the utilization numbers to be "live" - not
> the claimed / total, but actual (CPU used by Condor jobs) / (CPU
> claimed by Condor jobs). But that might be a bigger project.
> - Adjust the width of the columns to the size of the data; only
> truncate when you hit terminal size limits.
> - Everything gets units and 2 significant figures.
> - The "Notes" column denotes states or activities that are not
> "Claimed/Busy" or "Avail/Idle".
> - Group by OS, arch, and domain (if available). Sort in ascending
> order of idle slots.
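
Three of the notes above (units with 2 significant figures, grouping and sorting, and data-sized columns) are mechanical enough to sketch. Everything here is an assumption for illustration: the record fields, the MB input unit, the sample memory values, and the function names are made up, and nothing talks to a real collector.

```python
import shutil
from itertools import groupby


def fmt_mem(mb):
    """Format a size given in MB with a unit and 2 significant figures."""
    value, unit = (mb / 1024, "G") if mb >= 1024 else (float(mb), "M")
    return f"{float(f'{value:.2g}'):g}{unit}"


def grouped(rows):
    """Group rows by (os, arch, domain); within a group, fewest idle CPUs first."""
    group_key = lambda r: (r["os"], r["arch"], r["domain"])
    ordered = sorted(rows, key=lambda r: (group_key(r), r["avail"]))
    return [(k, list(g)) for k, g in groupby(ordered, key=group_key)]


def render(table, width):
    """Size each column to its widest cell; shrink column 0 only on overflow."""
    widths = [max(len(row[i]) for row in table) for i in range(len(table[0]))]
    overflow = sum(widths) + (len(widths) - 1) - width
    if overflow > 0:
        widths[0] = max(1, widths[0] - overflow)
    return [
        " ".join(cell[:w].ljust(w) for cell, w in zip(row, widths)).rstrip()
        for row in table
    ]


# Hypothetical records; mem_mb values are chosen to print as 38G and 9.5G.
rows = [
    {"os": "Linux", "arch": "x86-64", "domain": "sandhills.hcc.unl.edu",
     "name": "slot1@red-d11n1", "cpus": 16, "avail": 3, "mem_mb": 38912},
    {"os": "Linux", "arch": "x86-64", "domain": "sandhills.hcc.unl.edu",
     "name": "slot1@red-d11n12", "cpus": 16, "avail": 1, "mem_mb": 38912},
    {"os": "Linux", "arch": "x86-64", "domain": "red.hcc.unl.edu",
     "name": "slot1@node064", "cpus": 4, "avail": 0, "mem_mb": 9728},
]

term_width = shutil.get_terminal_size().columns
for (os_name, arch, domain), members in grouped(rows):
    print(f"{os_name} / {arch} machines at {domain}")
    table = [["Name", "Cpus", "Avail", "Memory"]] + [
        [m["name"], str(m["cpus"]), str(m["avail"]), fmt_mem(m["mem_mb"])]
        for m in members
    ]
    print("\n".join(render(table, term_width)))
```

Only the Name column is ever truncated, and only when the table would otherwise exceed the terminal width, which is the behaviour the third note asks for. The "live" utilization from the second note is left out because it needs per-job CPU accounting.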

> Thoughts?

I like it, and it's accurately named.  I sometimes think we should start shifting the internal nomenclature in reporting to something circa 2000 :-P (nodes, cores, MB, etc.). 

> Brian
