Ok, maybe I just used the subject line to get your attention - but I've honestly found condor_status less useful over time (as our cluster has grown and since we switched to p-slots). I wanted to kick off a few ideas on the mailing list, and then will file a ticket.

I want to use "condor_status" as a mechanism to get an overview of my pool. I often have the following questions about the basic dimensions of the pool:
- How large is my pool?
- How full is my pool?
- How busy is the pool? (I.e., if a slot is claimed, is it also fully utilized by the job?)

And questions about nodes in an abnormal state:
- Are there idle nodes (in the face of large queues)?
- Are there poorly utilized nodes?

I claim that condor_status does not provide meaningful answers to the above questions. Here are some deficiencies I've noted:
1) Entirely too verbose for large pools. Our site is 3k cores and growing. No human can process that many lines of output.
2) Does not take into account the significant differences between p-slots and "traditional" slots.
   - State, LoadAv, Mem, and Activity time have different meanings for p-slots.
3) Does not advertise the number of cores in the output.
4) Uses outdated terminology - "Machines" in the summary means "Slots". Slots don't have a significant meaning when the slots are transient.
5) The "Name" column is truncated for every node I own. I have a 300-character wide terminal, yet the Name is limited to about 30 characters!
6) The Arch and OS columns are the same for every node I own; quite the waste of space.
7) I can't easily determine if the occupancy rate is reasonable for the current load. A 50% occupied pool is "no problem" if there are no jobs in queue. There might be a big problem if there are 10k jobs in queue.
8) No units on memory numbers!

I propose we do not touch condor_status - changing output has too high a chance of breaking user scripts.
Instead, I propose we have a new tool - "condor_pool_summary" - which addresses the needs of sysadmins (condor_status is plenty powerful for machine-readable needs). No XML format, no stable output guarantees, maybe limited query semantics, not entirely grep-friendly.

I'd start with Matt's work here:
http://spinningmatt.wordpress.com/2012/10/01/partitionable-slot-utilization/
and extend it in light of the comments above. Here's a draft output:

MACHINE STATES
Name               Cpus  Avail  Util%  Memory  Avail  Util%  Notes

Linux / x86-64 machines at red.hcc.unl.edu
slot1@node064         4      0   100%    9.5G    15M    99%
slot1@node065         4      0   100%    9.5G    15M    99%
slot1@node066         4      0   100%    9.5G    15M    99%
slot1@node067         4      0   100%    9.5G    15M    99%
slot1@node068         4      0   100%    9.5G    15M    99%

Linux / x86-64 machines at sandhills.hcc.unl.edu
slot1@red-d11n12     16      1    93%     38G    46M    99%
slot1@red-d11n10     16      2    87%     38G    46M    99%
slot1@red-d11n11     16      2    87%     38G    46M    99%  2 Retiring
slot1@red-d11n13     16      2    87%     38G    46M    99%
slot1@red-d11n14     16      2    87%     38G    46M    99%  Owner
slot1@red-d11n15     16      2    87%     38G   640M    98%
slot1@red-d11n1      16      3    81%     38G    46M    99%  Draining

Windows / x86 machines at foo.hcc.unl.edu
slot1@red-d9n3       16      2    87%     38G    48M    99%
slot1@red-d9n4       16      2    87%     38G    48M    99%

QUEUE SUMMARY
ScheddCount  RunningJobs  IdleJobs  HeldJobs
          5         3000     10000         3

POOL SUMMARY
              Cpus  Avail  Util%  Memory  Avail  Util%  Owner  Matched  Preempting
Linux/x86-64   100      4    91%    300G   700M    99%      2        0           0
Windows/x86     32      4    87%     80G   100M    99%      0        0           0
Total          132      8    89%    380G   800M    99%      2        0           0

Notes:
- Copy/paste the above into a fixed-width-font window or you will go crazy. I didn't bother to make the summary correct - it's a made-up output, don't worry about it.
- I would really prefer the utilization numbers to be "live" - not the claimed / total, but actual (CPU used by Condor jobs) / (CPU claimed by Condor jobs). But that might be a bigger project.
- Adjust the width of the columns to the size of the data; only truncate when you hit terminal size limits.
- Everything gets units and 2 significant figures.
- The "Notes" column denotes states or activities that are not "Claimed/Busy" or "Avail/Idle".
- Group by OS, arch, and domain (if available). Sort in ascending order of idle slots.

Thoughts?

Brian
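
To make the arithmetic behind the draft output concrete, here is a minimal Python sketch of the grouping, sorting, and formatting rules in the notes above. It is only an illustration, not part of condor_status or any existing tool: the input is a hand-written list of dicts, and the key names (`name`, `os`, `arch`, `cpus`, `avail`) and the helper names (`fmt_mem`, `util_pct`, `summarize`, `sort_by_idle`) are all made up for this example. Util% here is the simple claimed / total figure, truncated to a whole percent, not the "live" number.

```python
from collections import defaultdict


def fmt_mem(mb):
    """Format a size given in megabytes with a unit and 2 significant figures."""
    value, unit = (mb / 1024.0, "G") if mb >= 1024 else (float(mb), "M")
    # Round to 2 significant figures, then print without an exponent.
    return f"{float(f'{value:.2g}'):g}{unit}"


def util_pct(total, avail):
    """Claimed / total as a whole percent (truncated); 0 for an empty total."""
    return 100 * (total - avail) // total if total else 0


def sort_by_idle(slots):
    """Sort machines in ascending order of idle (available) cores."""
    return sorted(slots, key=lambda s: s["avail"])


def summarize(slots):
    """Aggregate Cpus / Avail / Util% per (os, arch) group."""
    groups = defaultdict(lambda: {"cpus": 0, "avail": 0})
    for s in slots:
        g = groups[(s["os"], s["arch"])]
        g["cpus"] += s["cpus"]
        g["avail"] += s["avail"]
    return {
        key: {**g, "util": util_pct(g["cpus"], g["avail"])}
        for key, g in groups.items()
    }


# Hypothetical sample input, loosely modelled on rows of the draft output.
SLOTS = [
    {"name": "slot1@red-d11n10", "os": "Linux", "arch": "x86-64", "cpus": 16, "avail": 2},
    {"name": "slot1@node064", "os": "Linux", "arch": "x86-64", "cpus": 4, "avail": 0},
    {"name": "slot1@red-d11n12", "os": "Linux", "arch": "x86-64", "cpus": 16, "avail": 1},
]

if __name__ == "__main__":
    for s in sort_by_idle(SLOTS):
        print(f"{s['name']:<18} {s['cpus']:>4} {s['avail']:>5} "
              f"{util_pct(s['cpus'], s['avail']):>4}%")
    for (os_name, arch), g in summarize(SLOTS).items():
        print(f"{os_name}/{arch:<8} {g['cpus']:>4} {g['avail']:>5} {g['util']:>4}%")
```

Truncating rather than rounding the percentage is a choice made here so that a 16-core node with 1, 2, or 3 idle cores shows 93%, 87%, and 81%, as in the draft rows; a real tool could just as reasonably round.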