
Re: [Condor-devel] Replacing condor_status



On 10/11/2012 09:51 AM, Brian Bockelman wrote:

Ok, maybe I just used the subject line to get your attention - but I've honestly found condor_status less useful over time (as our cluster grows and since we switched to p-slots).  I wanted to kick off a few ideas on the mailing list, and will then file a ticket.

I want to use "condor_status" as a mechanism to get an overview of my pool.  I often have the following questions about the basic dimensions of the pool:
- How large is my pool?
- How full is my pool?
- How busy is the pool?  (I.e., if a slot is claimed, is it also fully utilized by the job?)
And questions about nodes in abnormal state:
- Are there idle nodes (in the face of large queues)?
- Are there poorly utilized nodes?

I claim that condor_status does not provide meaningful answers to the above questions.

Here are some deficiencies I've noted:
1) Entirely too verbose for large pools.  Our site is 3k cores and growing.  No human can process that many lines of output.
2) Does not take into account the significant differences between p-slots and "traditional" slots.
  - State, LoadAv, Mem, and Activity time have different meanings for p-slots.
3) Does not advertise number of cores in the output.
4) Uses outdated terminology - "Machines" in the summary actually means "Slots", and slot counts don't mean much when the slots are transient.
5) The "Name" column is truncated for every node I own.  I have a 300-character-wide terminal, yet the Name is limited to about 30 characters!
6) The Arch and OS columns are the same for every node I own; quite the waste of space.
7) I can't easily determine if the occupancy rate is reasonable for the current load.  A 50% occupied pool is "no problem" if there are no jobs in queue.  There might be a big problem if there are 10k jobs in queue.
8) No units on memory numbers!

I propose we do not touch condor_status - changing output has too high a chance of breaking user scripts.  Instead, I propose we have a new tool - "condor_pool_summary" - which addresses the needs of sysadmins (condor_status is plenty powerful for machine-readable needs).  No XML format, no stable output guarantees, maybe limited query semantics, not entirely grep-friendly.  I'd start with Matt's work here: http://spinningmatt.wordpress.com/2012/10/01/partitionable-slot-utilization/ and extend it in light of the comments above.

Here's a draft output:

MACHINE STATES
              Name  Cpus Avail Util%  Memory Avail Util% Notes
Linux / x86-64 machines at red.hcc.unl.edu
     slot1@node064     4     0  100%    9.5G   15M   99%
     slot1@node065     4     0  100%    9.5G   15M   99%
     slot1@node066     4     0  100%    9.5G   15M   99%
     slot1@node067     4     0  100%    9.5G   15M   99%
     slot1@node068     4     0  100%    9.5G   15M   99%
Linux / x86-64 machines at sandhills.hcc.unl.edu
  slot1@red-d11n12    16     1   93%     38G   46M   99%
  slot1@red-d11n10    16     2   87%     38G   46M   99% 
  slot1@red-d11n11    16     2   87%     38G   46M   99% 2 Retiring
  slot1@red-d11n13    16     2   87%     38G   46M   99%
  slot1@red-d11n14    16     2   87%     38G   46M   99% Owner
  slot1@red-d11n15    16     2   87%     38G  640M   98%
   slot1@red-d11n1    16     3   81%     38G   46M   99% Draining
Windows / x86 machines at foo.hcc.unl.edu
    slot1@red-d9n3    16     2   87%     38G   48M   99%
    slot1@red-d9n4    16     2   87%     38G   48M   99%

QUEUE SUMMARY
ScheddCount RunningJobs IdleJobs HeldJobs 
          5        3000    10000        3

POOL SUMMARY
               Cpus Avail Util%  Memory Avail Util% Owner Matched Preempting 
Linux/x86-64    100     4   91%    300G  700M   99%     2       0          0
Windows/x86      32     4   87%     80G  100M   99%     0       0          0
Total           132     8   89%    380G  800M   99%     2       0          0

Notes:
- Copy/paste the above into a fixed-width-font window or you will go crazy.  I didn't bother to make the summary numbers consistent - it's made-up output, don't worry about it.
- I would really prefer the utilization numbers to be "live" - not the claimed / total, but actual (CPU used by Condor jobs) / (CPU claimed by Condor jobs).  But that might be a bigger project.
- Adjust the width of the columns to the size of the data; only truncate when you hit terminal size limits.
- Everything gets units and 2 significant figures.
- The "Notes" column denotes states or activities that are not "Claimed/Busy" or "Avail/Idle".
- Group by OS, arch, and domain (if available).  Sort in ascending order of idle slots.
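
To make this concrete, here's a very rough sketch of the aggregation such a tool could do, using the htcondor Python bindings.  Treat all of it as assumptions to argue about rather than a design - the p-slot attribute handling (Cpus/Memory on a p-slot ad being the unclaimed remainder, TotalSlotCpus/TotalSlotMemory the full size), the grouping key, and the column layout are just my best guesses:

#!/usr/bin/env python
# Rough sketch only - NOT a design.  Attribute handling for p-slots is my
# reading of current startd ads; adjust to taste.

import collections
import htcondor

ATTRS = ["Name", "OpSys", "Arch", "State", "PartitionableSlot",
         "DynamicSlot", "Cpus", "TotalSlotCpus", "Memory", "TotalSlotMemory"]

def pool_summary():
    ads = htcondor.Collector().query(htcondor.AdTypes.Startd, projection=ATTRS)
    groups = collections.defaultdict(lambda: dict(cpus=0, avail=0, mem=0, amem=0))
    for ad in ads:
        if ad.get("DynamicSlot", False):
            continue  # the parent p-slot already accounts for these cores
        if ad.get("PartitionableSlot", False):
            # On a p-slot ad, Cpus/Memory are the *unclaimed* remainder.
            cpus, acpus = ad.get("TotalSlotCpus", 0), ad.get("Cpus", 0)
            mem, amem = ad.get("TotalSlotMemory", 0), ad.get("Memory", 0)
        else:
            # Static slots count as available only while Unclaimed.
            cpus, mem = ad.get("Cpus", 1), ad.get("Memory", 0)
            idle = ad.get("State") == "Unclaimed"
            acpus, amem = (cpus, mem) if idle else (0, 0)
        # Group by OS / arch; a domain split could be added on top of this.
        g = groups["%s/%s" % (ad.get("OpSys", "?"), ad.get("Arch", "?"))]
        g["cpus"] += cpus
        g["avail"] += acpus
        g["mem"] += mem
        g["amem"] += amem

    print("%-16s %5s %6s %5s %8s %8s %5s" %
          ("", "Cpus", "Avail", "Util%", "MemMB", "AvailMB", "Util%"))
    for key, g in sorted(groups.items()):
        cutil = 100.0 * (g["cpus"] - g["avail"]) / g["cpus"] if g["cpus"] else 0.0
        mutil = 100.0 * (g["mem"] - g["amem"]) / g["mem"] if g["mem"] else 0.0
        # (A "live" Util% would instead compare LoadAvg to claimed cores.)
        print("%-16s %5d %6d %4.0f%% %8d %8d %4.0f%%" %
              (key, g["cpus"], g["avail"], cutil, g["mem"], g["amem"], mutil))

if __name__ == "__main__":
    pool_summary()

Sorting the per-machine rows by available cores within each group would then match the "ascending order of idle slots" note above.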

Thoughts?

Brian




These are cool ideas. Since the plumage contrib already captures the raw classads, we developed a similar approach: a pymongo script that gives a quick admin summary (also influenced by some of Matt's condor_status adventures).

# ./plumage_utilization
Local calculation of slot totals...
Total Slots: 306
Used: 254
Unused: 50
Owner: 2
Utilization: 83.55%

# ./plumage_utilization -f '2012-09-27 12:00' -t '2012-09-27 12:14' -S
TIMESTAMP                      TOTAL    USED     UNUSED   OWNER    EFF %  
2012-09-27 16:00:11.419000     334      161      171      2        48.49
2012-09-27 16:01:11.006000     334      164      168      2        49.40
2012-09-27 16:02:12.311000     334      138      193      3        41.69
2012-09-27 16:03:12.460000     334      97       234      3        29.31
2012-09-27 16:04:12.518000     334      190      142      2        57.23
2012-09-27 16:05:12.180000     334      108      223      3        32.63
2012-09-27 16:06:12.066000     334      98       232      4        29.70
2012-09-27 16:07:12.637000     334      92       238      4        27.88
2012-09-27 16:08:12.104000     334      92       240      2        27.71
2012-09-27 16:09:12.841000     334      93       239      2        28.01
2012-09-27 16:10:12.089000     334      290      42       2        87.35
2012-09-27 16:11:12.004000     334      301      30       3        90.94
2012-09-27 16:12:13.116000     334      186      145      3        56.19
2012-09-27 16:13:13.103000     334      108      224      2        32.53
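
For the curious, the query side is nothing fancy - roughly the shape below.  The database/collection names (condor_raw.machines) and the one-document-per-slot-ad layout are assumptions about a particular plumage setup rather than the actual script, so adjust to your install:

#!/usr/bin/env python
# Sketch only: assumes plumage stores one document per startd slot ad in
# condor_raw.machines with a "State" field - adjust names for your setup.

import pymongo

def utilization(host="localhost", port=27017):
    slots = pymongo.MongoClient(host, port)["condor_raw"]["machines"]
    total = slots.count_documents({})
    used = slots.count_documents({"State": "Claimed"})
    owner = slots.count_documents({"State": "Owner"})
    unused = total - used - owner

    print("Total Slots: %d" % total)
    print("Used: %d" % used)
    print("Unused: %d" % unused)
    print("Owner: %d" % owner)
    # Same convention as the output above: Owner slots are excluded
    # from the percentage, so Utilization = Used / (Used + Unused).
    denom = used + unused
    print("Utilization: %.2f%%" % (100.0 * used / denom if denom else 0.0))

if __name__ == "__main__":
    utilization()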


\Pete

-- 
Peter MacKinnon
Cloud BU/MRG Grid
Red Hat Inc.
Raleigh, NC