HTCondor Project List Archives




[Condor-devel] Replacing condor_status




OK, maybe I just used the subject line to get your attention, but I've honestly found condor_status less useful over time (as our cluster has grown and since we switched to p-slots).  I wanted to kick off a few ideas on the mailing list, and then I will file a ticket.

I want to use "condor_status" as a mechanism to get an overview of my pool.  I often have the following questions about the basic dimensions of the pool:
- How large is my pool?
- How full is my pool?
- How busy is the pool?  (I.e., if a slot is claimed, is it also fully utilized by the job?)
And questions about nodes in an abnormal state:
- Are there idle nodes (in the face of large queues)?
- Are there poorly utilized nodes?

I claim that condor_status does not provide meaningful answers for the above questions.

Here are some deficiencies I've noted:
1) Entirely too verbose for large pools.  Our site is 3k cores and growing.  No human can process that many lines of output.
2) Does not take into account the significant differences between p-slots and "traditional" slots.
  - State, LoadAv, Mem, and Activity time have different meanings for p-slots.
3) Does not advertise number of cores in the output.
4) Uses outdated terminology - "Machines" in the summary means "Slots".  Slots don't have a significant meaning when the slots are transient.
5) The "Name" column is truncated for every node I own.  I have a 300-character wide terminal, yet the Name is limited to about 30 characters!
6) The Arch and OS column is the same for every node I own; quite the waste of space.
7) I can't easily determine if the occupancy rate is reasonable for the current load.  A 50% occupied pool is "no problem" if there are no jobs in queue.  There might be a big problem if there are 10k jobs in queue.
8) No units on memory numbers!
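
To make item 7 concrete, here is a minimal Python sketch of reading occupancy against queue depth. The helper names (`startable_jobs`, `occupancy_note`) are hypothetical, and the sketch assumes every idle job needs the same number of cores and ignores matchmaking requirements entirely, so it is an illustration rather than a real diagnosis.

```python
def startable_jobs(avail_cpus, idle_jobs, cpus_per_job=1):
    """Upper bound on idle jobs that would fit on the free cores right now.

    Assumes every idle job needs `cpus_per_job` cores (an assumption made
    for illustration; real jobs vary and must also match the machine).
    """
    if cpus_per_job <= 0:
        raise ValueError("cpus_per_job must be positive")
    return min(idle_jobs, avail_cpus // cpus_per_job)


def occupancy_note(total_cpus, avail_cpus, idle_jobs):
    """One-line verdict that puts the occupancy rate next to the queue depth."""
    if total_cpus <= 0:
        raise ValueError("total_cpus must be positive")
    util = 100 * (total_cpus - avail_cpus) // total_cpus
    startable = startable_jobs(avail_cpus, idle_jobs)
    if startable == 0:
        return f"{util}% occupied; no idle jobs are waiting on free cores"
    return f"{util}% occupied; {startable} idle jobs could be running on free cores"


# Same 50% occupancy, very different situations:
print(occupancy_note(100, 50, 0))
print(occupancy_note(100, 50, 10000))
```

The same 50% figure produces two different messages depending on the queue, which is the distinction the plain occupancy number hides.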

I propose we do not touch condor_status - changing output has too high a chance of breaking user scripts.  Instead, I propose we have a new tool - "condor_pool_summary" - which addresses the needs of sysadmins (condor_status is plenty powerful for machine-readable needs).  No XML format, no stable output guarantees, maybe limited query semantics, not entirely grep-friendly.  I'd start with Matt's work here: http://spinningmatt.wordpress.com/2012/10/01/partitionable-slot-utilization/ and extend it in light of the comments above.

Here's a draft output:

MACHINE STATES
              Name  Cpus Avail Util%  Memory Avail Util% Notes
Linux / x86-64 machines at red.hcc.unl.edu
     slot1@node064     4     0  100%    9.5G   15M   99%
     slot1@node065     4     0  100%    9.5G   15M   99%
     slot1@node066     4     0  100%    9.5G   15M   99%
     slot1@node067     4     0  100%    9.5G   15M   99%
     slot1@node068     4     0  100%    9.5G   15M   99%
Linux / x86-64 machines at sandhills.hcc.unl.edu
  slot1@red-d11n12    16     1   93%     38G   46M   99%
  slot1@red-d11n10    16     2   87%     38G   46M   99% 
  slot1@red-d11n11    16     2   87%     38G   46M   99% 2 Retiring
  slot1@red-d11n13    16     2   87%     38G   46M   99%
  slot1@red-d11n14    16     2   87%     38G   46M   99% Owner
  slot1@red-d11n15    16     2   87%     38G  640M   98%
   slot1@red-d11n1    16     3   81%     38G   46M   99% Draining
Windows / x86 machines at foo.hcc.unl.edu
    slot1@red-d9n3    16     2   87%     38G   48M   99%
    slot1@red-d9n4    16     2   87%     38G   48M   99%

QUEUE SUMMARY
ScheddCount RunningJobs IdleJobs HeldJobs 
          5        3000    10000        3

POOL SUMMARY
               Cpus Avail Util%  Memory Avail Util% Owner Matched Preempting 
Linux/x86-64    100     4   91%    300G  700M   99%     2       0          0
Windows/x86      32     4   87%     80G  100M   99%     0       0          0
Total           132     8   89%    380G  800M   99%     2       0          0
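
The Util% values in the per-machine rows above are consistent with (total - avail) / total, truncated to a whole percent. A minimal Python sketch under that assumption follows; `util_percent` is a hypothetical helper, not part of any existing tool, and the summary rows are made up, so they will not necessarily agree with it.

```python
def util_percent(total, avail):
    """Percent of `total` currently in use, truncated to a whole percent.

    `total` and `avail` must be in the same unit (cores, or MB of memory).
    Truncation (rather than rounding) is an assumption chosen because it
    reproduces the per-machine rows of the draft output.
    """
    if total <= 0:
        return 0
    return int(100 * (total - avail) / total)


# CPU columns from the draft: 16 cores with 1, 2, or 3 available.
print(util_percent(16, 1), util_percent(16, 2), util_percent(16, 3))

# Memory columns, expressed in MB: 9.5G total with 15M available.
print(util_percent(9.5 * 1024, 15))
```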

Notes:
- Copy/paste above into a fixed-width-font window or you will go crazy.  I didn't bother to make the summary correct - it's a made-up output, don't worry about it.
- I would really prefer the utilization numbers to be "live" - not the claimed / total, but actual (CPU used by Condor jobs) / (CPU claimed by Condor jobs).  But that might be a bigger project.
- Adjust the width of the columns to the size of the data; only truncate when you hit terminal size limits.
- Everything gets units and 2 significant figures.
- The "Notes" column denotes states or activities that are not "Claimed/Busy" or "Avail/Idle".
- Group by OS, arch, and domain (if available).  Sort in ascending order of idle slots.
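
The formatting rules in these notes could be sketched in Python roughly as below. Everything here is an assumption for illustration: the helper names are hypothetical, machines are plain dicts rather than real ClassAds, the keys ("OpSys", "Arch", "Domain", "Name", "Avail") are placeholders, and memory is assumed to arrive in MB.

```python
import math
import shutil


def round_sig(x, sig=2):
    """Round x to `sig` significant figures."""
    if x == 0:
        return 0.0
    digits = sig - 1 - math.floor(math.log10(abs(x)))
    return round(float(x), digits)


def fmt_mem(mb):
    """Format a size given in MB with a unit suffix and 2 significant figures."""
    units = ["M", "G", "T"]
    value, i = float(mb), 0
    while value >= 1024 and i < len(units) - 1:
        value /= 1024
        i += 1
    value = round_sig(value)
    if value == 0:
        return "0" + units[i]
    # Values below 10 keep one decimal so both significant figures show.
    return (f"{value:.0f}" if value >= 10 else f"{value:.1f}") + units[i]


def group_and_sort(machines):
    """Group by (OS, arch, domain); within a group, fewest idle cores first."""
    groups = {}
    for m in machines:
        key = (m["OpSys"], m["Arch"], m.get("Domain", ""))
        groups.setdefault(key, []).append(m)
    return {
        key: sorted(groups[key], key=lambda m: (m["Avail"], m["Name"]))
        for key in sorted(groups)
    }


def name_width(names, reserved, terminal_cols=None):
    """Name column width: fit the longest name, truncate only at the terminal edge.

    `reserved` is the number of columns the other fields need.
    """
    if terminal_cols is None:
        terminal_cols = shutil.get_terminal_size().columns
    longest = max((len(n) for n in names), default=0)
    return max(1, min(longest, terminal_cols - reserved))


print(fmt_mem(9728), fmt_mem(640), fmt_mem(15))
```

On a 300-column terminal `name_width` never truncates the names in the draft output; it only shortens them once the terminal is too narrow to hold the other columns as well.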

Thoughts?

Brian
