OK, maybe I just used the subject line to get your
attention, but I've honestly found condor_status less useful
over time (as our cluster has grown, and since we switched to
p-slots). I wanted to kick off a few ideas on the mailing list,
and will then file a ticket.
I want to use "condor_status" as a mechanism to get an
overview of my pool. I often have the following questions
about the basic dimensions of the pool:
- How large is my pool?
- How full is my pool?
- How busy is the pool? (I.e., if a slot is claimed, is
it also fully utilized by the job?)
And questions about nodes in abnormal state:
- Are there idle nodes (in the face of large queues)?
- Are there poorly utilized nodes?
I claim that condor_status does not provide meaningful
answers for the above questions.
Here are some deficiencies I've noted:
1) Entirely too verbose for large pools. Our site is 3k
cores and growing. No human can process that many lines
of output.
2) Does not take into account the significant differences
between p-slots and "traditional" slots.
- State, LoadAv, Mem, and Activity time have different
meanings for p-slots.
3) Does not advertise number of cores in the output.
4) Uses outdated terminology - "Machines" in the summary
means "Slots". Slots don't have a significant meaning when
the slots are transient.
5) The "Name" column is truncated for every node I own.
I have a 300-character wide terminal, yet the Name is limited
to about 30 characters!
6) The Arch and OS column is the same for every node I
own; quite the waste of space.
7) I can't easily determine if the occupancy rate is
reasonable for the current load. A 50% occupied pool is "no
problem" if there are no jobs in queue. There might be a big
problem if there are 10k jobs in queue.
8) No units on memory numbers!
I propose we do not touch condor_status - changing output
has too high a chance of breaking user scripts. Instead, I
propose we have a new tool - "condor_pool_summary" - which
addresses the needs of sysadmins (condor_status is plenty
powerful for machine-readable needs). No XML format, no
stable output guarantees, maybe limited query semantics, not
entirely grep-friendly. I'd start with Matt's work here: http://spinningmatt.wordpress.com/2012/10/01/partitionable-slot-utilization/ and
extend it in light of the comments above.
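As a rough illustration of the per-machine numbers such a tool would need, here is a minimal Python sketch. It assumes each partitionable slot ad has already been fetched into a plain dict, and that TotalSlotCpus / TotalSlotMemory hold the slot's totals while Cpus / Memory (in MB) hold what is still unclaimed. The function names and the sample values are hypothetical, and percentages are truncated rather than rounded.

```python
# Sketch only: summarize partitionable-slot ads held as plain dicts.
# Assumption: TotalSlotCpus/TotalSlotMemory are the p-slot totals and
# Cpus/Memory (MB) are the unclaimed remainder on that p-slot.

def percent_used(total, avail):
    """Claimed share of a resource as a whole percent (truncated)."""
    return 100 * (total - avail) // total if total else 0

def machine_row(ad):
    """Build one per-machine summary row from a p-slot ad."""
    return {
        "Name": ad["Name"],
        "Cpus": ad["TotalSlotCpus"],
        "CpusAvail": ad["Cpus"],
        "CpuUtil": percent_used(ad["TotalSlotCpus"], ad["Cpus"]),
        "Memory": ad["TotalSlotMemory"],
        "MemAvail": ad["Memory"],
        "MemUtil": percent_used(ad["TotalSlotMemory"], ad["Memory"]),
    }

def pool_totals(rows):
    """Add up the per-machine rows into one pool-wide summary."""
    cpus = sum(r["Cpus"] for r in rows)
    cpus_avail = sum(r["CpusAvail"] for r in rows)
    mem = sum(r["Memory"] for r in rows)
    mem_avail = sum(r["MemAvail"] for r in rows)
    return {
        "Cpus": cpus,
        "CpusAvail": cpus_avail,
        "CpuUtil": percent_used(cpus, cpus_avail),
        "Memory": mem,
        "MemAvail": mem_avail,
        "MemUtil": percent_used(mem, mem_avail),
    }

# Made-up sample ads (values are illustrative, not real pool data).
SAMPLE_ADS = [
    {"Name": "slot1@node064", "TotalSlotCpus": 4, "Cpus": 0,
     "TotalSlotMemory": 9728, "Memory": 15},
    {"Name": "slot1@red-d11n12", "TotalSlotCpus": 16, "Cpus": 1,
     "TotalSlotMemory": 38912, "Memory": 46},
]

if __name__ == "__main__":
    rows = [machine_row(ad) for ad in SAMPLE_ADS]
    for row in rows:
        print(row)
    print(pool_totals(rows))
```

This only covers the claimed / total view of utilization; anything based on actual CPU use by jobs would need additional data.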
Here's a draft output:
MACHINE STATES
Name              Cpus  Avail  Util%  Memory  Avail  Util%  Notes
slot1@node064        4      0   100%    9.5G    15M    99%
slot1@node065        4      0   100%    9.5G    15M    99%
slot1@node066        4      0   100%    9.5G    15M    99%
slot1@node067        4      0   100%    9.5G    15M    99%
slot1@node068        4      0   100%    9.5G    15M    99%
slot1@red-d11n12    16      1    93%     38G    46M    99%
slot1@red-d11n10    16      2    87%     38G    46M    99%
slot1@red-d11n11    16      2    87%     38G    46M    99%  2 Retiring
slot1@red-d11n13    16      2    87%     38G    46M    99%
slot1@red-d11n14    16      2    87%     38G    46M    99%  Owner
slot1@red-d11n15    16      2    87%     38G   640M    98%
slot1@red-d11n1     16      3    81%     38G    46M    99%  Draining
slot1@red-d9n3      16      2    87%     38G    48M    99%
slot1@red-d9n4      16      2    87%     38G    48M    99%

QUEUE SUMMARY
ScheddCount  RunningJobs  IdleJobs  HeldJobs
          5         3000     10000         3

POOL SUMMARY
              Cpus  Avail  Util%  Memory  Avail  Util%  Owner  Matched  Preempting
Linux/x86-64   100      4    91%    300G   700M    99%      2        0           0
Windows/x86     32      4    87%     80G   100M    99%      0        0           0
Total          132      8    89%    380G   800M    99%      2        0           0

Notes:
- Copy/paste above into a fixed-width-font window or you
will go crazy. I didn't bother to make the summary correct -
it's a made-up output, don't worry about it.
- I would really prefer the utilization numbers to be
"live" - not the claimed / total, but actual (CPU used by
Condor jobs) / (CPU claimed by Condor jobs). But that might
be a bigger project.
- Adjust the width of the columns to the size of the
data; only truncate when you hit terminal size limits.
- Everything gets units and 2 significant figures.
- The "Notes" column denotes states or activities that
are not "Claimed/Busy" or "Avail/Idle".
- Group by OS, arch, and domain (if available). Sort in
ascending order of idle slots.
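The formatting rules in the notes above (units on everything, 2 significant figures, column widths sized to the data, sorting by ascending idle count) can be sketched in a few lines of Python. This is only one possible reading of those rules: the function names are hypothetical, memory is assumed to arrive in MB, and truncation at the terminal width is left out.

```python
# Sketch only: formatting helpers for a summary table.
# Assumption: memory values arrive in MB; the row data below is made up.

def format_memory(mb):
    """Render a size in MB with a unit suffix and 2 significant figures."""
    value = float(mb)
    unit = "M"
    for unit in ("M", "G", "T"):
        if value < 1024 or unit == "T":
            break
        value /= 1024
    rounded = float(f"{value:.2g}")  # keep 2 significant figures
    return f"{rounded:g}{unit}"      # :g drops a trailing ".0"

def sort_by_idle(rows):
    """Sort rows in ascending order of available (idle) CPUs."""
    return sorted(rows, key=lambda row: row[2])

def render_table(header, rows):
    """Size each column to its widest cell; names left, numbers right."""
    table = [list(header)] + [[str(cell) for cell in row] for row in rows]
    widths = [max(len(line[i]) for line in table) for i in range(len(header))]
    lines = []
    for line in table:
        cells = [line[0].ljust(widths[0])]
        cells += [cell.rjust(w) for cell, w in zip(line[1:], widths[1:])]
        lines.append("  ".join(cells).rstrip())
    return "\n".join(lines)

# (name, total CPUs, available CPUs, total memory MB, available memory MB)
ROWS = [
    ("slot1@red-d11n1", 16, 3, 38912, 46),
    ("slot1@node064", 4, 0, 9728, 15),
    ("slot1@red-d11n15", 16, 2, 38912, 640),
]

if __name__ == "__main__":
    display = [
        (name, cpus, avail, format_memory(mem), format_memory(mem_avail))
        for name, cpus, avail, mem, mem_avail in sort_by_idle(ROWS)
    ]
    print(render_table(["Name", "Cpus", "Avail", "Memory", "Avail"], display))
```

Sizing the columns from the data avoids the fixed-width truncation of the Name column complained about earlier, at the cost of output whose layout shifts from run to run.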
Thoughts?
Brian