[HTCondor-users] htcondor-sysview
As shown at HTCondor Week 2013, the UW CHTC and mwt2.org present:
htcondor-sysview
https://github.com/DHTC-Tools/htcondor-sysview
-Nate
SYSVIEW README
htcondor-sysview is an efficiency monitor for HTCondor pools and jobs.
05.01.2013: 1.13 release. Originally written as Mosaic Sysview by Charles
Waldman, Sarah Williams, and Rob Gardner of MWT2.org. Modified by
Rebekah Gietzel (bgietzel@xxxxxxxxxxx) to work with the UW-Madison
CHTC pool and HTCondor features including partitionable slots, multiple
pools, and submitters. Packaged by Nate Yehle (nyehle@xxxxxxxxxxx).
This 1.13 release should work with most HTCondor pool configs, including
static and partitionable slots.
The program draws a grid of the CPUs in HTCondor pools. Each CPU (core) is
one square on the grid. Nodes produce squares based on the number of CPUs
listed in their names in the nodes.list file. Jobs are displayed on a slot
basis and map 1:1 by default. When partitionable slots are used, each
square represents a slot. The color of each square indicates the status of
that core and/or node, as described below.
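
As a rough illustration of how the grid could be built from nodes.list,
the sketch below assumes a hypothetical format in which each line gives a
hostname followed by its CPU count; the actual nodes.list layout used by
htcondor-sysview (where the count may be encoded in the node name) may
differ.

    # Minimal sketch: expand a nodes.list file into one grid square per core.
    # Assumes a hypothetical "hostname ncpus" line format; the real
    # nodes.list layout may differ.
    def load_squares(path="nodes.list"):
        squares = []
        with open(path) as fh:
            for line in fh:
                fields = line.split()
                if len(fields) < 2:
                    continue  # skip blank or malformed lines
                host, ncpus = fields[0], int(fields[1])
                # one square per core, keyed by a slot-style name
                squares.extend(f"slot{i}@{host}" for i in range(1, ncpus + 1))
        return squares

    if __name__ == "__main__":
        for name in load_squares():
            print(name)
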
Red squares are slots where sysview detects that an HTCondor startd is not
running correctly.
Efficiency is computed as cputime/walltime of the job running on a slot.
Green squares are slots where efficient jobs are running.
Blue squares are slots where inefficient jobs are running.
Lighter green or blue squares are new jobs trending efficient or
inefficient, respectively. As the jobs age and the cputime/walltime ratio
stabilizes, the colors darken.
Other multicolored squares are jobs using more than 100% efficiency, such
as multicore jobs. Each is represented by only one square, showing how one
multicore job prevents other jobs from using the total number of slots.
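
The sketch below shows the efficiency calculation described above (cputime
divided by walltime) and one possible way to map it onto the colors; the
efficiency cutoff and the "young job" age used here are illustrative
values, not the ones sysview actually uses.

    # Sketch of the efficiency calculation: cputime / walltime for the job
    # on a slot. The color threshold and age cutoff are illustrative only.
    def efficiency(cputime, walltime):
        return cputime / walltime if walltime > 0 else 0.0

    def square_color(cputime, walltime, job_age,
                     efficient_cutoff=0.8, young_seconds=3600):
        eff = efficiency(cputime, walltime)
        base = "green" if eff >= efficient_cutoff else "blue"
        # young jobs get a lighter shade until the cpu/wall ratio stabilizes
        return f"light{base}" if job_age < young_seconds else base

    # Example: a job 30 minutes old that has used 25 minutes of CPU time
    print(square_color(cputime=1500, walltime=1800, job_age=1800))  # lightgreen
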
Once you have a mosaic generated from information about your cluster, try
dragging the mouse across a slot on the mosaic.
Mousing over a square shows the slot name, user, online/down status,
RSS/VM memory, CPU time, and efficiency for the current job on that slot.
We use this as an easy way to spot down nodes or low-efficiency jobs that
are wasting slots. Clicking a slot takes you to the full dump of
condor_q -l for the job running on that core.
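
For reference, the per-job details shown on mouseover and click can be
pulled from the same long-form condor_q -l output. The sketch below uses
standard job ClassAd attributes (Owner, RemoteUserCpu, RemoteWallClockTime,
ResidentSetSize) with deliberately simple parsing; it is not the code
sysview itself uses, and the job id is hypothetical.

    # Sketch: fetch a job's ClassAd via condor_q -l and print a few of the
    # attributes that sysview surfaces on mouseover.
    import subprocess

    def job_ad(job_id):
        out = subprocess.run(["condor_q", "-l", job_id],
                             capture_output=True, text=True, check=True).stdout
        ad = {}
        for line in out.splitlines():
            if " = " in line:
                key, value = line.split(" = ", 1)
                ad[key.strip()] = value.strip().strip('"')
        return ad

    if __name__ == "__main__":
        ad = job_ad("1234.0")  # hypothetical cluster.proc id
        for attr in ("Owner", "RemoteUserCpu", "RemoteWallClockTime",
                     "ResidentSetSize"):
            print(attr, "=", ad.get(attr, "n/a"))
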