
Re: [Condor-devel] Rethinking resource reporting for jobs



On 01/09/2012 10:04 AM, Dan Bradley wrote:

On 1/7/12 2:01 PM, Brian Bockelman wrote:
Hi all,

After seeing a very high-profile user's jobs get nuked (the ones
that produce the final pretty plots for CMS) because I can't
enforce reasonable limits on RAM usage, I think it's a good time to
start thinking about how metrics are reported for running jobs.

Complaint 1: Condor places an emphasis on ImageSize, an almost
meaningless metric.

Agreed.

What I see people really wanting is slots sized to their current job. ImageSize allows a rough approximation by treating the slot size as an upper bound on the job's memory use.


Condor reports the sum of the VSS of the processes in the job as
the ImageSize, and displays ImageSize prominently (for example, in
the output of condor_q).  ImageSize has little value, especially on
64-bit Linux with modern applications, where things like mmap'ing
files are aggressively used.  Because glibc mmaps locale files on
64-bit platforms, you're almost guaranteed that every process has a
VSS of at least ~32MB.  And because ptmalloc2 (glibc's default
malloc implementation) does not release VSS, RAM usage and VSS
typically become less related the longer a job runs.
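
To see how weak the relationship is, here's a minimal sketch (Linux-only,
reading /proc/self/status) that prints a process's VSS next to its RSS:

// vss_vs_rss.cpp -- minimal sketch (Linux only): print this process's
// VmSize (VSS) and VmRSS (RSS) from /proc/self/status, to show how far
// apart the two numbers can be even for a trivial program.
#include <fstream>
#include <iostream>
#include <string>

int main() {
    std::ifstream status("/proc/self/status");
    std::string line;
    while (std::getline(status, line)) {
        // VmSize is the total virtual size; VmRSS is what's resident in RAM.
        if (line.compare(0, 7, "VmSize:") == 0 ||
            line.compare(0, 6, "VmRSS:") == 0) {
            std::cout << line << "\n";
        }
    }
    return 0;
}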

Most administrators actually want to see how much RAM a job is
using, or how much RAM+swap a job is using - i.e., what host
resources the job is consuming, not the virtualized resource.  This
is best measured using cgroups, or estimated via PSS.
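
As a rough sketch of both approaches (Linux-only; the cgroup path assumes
a v1 memory controller mounted at /sys/fs/cgroup/memory, and the group
name is whatever the starter would create):

// ram_estimate.cpp -- sketch of the two measurements mentioned above.
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <string>
#include <unistd.h>

// Estimate RAM use of one process by summing the "Pss:" lines of
// /proc/<pid>/smaps.  PSS charges shared pages fractionally, so summing
// it across all processes in a job does not double-count shared memory.
long pss_kb(int pid) {
    std::ifstream smaps("/proc/" + std::to_string(pid) + "/smaps");
    std::string line;
    long total = 0;
    while (std::getline(smaps, line)) {
        // lines look like "Pss:                 123 kB"
        if (line.compare(0, 4, "Pss:") == 0)
            total += std::strtol(line.c_str() + 4, nullptr, 10);
    }
    return total;
}

// Read the figure the kernel itself charges to a cgroup; with all of a
// job's processes placed in one cgroup, this is the accurate measurement.
long cgroup_ram_bytes(const std::string& group) {
    std::ifstream f("/sys/fs/cgroup/memory/" + group +
                    "/memory.usage_in_bytes");
    long bytes = -1;
    f >> bytes;
    return bytes;
}

int main() {
    std::cout << "PSS of this process: " << pss_kb(getpid()) << " kB\n";
}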

Side Note: proc_family.cpp has the following code, meaning that the
ImageSize attribute on Win32 is actually the RSS; whoops!  Can we
implement the same "bug" for Linux?
#if defined(WIN32)
                 // ...
                 // Windows: sums each process's resident set size,
                 // so "ImageSize" here is really the total RSS
                 imgsize += member->m_proc_info->rssize;
#else
                 // everywhere else: sums the virtual image size (VSS)
                 imgsize += member->m_proc_info->imgsize;
#endif


Complaint 2: The schedd writes the following mostly useless updates
into the user log:

<c>
     <a n="MyType"><s>JobImageSizeEvent</s></a>
     <a n="Cluster"><i>85516</i></a>
     <a n="EventTypeNumber"><i>6</i></a>
     <a n="EventTime"><s>2012-01-04T17:24:37</s></a>
     <a n="Proc"><i>0</i></a>
     <a n="Subproc"><i>0</i></a>
     <a n="Size"><i>13000472</i></a>
     <a n="CurrentTime"><e>time()</e></a>
</c>


Why do I say "useless"?  I get no indication of the *resource
usage* of the job.  Maybe in the single-core, 32-bit Linux days
this made sense, but I'd really like to see:
1) RAM usage.
2) CPU usage (think multicore jobs or CPU efficiency).
3) Disk usage.
4) Number of running processes.

5) Network usage.
6) Removal or expansion of time().
7) Limits used (maybe a job can chirp that it is done with one).


I currently get none of these things in the user log.

CPU usage reporting is another unholy mess when querying the
schedd.  As far as I know, there's no way to answer "what's the CPU
efficiency of the currently running vanilla-universe job, for this
current invocation?"  I will buy someone beers if they can answer
this and not be ashamed to put it into the manual.
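
The closest thing I can construct from today's attributes is below, and
it shows exactly why the question is unanswerable: as far as I can tell,
RemoteUserCpu and RemoteSysCpu are cumulative over *all* runs, while the
wall clock from JobCurrentStartDate covers only the current invocation
(a sketch, with values assumed to be pulled from the job ad, e.g. via
condor_q):

// cpu_efficiency.cpp -- sketch of the best approximation available,
// which is wrong for any job that has restarted: the CPU totals span
// every invocation, the wall clock only the current one.
#include <ctime>
#include <iostream>

double cpu_efficiency(double remote_user_cpu,   // RemoteUserCpu
                      double remote_sys_cpu,    // RemoteSysCpu
                      long   job_current_start, // JobCurrentStartDate
                      int    cpus)              // cores in the slot
{
    double wall = std::difftime(std::time(nullptr),
                                (std::time_t)job_current_start);
    if (wall <= 0) return 0.0;
    return (remote_user_cpu + remote_sys_cpu) / (wall * cpus);
}

int main() {
    // e.g. 3500s of CPU over ~an hour of wall time on one core
    std::cout << cpu_efficiency(3400.0, 100.0,
                                (long)std::time(nullptr) - 3600, 1)
              << "\n";
}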

Proposal:
1) Deprecate the current ResidentSetSize and ProportionalSetSizeKb
attributes.  Replace them with a single attribute called
"RAMUsageKb", which is meant to be Condor's best estimate of the
RAM used by the job (in order of increasing accuracy: RSS, PSS, or
a cgroups measurement).

It seems useful to me to have a "best estimate", but also to have metrics
such as VSIZE, RSS, and PSS that may serve their own purposes.

--Dan

+1 have them all, with clear definitions


I'd also suggest we seriously consider deprecating ImageSize, or at
least remove it from prominence in tool output, documentation, and
the default Requirements expression.  If it can be measured,
also report SwapUsageKb.
2) Deprecate the JobImageSizeEvent, and introduce a JobUpdateEvent.
This event will contain information about the resource usage of the
job on the host (the things I listed above, plus any new
measurements that we might be able to do in the future).

+1
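
For illustration only, a JobUpdateEvent in the same user-log XML style
might carry something like the following (the attribute names and values
are hypothetical, not an implemented event):

<c>
     <a n="MyType"><s>JobUpdateEvent</s></a>
     <a n="Cluster"><i>85516</i></a>
     <a n="Proc"><i>0</i></a>
     <a n="Subproc"><i>0</i></a>
     <a n="EventTime"><s>2012-01-04T17:24:37</s></a>
     <a n="RAMUsageKb"><i>1048576</i></a>
     <a n="SwapUsageKb"><i>0</i></a>
     <a n="CpusUsage"><r>1.73</r></a>
     <a n="DiskUsageKb"><i>2097152</i></a>
     <a n="NumProcesses"><i>4</i></a>
</c>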

3) Have a consistent, well-defined set of resource-usage attributes
in the schedd.  In particular, these should be well-defined with
respect to:
   * When they are updated (during the job run, or during state
     transitions?).
   * Whether they are cumulative over all job runs, or only reflect
     the most current one.
   * Units used (kudos to Dan B for calling it ProportionalSetSizeKb!).
   * Whether it is the maximum recorded value (ImageSize) or the
     latest recorded value (ResidentSetSize).
  Bonus beers for placing this knowledge into a table in the manual.
  Another bonus if the above information can be encoded into the
  attribute names themselves.
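
For example, a (purely hypothetical) scheme that encodes the statistic,
scope, and units directly in the name:

   PeakResidentSetSizeKb       (maximum over the current run, in KB)
   CurrentResidentSetSizeKb    (latest sampled value, in KB)
   CumulativeRemoteCpuSecs     (summed over all runs, in seconds)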

As our systems become more constrained by memory than by CPU, I
think having consistent, useful resource reporting metrics will be
effort well-invested for the project.

When you say "in the schedd", do you mean "on the job", or are you thinking about some sort of roll-up into schedd stats?

Best,


matt