HTCondor Project List Archives




[Condor-devel] Rethinking resource reporting for jobs



Hi all,

After seeing a very high profile user's jobs get nuked (the ones that produce the final pretty plots for CMS) because I can't enforce reasonable limits on RAM usage, I think it's a reasonable time to start thinking about how metrics are reported for running jobs.

Complaint 1: Condor places an emphasis on ImageSize, an almost meaningless metric.
Condor reports the sum of the VSS of the processes in the job as the ImageSize, and displays ImageSize prominently (for example, in the output of condor_q).  ImageSize has little value, especially for modern applications on 64-bit Linux, where things like mmap'ing files are used aggressively.  Because glibc mmaps language files on 64-bit platforms, you're almost guaranteed that every process is at least ~32MB.  As ptmalloc2 (glibc's default malloc implementation) does not release VSS, RAM usage and VSS typically become less related the longer a job runs.

Most administrators actually want to see how much RAM a job is using, or how much RAM+swap a job is using - i.e., what host resources a job is using, not the virtualized resource.  This is best measured using cgroups or estimated via PSS.
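
To make the VSS / RSS / PSS distinction concrete, here is a minimal sketch that sums the Size, Rss and Pss fields from text in the format of Linux's /proc/<pid>/smaps. The function name sum_smaps_kb and the sample mappings are invented for illustration; this is not how Condor itself collects these numbers.

```python
def sum_smaps_kb(smaps_text):
    """Sum the Size (VSS), Rss and Pss fields, in kB, over all mappings."""
    totals = {"Size": 0, "Rss": 0, "Pss": 0}
    for line in smaps_text.splitlines():
        key, sep, rest = line.partition(":")
        if sep and key in totals:
            fields = rest.split()
            # Field lines look like "Pss:      16 kB"
            if fields and fields[0].isdigit():
                totals[key] += int(fields[0])
    return totals


# Made-up sample: a large, mostly untouched file mapping plus a small heap.
SAMPLE = """\
7f0000000000-7f0002000000 r--p 00000000 08:01 1234 /usr/lib/locale/locale-archive
Size:              32768 kB
Rss:                  64 kB
Pss:                  16 kB
7f0002000000-7f0002100000 rw-p 00000000 00:00 0
Size:               1024 kB
Rss:                 512 kB
Pss:                 512 kB
"""

totals = sum_smaps_kb(SAMPLE)
print(totals)
```

In this made-up sample the virtual size is dominated by the file mapping, while Rss and Pss stay small. On a Linux host the same function could be fed the contents of /proc/<pid>/smaps for each process in the job.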

Side Note: proc_family.cpp has the following code, meaning that the ImageSize attribute on Win32 is actually the RSS; whoops!  Can we implement the same "bug" for Linux?
#if defined(WIN32)
                // ...
                imgsize += member->m_proc_info->rssize;
#else
                imgsize += member->m_proc_info->imgsize;
#endif


Complaint 2: The schedd writes the following mostly useless updates into the user log:

<c>
    <a n="MyType"><s>JobImageSizeEvent</s></a>
    <a n="Cluster"><i>85516</i></a>
    <a n="EventTypeNumber"><i>6</i></a>
    <a n="EventTime"><s>2012-01-04T17:24:37</s></a>
    <a n="Proc"><i>0</i></a>
    <a n="Subproc"><i>0</i></a>
    <a n="Size"><i>13000472</i></a>
    <a n="CurrentTime"><e>time()</e></a>
</c>
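
For reference, a short sketch of reading the event above with Python's standard xml.etree module. The reading of the value tags (s = string, i = integer, e = expression) is inferred from this one sample and is an assumption, and parse_event is a hypothetical helper that handles a single <c> element rather than a whole log file.

```python
import xml.etree.ElementTree as ET

EVENT_XML = """<c>
    <a n="MyType"><s>JobImageSizeEvent</s></a>
    <a n="Cluster"><i>85516</i></a>
    <a n="EventTypeNumber"><i>6</i></a>
    <a n="EventTime"><s>2012-01-04T17:24:37</s></a>
    <a n="Proc"><i>0</i></a>
    <a n="Subproc"><i>0</i></a>
    <a n="Size"><i>13000472</i></a>
    <a n="CurrentTime"><e>time()</e></a>
</c>"""


def parse_event(xml_text):
    """Turn one <c> element into a dict of attribute name -> value."""
    event = {}
    for attr in ET.fromstring(xml_text).findall("a"):
        if len(attr) == 0:
            continue  # no value element; skip
        value = attr[0]  # first child: <s>, <i> or <e>
        if value.tag == "i":
            event[attr.get("n")] = int(value.text)
        else:
            # Strings and unevaluated expressions are kept as text.
            event[attr.get("n")] = value.text
    return event


event = parse_event(EVENT_XML)
print(sorted(event))
```

Apart from the job identifiers and timestamps, the only measurement that comes out is Size.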


Why do I say "useless"?  I get no indication of the *resource usage* of the job.  Maybe in the single-core, 32-bit Linux days this made sense, but I'd really like to see:
1) RAM usage.
2) CPU usage (think multicore jobs or CPU efficiency).
3) Disk usage.
4) Number of running processes.

I currently get none of these things in the user log.

CPU usage reporting is another unholy mess when querying the schedd.  As far as I know, there's no way to answer "what's the CPU efficiency of the currently running vanilla-universe job, for this current invocation?"  I will buy someone beers if they can answer this and not be ashamed to put it into the manual.
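
To be explicit about what "CPU efficiency" means here, a minimal sketch of the arithmetic, assuming the CPU time and wall-clock time for the current invocation were available. The function and parameter names are hypothetical and do not correspond to existing job attributes; getting these inputs out of the schedd is exactly the hard part.

```python
def cpu_efficiency(user_cpu_s, sys_cpu_s, wall_s, cores=1):
    """CPU time consumed divided by CPU time available for this run."""
    if wall_s <= 0 or cores <= 0:
        raise ValueError("wall_s and cores must be positive")
    return (user_cpu_s + sys_cpu_s) / (wall_s * cores)


# Hypothetical 4-core job: one hour of wall time,
# 9000 s of user CPU and 360 s of system CPU.
eff = cpu_efficiency(9000, 360, 3600, cores=4)
print(round(eff, 2))
```

With these made-up numbers the job used 9360 of the 14400 available core-seconds, so the efficiency is 0.65.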

Proposal:
1) Deprecate the current ResidentSetSize and ProportionalSetSizeKb attributes.  Replace these with one called "RAMUsageKb", which is meant to be Condor's best estimate of the RAM used by the job (in order of increasing accuracy: RSS, PSS, or a cgroups measurement).  I'd also suggest seriously considering deprecating ImageSize, or at least removing it from prominence in the tool output, the documentation, and the default Requirements.  If it can be measured, also report SwapUsageKb.
2) Deprecate the JobImageSizeEvent, and introduce a JobUpdateEvent.  This event will contain information about the resource usage of the job on the host (the things I listed above, plus any new measurements that we might be able to do in the future).
3) Have a consistent, well-defined set of resource-usage attributes in the schedd.  In particular, these should be well-defined with respect to:
   * When they are updated (during the job run, or during state transitions?).
   * Whether they are cumulative over all job runs, or only reflect the most current one.
   * Units used (kudos to Dan B for calling it ProportionalSetSizeKb!).
   * Whether it is the maximum recorded value (ImageSize) or the latest recorded value (ResidentSetSize).
   Bonus beers for placing this knowledge into a table in the manual.  Another bonus if the above information can be encoded into the attribute names themselves.
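
The "best estimate" rule in item 1 could be sketched roughly as follows: take the most accurate measurement that is available, and remember which source it came from. This only illustrates the proposal and is not existing Condor code; the function ram_usage_kb and its parameters are hypothetical.

```python
def ram_usage_kb(rss_kb=None, pss_kb=None, cgroup_kb=None):
    """Return (value_kb, source) using the most accurate available source.

    Preference order follows the proposal: cgroups, then PSS, then RSS.
    """
    candidates = (("cgroup", cgroup_kb), ("pss", pss_kb), ("rss", rss_kb))
    for source, value in candidates:
        if value is not None:
            return value, source
    return None, None  # nothing could be measured


# Hypothetical measurements on a host without cgroups.
value, source = ram_usage_kb(rss_kb=900, pss_kb=700)
print(value, source)
```

Reporting the source next to the value is one way to keep the attribute well-defined even when the measurement method differs between hosts.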

As our systems become more constrained by memory than by CPU, I think having consistent, useful resource reporting metrics will be effort well invested for the project.

Brian
