On Jan 9, 2012, at 3:15 PM, Lans Carstensen wrote:
Greetings!
On Mon, Jan 9, 2012 at 7:33 AM, Matthew Farrellee <matt@xxxxxxxxxx> wrote:
On 01/09/2012 10:04 AM, Dan Bradley wrote:
On 1/7/12 2:01 PM, Brian Bockelman wrote:
Hi all,
After seeing a very high profile user's jobs get nuked (the ones
that produce the final pretty plots for CMS) because I can't
enforce reasonable limits on RAM usage, I think it's a reasonable
time to start thinking about how metrics are reported for running
jobs.
Complaint 1: Condor places an emphasis on ImageSize, an almost
meaningless metric.
Agreed.
What I see people really wanting is slots sized to their current job.
ImageSize allows for a rough approximation by making the slot the upper
bound of job memory use.
(In bad form, responding to multiple at once to catch up).
See, I think this is wrong. ImageSize is not a good enough approximation of a job's memory usage. It's almost always too large; in the case of a complex application, it might be hundreds of megabytes too large.
Actually, you can construct realistic cases where it's neither an upper nor a lower bound for memory usage. That it isn't a lower bound is obvious (VSS counts mappings the job never touches); it isn't an upper bound if the job relies heavily on the page cache to maintain efficiency. Think of fadvise versus mmap+madvise.
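To make that last contrast concrete, here is a minimal illustrative sketch (not from the original mail; the buffer size and access pattern are made up). Both functions stream a large file through the kernel page cache, but only the mmap variant adds the file's full size to the process's VSS, and therefore to ImageSize, even though the resident pages are equally reclaimable.

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <vector>

// Strategy A: read() plus fadvise hints. The data lives in the page cache,
// but the process's VSS only grows by the size of the small buffer.
void scan_with_read(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) return;
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
    std::vector<char> buf(1 << 20);
    while (read(fd, buf.data(), buf.size()) > 0) { /* process buf */ }
    close(fd);
}

// Strategy B: mmap plus madvise hints. The mapping adds the entire file
// size to VSS immediately, even though the touched pages are still just
// reclaimable page cache.
void scan_with_mmap(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) return;
    struct stat st;
    if (fstat(fd, &st) == 0 && st.st_size > 0) {
        void *p = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p != MAP_FAILED) {
            madvise(p, st.st_size, MADV_SEQUENTIAL);
            /* ... walk the mapping ... */
            munmap(p, st.st_size);
        }
    }
    close(fd);
}

A monitor keyed on VSS would flag the second job as hundreds of megabytes "bigger" than the first, while the host-level memory pressure is essentially identical.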
Condor reports the sum of VSS of the processes in the job as the
ImageSize, and displays ImageSize prominently (for example, in the
output of condor_q). ImageSize has little value, especially on
64-bit Linux and modern applications, where things like mmap'ing
files are aggressively used. Because glibc mmaps language files on
64-bit platforms, you're almost guaranteed that all processes have a
VSS of at least ~32MB. As ptmalloc2 (glibc's default malloc
implementation) does not release VSS, the RAM usage and VSS
typically become less related the longer jobs run.
Most administrators actually want to see how much RAM a job is
using, or how much RAM+swap a job is using - i.e., what host
resources is a job using, not the virtualized resource. This is
best measured using cgroups or estimated via PSS.
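As a rough illustration of those two options (this is a sketch, not existing Condor code; the cgroup mount point and group name are assumptions about the local setup):

#include <cstdio>
#include <string>
#include <sys/types.h>

// Estimate via PSS: sum the "Pss:" lines (kB) from /proc/<pid>/smaps.
// A job-wide figure would sum this over every pid in the job's family.
long pss_kb_for_pid(pid_t pid)
{
    char path[64];
    snprintf(path, sizeof(path), "/proc/%d/smaps", (int)pid);
    FILE *f = fopen(path, "r");
    if (!f) return -1;
    long total = 0, val;
    char line[256];
    while (fgets(line, sizeof(line), f)) {
        if (sscanf(line, "Pss: %ld kB", &val) == 1)
            total += val;
    }
    fclose(f);
    return total;
}

// Measure via cgroups: if the job has been placed in its own memory cgroup,
// the kernel's own accounting is available directly. Note that
// memory.usage_in_bytes includes page cache charged to the group; the "rss"
// line of memory.stat isolates anonymous memory if that is what you want.
long long cgroup_memory_bytes(const std::string &cgroup)
{
    std::string path = "/sys/fs/cgroup/memory/" + cgroup + "/memory.usage_in_bytes";
    FILE *f = fopen(path.c_str(), "r");
    if (!f) return -1;
    long long bytes = -1;
    if (fscanf(f, "%lld", &bytes) != 1) bytes = -1;
    fclose(f);
    return bytes;
}

The "best estimate" ordering proposed later in the thread falls out naturally: use the cgroup number when the job is contained, fall back to summed PSS, and fall back to summed RSS (also available in /proc/<pid>/status) only when neither is available.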
Side Note: proc_family.cpp has the following code, meaning that the
ImageSize attribute on Win32 is actually the RSS; whoops! Can we
implement the same "bug" for Linux?
#if defined(WIN32)
// ...
// On Windows the per-process contribution is the resident set size (RSS)...
imgsize += member->m_proc_info->rssize;
#else
// ...while on every other platform it is the virtual image size (VSS).
imgsize += member->m_proc_info->imgsize;
#endif
Complaint 2: The schedd writes the following mostly useless updates
into the user log:
<c>
<a n="MyType"><s>JobImageSizeEvent</s></a>
<a n="Cluster"><i>85516</i></a>
<a n="EventTypeNumber"><i>6</i></a>
<a n="EventTime"><s>2012-01-04T17:24:37</s></a>
<a n="Proc"><i>0</i></a>
<a n="Subproc"><i>0</i></a>
<a n="Size"><i>13000472</i></a>
<a n="CurrentTime"><e>time()</e></a>
</c>
Why do I say "useless"? I get no indication of the *resource
usage* of the job. Maybe in the single-core, 32-bit Linux days
this made sense, but I'd really like to see:
1) RAM usage.
+1
2) CPU usage (think multicore jobs or CPU efficiency).
+1 And for anyone scratching their head thinking loadavg is already adequate, keep in mind that on Linux a D-state (IO-bound) process still contributes 1 to the load average.
3) Disk usage.
+1, where I'm assuming you mean the per-slot local disk usage. I'd also like per-slot local disk IO stats. See my note on D-state processes above, esp. if you have SATA disks deployed on your startds vs. SAS.
Actually, per-slot local disk IO stats have existed since 7.7.0 if you enable cgroups. You get the absolute numbers, but not the rates (see the sketch after this list).
4) Number of running processes.
5) network usage
+1, although while we're talking about it in a more ideal sense, I'd also like to see this split out as network IPC vs. network filesystems/NFS. For those of us using NFS we'd ideally like to see the active mountpoints associated with the slot and some of the very awesome per-mountpoint stats available in /proc/self/mountstats. You could also tie this back to my D-state comments.
An example of network accounting is https://github.com/bbockelm/condor-network-accounting . It comes down to a per-job iptables chain, so anything you can do with iptables can be done: rate limiting or VLAN tagging are possible, though I haven't explored them.
Was not previously aware of mountstats. I'll look and see if it's possible to report it per-job. "Normal" NFS v3/4 mounts might be fairly approachable, while automounts might be a difficult beast to tame.
6) removal or expansion of time()
7) limits used (maybe a job can chirp that it is done with one)
+1, we had a "limit expiration" feature in our previous scheduling software that we really wish we could reproduce in Condor. Say you need to use licensed software for some brief prep of a job (<5 minutes), and then you have a CPU-intensive activity you will be performing for >1 hour. You want to bundle the tasks because you don't want the scheduling overhead of say a 2 minute job that has be run before every hour job. You want to express the concurrency limit for the licensed software, but want that limit to expire in, say, 10 minutes. This can make the resource reservation more accurately represent the license server usage.
I currently get none of these things in the user log.
CPU usage reporting is another unholy mess when querying the
schedd. As far as I know, there's no way to answer "what's the CPU
efficiency of the currently running vanilla-universe job, for this
current invocation?" I will buy someone beers if they can answer
this and not be ashamed to put it into the manual.
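To sketch what a per-invocation answer could look like on the execute side (illustrative only; it assumes Linux, that the caller knows the job's pids and the wall-clock start of the current invocation, and it ignores already-reaped children unless their cutime/cstime are also added):

#include <cstdio>
#include <ctime>
#include <unistd.h>
#include <vector>
#include <sys/types.h>

// utime + stime for one pid, in seconds, from /proc/<pid>/stat.
double cpu_seconds_for_pid(pid_t pid)
{
    char path[64];
    snprintf(path, sizeof(path), "/proc/%d/stat", (int)pid);
    FILE *f = fopen(path, "r");
    if (!f) return 0.0;
    // Field 2 (comm) is parenthesized and may contain spaces, so skip to ')'.
    int c;
    while ((c = fgetc(f)) != EOF && c != ')') { }
    // Skip field 3 (state) through field 13 (cmajflt), then read utime/stime
    // (fields 14 and 15), which are in clock ticks.
    for (int i = 0; i < 11; ++i) fscanf(f, " %*s");
    unsigned long long utime = 0, stime = 0;
    fscanf(f, " %llu %llu", &utime, &stime);
    fclose(f);
    return double(utime + stime) / sysconf(_SC_CLK_TCK);
}

// CPU efficiency for the current invocation: CPU seconds used by the job's
// processes divided by (cores requested * wall seconds since this start).
double cpu_efficiency(const std::vector<pid_t> &job_pids,
                      time_t invocation_start, int cores = 1)
{
    double cpu = 0.0;
    for (pid_t p : job_pids) cpu += cpu_seconds_for_pid(p);
    double wall = difftime(time(nullptr), invocation_start);
    if (wall <= 0.0) return 0.0;
    return cpu / (wall * cores);
}

The hard part of the complaint still stands, though: the schedd-side attributes are cumulative across invocations, so a number like this has to be computed and reported per run, not reconstructed afterwards.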
Proposal:
1) Deprecate the current ResidentSetSize and ProportionalSetSizeKb
attributes. Replace these with one called "RAMUsageKb", which is
meant to be Condor's best estimate of the RAM used by the job (in
order of increasing accuracy, RSS, PSS, or cgroups measurement).
It seems useful to me to have a "best estimate" but also to have metrics
such as VSIZE, RSS, and PSS that may serve their own purpose.
--Dan
+1 have them all, with clear definitions
+1 I'd suggest we also seriously consider deprecating ImageSize, or at
least remove it from prominence in the tool output, the documentation,
and the default Requirements. If it can be measured, also report
SwapUsageKb.
+1, RSS & swap measured independently are far more relevant. And fundamentally, very few folks understand ImageSize. Everyone I've worked with here initially thinks it means something other than what it does, and becomes disappointed - they almost always wanted RSS instead.
These days, s/RSS/PSS/.
2) Deprecate the JobImageSizeEvent, and introduce a JobUpdateEvent.
This event will contain information about the resource usage of the
job on the host (the things I listed above, plus any new
measurements that we might be able to do in the future).
+1
+1, with the assumption that we can control the list of attributes sent and manage the amount of per-slot information change.
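Purely as an illustration of what such an event could carry in the existing userlog XML shape (the attribute names and values below are hypothetical, not an agreed-upon schema):

<c>
<a n="MyType"><s>JobUpdateEvent</s></a>
<a n="Cluster"><i>85516</i></a>
<a n="Proc"><i>0</i></a>
<a n="Subproc"><i>0</i></a>
<a n="EventTime"><s>2012-01-04T17:24:37</s></a>
<a n="RAMUsageKb"><i>1048576</i></a>
<a n="SwapUsageKb"><i>0</i></a>
<a n="CpuUsagePct"><i>97</i></a>
<a n="DiskUsageKb"><i>2097152</i></a>
<a n="NumProcesses"><i>4</i></a>
</c>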
3) Have a consistent, well-defined set of resource-usage attributes
in the schedd. In particular, these should be well-defined with
respect to:
* When they are updated (during the job run, or during state
transitions?).
* Whether they are cumulative over all job runs, or only reflect
the most current one.
* Units used (kudos to Dan B for calling it ProportionalSetSizeKb!).
* Whether it is the maximum recorded value (ImageSize) or the
latest recorded value (ResidentSetSize).
Bonus beers for placing this knowledge into a table in the manual.
Another bonus if the above information can be encoded into the
attribute names themselves.
As our systems become more constrained by memory than by CPU, I
think having consistent, useful resource reporting metrics will be
effort well-invested for the project.
When you say "in the schedd" do you mean "on the job" or are you thinking
about some sort of roll-up into schedd stats?
I'm assuming he means "on the job", although I could see where a "hero stat" like loadavg, but constructed in a more meaningful way, might be incredibly useful in the submitter ads.
Yup, I meant "on the job". BTW, thanks for starting this conversation, Brian. I strongly agree that we need to get better stats pulled out. We have a pretty neat resource prediction system in place here, but we generally had to abandon using Condor's stats to seed it for various reasons and now rely solely upon our own collection at roughly the job wrapper level. It would be very nice if we could get back to the point where our tools just query the resource management system.
Why not push it from the job to the schedd via chirp? And also thanks for mentioning cgroups - I think that's of critical importance to getting the right resource container behavior around jobs also.
Brian