Re: [Condor-devel] Rethinking resource reporting for jobs



Greetings!  On Mon, Jan 9, 2012 at 7:33 AM, Matthew Farrellee
<matt@xxxxxxxxxx> wrote:
> On 01/09/2012 10:04 AM, Dan Bradley wrote:
>
>> On 1/7/12 2:01 PM, Brian Bockelman wrote:
>>>
>>> Hi all,
>>>
>>> After seeing a very high profile user's jobs get nuked (the ones
>>> that produce the final pretty plots for CMS) because I can't
>>> enforce reasonable limits on RAM usage, I think it's a reasonable
>>> time to start thinking about how metrics are reported for running
>>> jobs.
>>>
>>> Complaint 1: Condor places an emphasis on ImageSize, an almost
>>> meaningless metric.
>>
>>
>> Agreed.
>
>
> What I see people really wanting is slots sized to their current job.
> ImageSize allows for a rough approximation by treating the slot's
> memory as the upper bound on the job's memory use.
>
>
>
>>> Condor reports the sum of VSS of the processes in the job as the
>>> ImageSize, and displays ImageSize prominently (for example, in the
>>> output of condor_q).  ImageSize has little value, especially on
>>> 64-bit Linux and modern applications, where things like mmap'ing
>>> files are aggressively used.  Because glibc mmaps locale files on
>>> 64-bit platforms, you're almost guaranteed that every process has a
>>> VSS of at least ~32MB.  As ptmalloc2 (glibc's default malloc
>>> implementation) does not release VSS back to the OS, RAM usage and
>>> VSS typically become less related the longer a job runs.
>>>
>>> Most administrators actually want to see how much RAM a job is
>>> using, or how much RAM+swap a job is using - i.e., what host
>>> resources is a job using, not the virtualized resource.  This is
>>> best measured using cgroups or estimated via PSS.
>>>
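
For concreteness, here is a minimal sketch of the PSS approach on
Linux: sum the "Pss:" lines of /proc/<pid>/smaps.  (This needs a
kernel with smaps/PSS support; the function name below is mine, not
anything in the Condor tree.)

#include <cstdio>
#include <sys/types.h>

// Sum the "Pss:" lines of /proc/<pid>/smaps; returns -1 on error.
long pss_kb_for_pid(pid_t pid)
{
    char path[64];
    snprintf(path, sizeof(path), "/proc/%d/smaps", (int)pid);
    FILE *fp = fopen(path, "r");
    if (!fp) return -1;

    long total_kb = 0;
    char line[256];
    while (fgets(line, sizeof(line), fp)) {
        long kb;
        if (sscanf(line, "Pss: %ld kB", &kb) == 1)
            total_kb += kb;
    }
    fclose(fp);
    return total_kb;
}
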
>>> Side Note: proc_family.cpp has the following code, meaning that the
>>> ImageSize attribute on Win32 is actually the RSS; whoops!  Can we
>>> implement the same "bug" for Linux?
>>> #if defined(WIN32)
>>>                 // ...
>>>                 imgsize += member->m_proc_info->rssize;
>>> #else
>>>                 imgsize += member->m_proc_info->imgsize;
>>> #endif
>>>
>>>
>>> Complaint 2: The schedd writes the following mostly useless updates
>>> into the user log:
>>>
>>> <c>
>>>     <a n="MyType"><s>JobImageSizeEvent</s></a>
>>>     <a n="Cluster"><i>85516</i></a>
>>>     <a n="EventTypeNumber"><i>6</i></a>
>>>     <a n="EventTime"><s>2012-01-04T17:24:37</s></a>
>>>     <a n="Proc"><i>0</i></a>
>>>     <a n="Subproc"><i>0</i></a>
>>>     <a n="Size"><i>13000472</i></a>
>>>     <a n="CurrentTime"><e>time()</e></a>
>>> </c>
>>>
>>>
>>> Why do I say "useless"?  I get no indication of the *resource
>>> usage* of the job.  Maybe in the single core, 32-bit Linux days
>>> this made sense, but I'd really like to see:
>>> 1) RAM usage.

+1

>>> 2) CPU usage (think multicore jobs or CPU efficiency).

+1

And for anyone scratching their head thinking loadavg is already
adequate, keep in mind that on Linux a process in D state
(uninterruptible sleep, i.e. IO-bound) contributes 1 to the load
average even though it isn't consuming CPU.

>>> 3) Disk usage.

+1, where I'm assuming you mean the per-slot local disk usage.  I'd
also like per-slot local disk IO stats - see my note on D-state
processes above, especially if your startds are deployed with SATA
disks rather than SAS.  A sketch of where those numbers could come
from follows.
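
Linux already exposes per-process IO counters that a procd-style
poller could aggregate.  A sketch (names are mine; requires a kernel
with CONFIG_TASK_IO_ACCOUNTING):

#include <cstdio>
#include <sys/types.h>

struct ProcIo { unsigned long long read_bytes, write_bytes; };

// Bytes that actually hit the block layer, from /proc/<pid>/io.
bool proc_io_for_pid(pid_t pid, ProcIo &io)
{
    char path[64];
    snprintf(path, sizeof(path), "/proc/%d/io", (int)pid);
    FILE *fp = fopen(path, "r");
    if (!fp) return false;

    io.read_bytes = io.write_bytes = 0;
    char line[128];
    while (fgets(line, sizeof(line), fp)) {
        sscanf(line, "read_bytes: %llu", &io.read_bytes);
        sscanf(line, "write_bytes: %llu", &io.write_bytes);
    }
    fclose(fp);
    return true;
}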

>>> 4) Number of running processes.
>
>
> 5) network usage

+1, and while we're at it in a more ideal sense, I'd also like to see
this split out as network IPC vs. network filesystems/NFS.  For those
of us using NFS, we'd ideally like to see
the active mountpoints associated with the slot and some of the very
awesome per-mountpoint stats available in /proc/self/mountstats.  You
could also tie this back to my D-state comments.
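
A rough sketch of pulling those per-mountpoint NFS stats, assuming the
statvers=1.1 format (this just prints each NFS mount and its "bytes:"
counters):

#include <fstream>
#include <iostream>
#include <string>

int main()
{
    std::ifstream ms("/proc/self/mountstats");
    std::string line;
    bool in_nfs = false;
    while (std::getline(ms, line)) {
        if (line.compare(0, 7, "device ") == 0) {
            // e.g. "device srv:/export mounted on /mnt with fstype nfs ..."
            in_nfs = line.find("with fstype nfs") != std::string::npos;
            if (in_nfs) std::cout << line << "\n";
        } else if (in_nfs && line.find("bytes:") != std::string::npos) {
            std::cout << " " << line << "\n";
        }
    }
    return 0;
}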

> 6) removal or expansion of time()
> 7) limits used (maybe a job can chirp that it is done with one)

+1, we had a "limit expiration" feature in our previous scheduling
software that we really wish we could reproduce in Condor.  Say you
need to use licensed software for some brief prep of a job (<5
minutes), and then you have a CPU-intensive activity you will be
performing for >1 hour.  You want to bundle the tasks because you
don't want the scheduling overhead of, say, a 2-minute job that has
to be run before every 1-hour job.  You want to express the
concurrency limit
for the licensed software, but want that limit to expire in, say, 10
minutes.  This can make the resource reservation more accurately
represent the license server usage.

>
>>> I currently get none of these things in the user log.
>>>
>>> CPU usage reporting is another unholy mess when querying the
>>> schedd.  As far as I know, there's no way to answer "what's the CPU
>>> efficiency of the currently running vanilla-universe job, for this
>>> current invocation?"  I will buy someone beers if they can answer
>>> this and not be ashamed to put it into the manual.
>>>
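
Since he asked: here's a sketch of what "CPU efficiency of the current
invocation" could mean, measured straight from /proc (utime+stime out
of /proc/<pid>/stat, sampled over a wall-clock interval; no beers
claimed - every name below is mine):

#include <cstdio>
#include <unistd.h>
#include <sys/types.h>

// utime+stime for a pid in seconds (fields 14 and 15 of
// /proc/<pid>/stat; comm is parenthesized and may contain spaces,
// which %*[^)] handles).
double cpu_seconds_for_pid(pid_t pid)
{
    char path[64];
    snprintf(path, sizeof(path), "/proc/%d/stat", (int)pid);
    FILE *fp = fopen(path, "r");
    if (!fp) return -1.0;

    unsigned long utime = 0, stime = 0;
    fscanf(fp,
        "%*d (%*[^)]) %*c %*d %*d %*d %*d %*d %*u %*u %*u %*u %*u %lu %lu",
        &utime, &stime);
    fclose(fp);
    return (double)(utime + stime) / sysconf(_SC_CLK_TCK);
}

// Efficiency over [t0,t1] is then
//   (cpu_seconds(t1) - cpu_seconds(t0)) / (ncores * (t1 - t0)),
// summed over the processes in the job's process family.
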
>>> Proposal:
>>> 1) Deprecate the current ResidentSetSize, ProportionalSetSizeKb
>>> attributes.  Replace these with one called "RAMUsageKb", which is
>>> meant to be Condor's best estimate of the RAM used by the job (in
>>> order of increasing accuracy, RSS, PSS, or cgroups measurement).
>>
>>
>> It seems useful to me to have a "best estimate" but also to have metrics
>> such as VSIZE, RSS, and PSS that may serve their own purpose.
>>
>> --Dan
>
>
> +1 have them all, with clear definitions

+1
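
For the cgroups variant (the most accurate of the three), a sketch
reading the v1 memory controller.  The per-job cgroup path is
hypothetical - the starter would know the real one - and the mount
point varies by distro:

#include <fstream>
#include <string>

// RAM currently charged to a job's cgroup.  Note that
// memory.usage_in_bytes includes page cache; the "rss" line of
// memory.stat is closer to pure RAM usage.
long long job_ram_bytes(const std::string &job_cgroup)
{
    std::string path = "/sys/fs/cgroup/memory/" + job_cgroup +
                       "/memory.usage_in_bytes";
    std::ifstream f(path.c_str());
    long long bytes = -1;
    f >> bytes;
    return bytes;
}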

>>> I'd suggest also seriously consider deprecating ImageSize, or at
>>> least remove it from prominence in the tool output and
>>> documentation and default Requirements.  If it can be measured,
>>> also report SwapUsageKb.

+1, RSS & swap measured independently are far more relevant.  And
fundamentally, very few folks understand ImageSize.  Everyone I've
worked with here initially thinks it means something other than what
it does, and becomes disappointed - they almost always wanted RSS
instead.

>>> 2) Deprecate the JobImageSizeEvent, and introduce a JobUpdateEvent.
>>> This event will contain information about the resource usage of the
>>> job on the host (the things I listed above, plus any new
>>> measurements that we might be able to do in the future).
>
>
> +1

+1, with the assumption that we can control the list of attributes
sent and manage the rate of per-slot information churn.
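
To make the shape concrete, a hypothetical JobUpdateEvent in the same
XML user-log style as today's JobImageSizeEvent might look like this
(the attribute names are invented from the wishlist above, not a
spec):

<c>
    <a n="MyType"><s>JobUpdateEvent</s></a>
    <a n="Cluster"><i>85516</i></a>
    <a n="Proc"><i>0</i></a>
    <a n="Subproc"><i>0</i></a>
    <a n="EventTime"><s>2012-01-04T17:24:37</s></a>
    <a n="RAMUsageKb"><i>1048576</i></a>
    <a n="SwapUsageKb"><i>0</i></a>
    <a n="CpusUsage"><r>1.73</r></a>
    <a n="DiskUsageKb"><i>2097152</i></a>
    <a n="NumProcesses"><i>4</i></a>
</c>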

>
>>> 3) Have a consistent, well-defined set of resource-usage attributes
>>> in the schedd.  In particular, these should be well-defined with
>>> respect to:
>>>   * When they are updated (during the job run, or during state
>>>     transitions?).
>>>   * Whether they are cumulative over all job runs, or only reflect
>>>     the most current one.
>>>   * Units used (kudos to Dan B for calling it ProportionalSetSizeKb!).
>>>   * Whether it is the maximum recorded value (ImageSize) or the
>>>     latest recorded value (ResidentSetSize).
>>>  Bonus beers for placing this knowledge into a table in the manual.
>>>  Another bonus if the above information can be encoded into the
>>>  attribute names themselves.
>>>
>>> As our systems become more constrained by memory than by CPU, I
>>> think having consistent, useful resource reporting metrics will be
>>> effort well-invested for the project.
>
>
> When you say "in the schedd" do you mean "on the job" or are you thinking
> about some sort of roll-up into schedd stats?

I'm assuming he means "on the job", although I could see where a "hero
stat" like loadavg but constructed in a more meaningful way might be
incredibly useful in the submitter ads.

BTW, thanks for starting this conversation Brian.  I strongly agree
that we need to get better stats pulled out.  We have a pretty neat
resource prediction system in place here, but generally had to abandon
using Condor's stats to seed it for various reasons and now solely
rely upon our own collection at roughly our job wrapper level.  It
would be very nice if we could get back to the point where our tools
just query the resource management system.  And thanks also for
mentioning cgroups - I think they're critical to getting the right
resource-containment behavior around jobs as well.

-- Lans Carstensen