Re: [HTCondor-devel] cgroups + SWAP


Date: Tue, 10 Sep 2013 10:58:27 +0000
From: "Krieger, Donald N." <kriegerd@xxxxxxxx>
Subject: Re: [HTCondor-devel] cgroups + SWAP
Hi Everyone,
 
The disk subsystems are an important choke point under many circumstances.
For the swapping problem, perhaps it would be helpful to assign the swap space to a small striped disk subsystem.
This would help isolate the slowdowns caused by swapping from those caused by the more common data movement slowdowns due to large-scale scratch disk usage.
And the disadvantages of a striped system are minimized for swap: for instance, the data stored in swap space is all ephemeral, so there is no need to mirror it.
 
Regards,
 
Don
 

Don Krieger, Ph.D.
Department of Neurological Surgery
University of Pittsburgh
(412) 648-9654 Office
(412) 521-4431 Cell/Text


From: HTCondor-devel [mailto:htcondor-devel-bounces@xxxxxxxxxxx] On Behalf Of Joan J. Piles
Sent: Tuesday, September 10, 2013 6:43 AM
To: Condor Developers
Subject: Re: [HTCondor-devel] cgroups + SWAP

On 09/09/13 21:11, Todd Tannenbaum wrote:
On 9/9/2013 9:22 AM, Joan J. Piles wrote:
Hi Brian,

Well, my idea was more from the POV of a systems guy... Giving the choice
to the users is (for me at least) optional and just a nicety. But I
would like to prevent a user from requesting 1 GB of RAM and
then using over 10 GB, thus making the machine swap like hell and
impacting the rest of the users.

If user A has all of their processes in RAM and user B is swapping like hell, is user A actually impacted by user B? Or does user B only impact other users that have swapping processes?

Well, heavy swapping (say 8 GB, and we've got well over that) can certainly impact the whole system. For instance, all I/O is almost halted (and this includes even condor activity). We allow interactive logins through condor, and they become completely unusable. Any other job wanting to read or write a file (and most of them do, at some point) will be severely impacted, since swap runs with the highest priority.
I know one option is limiting the swap, but I'd rather have a big swap
space just in case, and then limit the swap available to each job
(proportionally to the RAM requested, for instance).


Limiting swap space makes sense in that swap is a shared resource and thus should be requested/managed, but are swap activity and swap size closely correlated? Seems like what you are really worried about is lots of swap activity slowing down response time for all users of the system, not exhaustion of swap space itself...
As a general rule we try to avoid OOM situations because they are very unpredictable, i.e. the kernel kills the wrong process (not the "culprit"), and more often than not some system process is killed and we end up having to reboot the machine physically (or via IPMI, thankfully).

To avoid this we are kind of generous with the swap space, but this doesn't mean we intend all of it to be used; we just leave it there in case something unexpected happens. Until now a user's job running amok was one such occurrence, but now that we have cgroups, we'd like to use them to limit this.

I'm aware that one possible solution would be to limit the memsw usage in the htcondor parent cgroup, and this would limit the swap available to all the condor jobs, but I'd rather have a more fine-grained solution where we can define a policy for each job (because there is no documentation on which job the kernel chooses to reclaim memory from when more than one goes over the limit).

I've already devised a dirty hack using a job wrapper (or hook, yet to be decided) and a setuid binary which would look up its own cgroup in /proc/self/cgroup and tune it accordingly, but as I've said, I think it's a kludge.
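
(For reference, the hack goes roughly along these lines; a minimal sketch assuming cgroup v1 with the memory controller mounted at /sys/fs/cgroup/memory, run with enough privilege to write the limit file, and with the 4 GiB figure as a mere placeholder:)

#!/usr/bin/env python
# Minimal sketch of the wrapper-style hack: find the job's own memory
# cgroup via /proc/self/cgroup and write a combined RAM+swap limit.
# Assumptions: cgroup v1, memory controller at /sys/fs/cgroup/memory,
# and enough privilege (e.g. a setuid helper) to write the limit file.
# The 4 GiB value is only a placeholder.

MEMSW_LIMIT_BYTES = 4 * 1024 ** 3

def own_memory_cgroup():
    # Each line of /proc/self/cgroup looks like "<id>:<controllers>:<path>".
    with open("/proc/self/cgroup") as f:
        for line in f:
            _, controllers, path = line.strip().split(":", 2)
            if "memory" in controllers.split(","):
                return path
    raise RuntimeError("no memory cgroup found for this process")

def set_memsw_limit(limit_bytes):
    limit_file = ("/sys/fs/cgroup/memory" + own_memory_cgroup()
                  + "/memory.memsw.limit_in_bytes")
    with open(limit_file, "w") as f:
        f.write(str(limit_bytes))

if __name__ == "__main__":
    set_memsw_limit(MEMSW_LIMIT_BYTES)

Something along these lines does the job, but it needs a privileged helper and lives outside HTCondor's own cgroup handling, which is part of why it feels like a kludge to me.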

With this I am only explaining my use case, but I'm fairly confident there are other people out there who would also find this useful, and that a tunable knob in HTCondor is worth it, provided it is easy enough to implement.

Regards,

Joan

regards,
Todd

And I think that making this tunable is just a small step that is worth it
for the (admittedly few) users who would profit from it.

Regards,

Joan

On 09/09/13 15:49, Brian Bockelman wrote:
On Sep 4, 2013, at 8:54 AM, Joan J. Piles <jpiles@xxxxxxxxx> wrote:

Hi all,

We have recently started using cgroups to limit the RAM usage of our
users. We want to avoid the situation where badly predicted
requirements can bring down the whole machine where the job is
executing, impacting other users' jobs.

HTCondor does a great job using cgroups to achieve this, but I have
the feeling that this can be improved.

Right now, RAM usage is limited whilst swap is not. I am aware that
you can tune this using swappiness parameters, but it is neither
straightforward nor optimal, and furthermore it is difficult to do on a
per-job basis.

Right now HTCondor tunes the memory.limit_in_bytes or
memory.soft_limit_in_bytes files within the cgroup to limit the RAM
usage.
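
(For concreteness, these are the cgroup v1 knobs involved; a tiny sketch that just reads them back for a job's cgroup, with a made-up slot path:)

# Read back the cgroup v1 memory limits for a job's cgroup.
# The slot path below is a made-up example; adjust for your installation.
CGROUP = "/sys/fs/cgroup/memory/htcondor/condor_slot1"

for name in ("memory.limit_in_bytes",         # hard RAM limit
             "memory.soft_limit_in_bytes",    # soft RAM limit
             "memory.memsw.limit_in_bytes"):  # RAM + swap, not touched today
    with open(CGROUP + "/" + name) as f:
        print(name, "=", f.read().strip())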

I think HTCondor could provide a "request_swap" parameter in the
submit file (and an associated RequestSwap job ClassAd attribute) that would
be used to compute the value for memory.memsw.limit_in_bytes (which
would of course be RequestMemory + RequestSwap).
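
(The arithmetic would be trivial; a sketch assuming both requests are given in MB, as RequestMemory is today:)

# Proposed mapping from the two requests to the cgroup limit,
# assuming both RequestMemory and RequestSwap are expressed in MB.
MB = 1024 * 1024

def memsw_limit_in_bytes(request_memory_mb, request_swap_mb):
    # memory.memsw.limit_in_bytes bounds RAM + swap together,
    # so the value is simply the sum of the two requests.
    return (request_memory_mb + request_swap_mb) * MB

# e.g. a job asking for 1 GB of RAM plus 2 GB of swap:
print(memsw_limit_in_bytes(1024, 2048))  # 3221225472 bytes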

There would also be the associated MODIFY_REQUEST_EXPR_REQUESTSWAP
which could be used (for instance) to limit the amount of swap
reserved to a percentage of the RAM or to provide a sensible (or even
unlimited) default.

What do you think about this idea? I think it could easily piggyback
on the existing cgroup infrastructure without too much hassle.

Hi Joan,

I'm not too hot on this idea - how does the user know what value to
provide for RequestSwap?  Determining a working set size for an
application is a black art; knowing the memory requirements is hard
enough for most users!

Brian


_______________________________________________
HTCondor-devel mailing list
HTCondor-devel@xxxxxxxxxxx
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-devel

-- 
--------------------------------------------------------------------------
Joan Josep Piles Contreras - Systems Analyst
I3A - Instituto de Investigación en Ingeniería de Aragón
Tel: 876 55 51 47 (ext. 845147)
http://i3a.unizar.es -- jpiles@xxxxxxxxx
--------------------------------------------------------------------------