Hi again,
Also, you may want to place your memory limit policy on the
execute nodes via a startd policy expression, instead of having
it enforced on the submit machine (what I think you are
calling the head node). The reason is that the execute node policy
is evaluated every five seconds, while the submit machine policy
is only evaluated every several minutes.
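For example, a policy fragment along these lines in the startd
config would do it (only a sketch: the attribute test and values
are illustrative; MemoryUsage and the slot's Memory are both in MB):

  # evict jobs whose measured memory exceeds what the slot provisioned
  MEMORY_EXCEEDED = (MemoryUsage =!= UNDEFINED) && (MemoryUsage > Memory)
  PREEMPT = ($(PREEMPT:False)) || ($(MEMORY_EXCEEDED))
  WANT_SUSPEND = False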
I read that the submit machine has evaluated the expression every 60
seconds since version 7.4, though admittedly the blog post I read is
quite old, so things might have changed again:
https://spinningmatt.wordpress.com/2009/12/05/cap-job-runtime-debugging-periodic-job-policy-in-a-condor-pool/
I'm trying to look at both ResidentSetSize_RAW and MemoryUsage on
the schedd machine, and it actually takes a full 15 minutes before
either gets a value assigned (unless I misunderstood the time
attributes):
condor_q -autoformat MemoryUsage ResidentSetSize_RAW ClusterId Owner '(ServerTime-LastMatchTime)/60' | sort -rnk5
[....]
1221 1104640 101393 atlprd002 22
undefined undefined 101767 atlprd007 15
undefined undefined 101409 atlprd002 8
undefined undefined 101779 atlpil017 4
[....]
A runaway job could consume a lot of memory in a few minutes :).
Do you mean I should move SYSTEM_PERIODIC_REMOVE to the WNs? Or is
there another recipe? The recipe I wrote and am using is also used
by several other sites.
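For reference, the schedd-side recipe is essentially of this shape
(the multiplier here is illustrative, not the exact expression we
use; both attributes are in MB):

  SYSTEM_PERIODIC_REMOVE = (MemoryUsage =!= UNDEFINED) && (MemoryUsage > 3 * RequestMemory)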
2) Shouldn't HTCondor set the job's soft
limit with this configuration? Or is the site expected to
set the soft limit separately?
Personally, I think "soft" limits in cgroups are completely
bogus. The way the Linux kernel treats soft limits does not, in
practice, do what anyone (including HTCondor itself) expects. I
recommend setting CGROUP_MEMORY_LIMIT to either none or hard;
soft makes no sense, imho.
"CGROUP_MEMORY_LIMIT=hard" is clear to understand: if the job
uses more memory than it requested, it is __immediately__ kicked
off and put on hold. This way users get a consistent
experience.
If you want jobs to be able to go over their requested memory so
long as the machine isn't swapping, consider disabling swap on
your execute nodes (not a bad idea for compute servers in
general) and simply leaving "CGROUP_MEMORY_LIMIT=none". What
will happen is that, if the system is stressed, eventually the Linux
OOM (out-of-memory) killer will kick in and pick a process to
kill.
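Concretely, that combination is just something like the following
on the execute nodes (the knob is the one discussed above; the swap
command is the standard Linux one, and remember to also remove the
swap entry from /etc/fstab so it stays off after a reboot):

  # condor configuration
  CGROUP_MEMORY_LIMIT = none

  # disable swap immediately
  swapoff -a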
At the moment there are no limits set in cgroups, i.e. the limit
is practically infinite, so either policy - soft or hard -
might not work (the OOM killer doesn't kick in). This is why sites
are setting SYSTEM_PERIODIC_REMOVE. The machines were stressed
because the application was using up to 15 times what it
requested. For example, using stress I just submitted a job that
uses 80GB of memory on a machine that has 64GB of RAM.
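For context, the submit file was roughly of this shape (the stress
arguments and request_memory here are illustrative rather than my
exact values):

  universe       = vanilla
  executable     = /usr/bin/stress
  arguments      = --vm 8 --vm-bytes 10G --timeout 600
  request_memory = 4096
  queue

Watching the job's cgroup on the worker node: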
[root@wn2208290 ~]# for a in $(seq -w 1 50); do
egrep '^rss|^swap'
/sys/fs/cgroup/memory/system.slice/condor.service/condor_scratch_condor_pool_condor_slot1_1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/memory.stat|grep
-v huge; sleep 5 ;echo;done
rss 65429585920
swap 457154560
[..]
rss 65468846080
swap 1864413184
It is happily filling the swap. I don't think removing swap is
a good idea, but the sum RAM+swap should indeed be limited to
either a multiple of what is requested or a default maximum limit. If
I put a 4GB soft limit on the job it does bring the memory down
to 40GB, but the limit doesn't affect the swap, which starts
increasing at a faster pace.
[root@wn2208290 ~]# for a in $(seq -w 1 50); do
egrep '^rss|^swap'
/sys/fs/cgroup/memory/system.slice/condor.service/condor_scratch_condor_pool_condor_slot1_1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/memory.stat|grep
-v huge; sleep 5 ;echo;done
rss 64724119552
swap 16926076928
[....]
rss 64367165440
swap 21876707328
However, the soft limit is the only thing it lets me set with a
brute-force echo redirection. The general memory limit and the memsw
limit give errors:
echo 4G >
/sys/fs/cgroup/memory/system.slice/condor.service/condor_scratch_condor_pool_condor_slot1_1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/memory.limit_in_bytes
-bash: echo: write error: Device or resource busy
echo 4G >
/sys/fs/cgroup/memory/system.slice/condor.service/condor_scratch_condor_pool_condor_slot1_1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/memory.memsw.limit_in_bytes
-bash: echo: write error: Invalid argument
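For the record, the write that does succeed is the soft one, i.e.
the same echo as above but pointed at memory.soft_limit_in_bytes in
that cgroup directory:

  echo 4G > <same cgroup path as above>/memory.soft_limit_in_bytes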
So my plan to set this stuff on the fly doesn't seem feasible. I
wonder if any of the condor daemons that actually create the
jobs' cgroups could set the limits at creation time? I'm just
getting started with cgroups: reading the docs I thought things
were quite straightforward, but now I'm confused about how it works.
cheers
alessandra
HTCondor
sets the OOM priority of job processes such that the OOM killer
should always pick job processes ahead of other processes on the
system. Furthermore, HTCondor "captures" the OOM request to
kill a job and only allows it to continue if the job is indeed
using more memory than requested (i.e. provisioned in the slot).
This is probably what you wanted by setting the limit to soft in
the first place.
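If you want to see this on an execute node, the standard Linux knob
for this kind of OOM preference is /proc/<pid>/oom_score_adj
(oom_adj on older kernels), so for one of a running job's processes,
with <job_pid> as a placeholder:

  cat /proc/<job_pid>/oom_score_adj

A higher value there means the OOM killer prefers that process.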
I am thinking we should remove the "soft" option to
CGROUP_MEMORY_LIMIT in future releases; it just causes confusion,
imho. Curious whether others on the list disagree...
Hope the above helps,
regards,
Todd
--
Respect is a rational process. \\//
Fatti non foste a viver come bruti, ma per seguir virtute e canoscenza(Dante)
For Ur-Fascism, disagreement is treason. (U. Eco)
But but but her emails... covfefe!
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/