Re: [HTCondor-users] Limiting memory used on the worker node with c-groups
On 4/24/20 5:16 PM, tpdownes@xxxxxxxxx wrote:
JM-
When something in the universe goes wrong with HTCondor and CGroups, I
feel a little twitch. When you say the processes are in the "deferred"
state, do you mean they are in the "D" state according to ps? Or do you
mean the actual literal "job deferral" options in "htcondor"?
Hello Tom,
Thank you very much. You are right, I misused the term "deferred",
I was talking about "D" state.
https://support.microfocus.com/kb/doc.php?id=7002725
A common reason for a job getting stuck in D is a bad / overloaded
remote filesystem (NFS, etc.). Is that a possibility here?
Using the command from the article you linked, I see lines such as:
ps -eo ppid,pid,user,stat,pcpu,comm,wchan:32 | grep sgmali
[...]
30138 30333 sgmali0+ D 87.4 aliroot mem_cgroup_oom_synchronize
30341 30435 sgmali0+ D 0.3 perl mem_cgroup_oom_synchronize
12455 30605 sgmali0+ D 0.0 perl mem_cgroup_oom_synchronize
12594 30869 sgmali0+ D 0.0 perl mem_cgroup_oom_synchronize
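
For anyone who wants to summarise this kind of output across many workers, here is a minimal Python sketch. It assumes the same column order as the ps command above (ppid, pid, user, stat, pcpu, comm, wchan) and that the command name contains no spaces. The function name d_state_by_wchan is made up for illustration. It only counts processes whose STAT starts with "D", grouped by wait channel.

```python
from collections import Counter


def d_state_by_wchan(ps_output: str) -> Counter:
    """Count processes in uninterruptible sleep (STAT starts with 'D'),
    grouped by the kernel wait channel shown in the last column."""
    counts = Counter()
    for line in ps_output.splitlines():
        fields = line.split()
        # Assumed columns: ppid pid user stat pcpu comm wchan
        if len(fields) < 7 or not fields[0].isdigit():
            continue  # skip the header, "[...]" and malformed lines
        stat, wchan = fields[3], fields[6]
        if stat.startswith("D"):
            counts[wchan] += 1
    return counts


# Sample taken from the lines quoted above.
sample = """\
30138 30333 sgmali0+ D 87.4 aliroot mem_cgroup_oom_synchronize
30341 30435 sgmali0+ D 0.3 perl mem_cgroup_oom_synchronize
12455 30605 sgmali0+ D 0.0 perl mem_cgroup_oom_synchronize
12594 30869 sgmali0+ D 0.0 perl mem_cgroup_oom_synchronize
"""

if __name__ == "__main__":
    for wchan, n in d_state_by_wchan(sample).most_common():
        print(f"{n:4d} {wchan}")
```

On a worker, the input could come from the ps command above without the grep. A large count on a single wait channel points at where the processes are blocked.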
FYI: even if you didn't understand my presentation, you made the type of
choice I recommend. Use "soft" but lie a bit about how much RAM you
have. It allows more jobs to match while still ensuring that CGroups can
do its job.
It is always more difficult to fully understand slides if you do not
hear the presenter :-) I hope there is no perceived offense here.
Anyway,
a) these processes in "D" state started to appear after I activated the
"soft" mode on workers, so I think there is a link.
b) I do not exclude the possibility that the jobs themselves are
reacting badly to a signal. These are production jobs of the
LHC ALICE VO, and I am only running this VO, so I have no other
workload to compare against.
c) meanwhile I modified one worker to use the "hard" mode and it seems
to behave OK. I did not find removed jobs on this worker in the last
24h or so. This is one point I did not understand: what is the
potential issue with the "hard" mode?
Thank you.
JM
--
------------------------------------------------------------------------
Jean-michel BARBET | Tel: +33 (0)2 51 85 84 86
Laboratoire SUBATECH Nantes France | Fax: +33 (0)2 51 85 84 79
CNRS-IN2P3/Ecole des Mines/Universite | E-Mail: barbet@xxxxxxxxxxxxxxxxx
------------------------------------------------------------------------