From: Bob Ball <ball@xxxxxxxxx> Date: 03/16/2016 11:00 PM
> It is interesting, we are getting a fair number of OOM Holds these days,
> but only a few seem to end this way with the WN locked up. One I just
> observed started near the end of what I would call a small storm in the
> number of kernel process creations on the WN. Typical is around <25/s,
> and this one was running around 600-700/s. I am leaving this WN "as-is"
> until at least tomorrow should there be anything I could pull out of
> this for you.
I've seen this sort of thing before too - I think what might have been
happening in the instances where the machine locked up is that the memory
ballooning was happening too quickly for the OOM killer to cope with, and
it ran out of memory itself. That's just a theory, since I never bothered
to peel that onion - I had plenty of other unrelated things to make me
weep.
At the time I was running RHEL 6.5. I found that after I got up to
RHEL 6.7 - having skipped 6.6 - it stopped happening, and my exec nodes
stayed up for months at a time despite the best efforts of my users. I'm
not sure whether that's because they straightened out their code or
because some fix improved the reliability and effectiveness of cgroups
and the OOM killer, but there you have it.
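
If it helps to catch the next process-creation storm as it starts, here is a rough sketch of how the rate could be sampled on a WN. It assumes the figure of interest is the cumulative "processes" counter in /proc/stat (forks since boot); the function names, the 5-second interval, and the idea of polling at all are my own illustration, not something taken from the setup described above.

```python
import time


def parse_forks(stat_text):
    """Return the cumulative fork count from the text of /proc/stat."""
    for line in stat_text.splitlines():
        fields = line.split()
        # The line of interest looks like: "processes 123456"
        if len(fields) >= 2 and fields[0] == "processes":
            return int(fields[1])
    raise ValueError("no 'processes' line found")


def fork_rate(before, after, seconds):
    """Process creations per second between two counter readings."""
    return (after - before) / seconds


def sample_fork_rate(interval=5.0, path="/proc/stat"):
    """Read the counter twice, `interval` seconds apart (hypothetical helper)."""
    with open(path) as f:
        before = parse_forks(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = parse_forks(f.read())
    return fork_rate(before, after, interval)
```

Calling sample_fork_rate() in a loop and logging anything well above the usual baseline would at least timestamp the start of a storm, assuming the node stays responsive long enough to write the log line.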