[HTCondor-devel] OoM detection bug (Re: [HTCondor-users] out-of-memory event?)


Date: Thu, 26 Oct 2017 16:13:10 +0200
From: Joan Josep Piles-Contreras <jpiles@xxxxxxxxxxxxxxxx>
Subject: [HTCondor-devel] OoM detection bug (Re: [HTCondor-users] out-of-memory event?)
Sorry for the cross-posting.

Well, I think I know the reason for this bug, since I'm afraid I'm the one who contributed this piece of code [1].

The cgroup OoM event is also triggered when a cgroup is deleted, so condor uses a shortcut to decide whether it is seeing a true OoM event or a "fake" one generated by the destruction of the cgroup (that is, by a normal job completion).

The shortcut is to check whether there are still processes in the cgroup: the assumption is that if the job has finished, all of its processes will have been killed by the time condor reaches this point. As I said back then, it didn't look too clean, but it did the job.
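For reference, this is roughly how that check works. This is a simplified, self-contained sketch, not the actual HTCondor code: it assumes cgroup v1 with the memory controller mounted at /sys/fs/cgroup/memory and a made-up job cgroup called "htcondor/job_1", and it omits all error checking.

#include <sys/eventfd.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>
#include <fstream>
#include <string>

int main() {
    const std::string cg = "/sys/fs/cgroup/memory/htcondor/job_1";

    // Ask the kernel for OoM notifications on this cgroup by writing
    // "<eventfd> <fd of memory.oom_control>" into cgroup.event_control.
    int efd = eventfd(0, 0);
    int ofd = open((cg + "/memory.oom_control").c_str(), O_RDONLY);
    std::ofstream ctl(cg + "/cgroup.event_control");
    ctl << efd << " " << ofd << "\n";
    ctl.close();

    // Block until the kernel signals the event. This fires on a real OoM,
    // but also when the cgroup is torn down at normal job completion.
    uint64_t count;
    read(efd, &count, sizeof(count));

    // The shortcut: if the cgroup still contains processes, treat the
    // event as a true OoM; if it is empty, assume the job finished and
    // the event came from the cgroup's destruction.
    std::ifstream procs(cg + "/cgroup.procs");
    int pid;
    if (procs >> pid)
        std::printf("processes remain -> treated as a true OoM event\n");
    else
        std::printf("cgroup empty -> normal teardown, not an OoM\n");

    close(ofd);
    close(efd);
    return 0;
}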

Now the problem (and it's something we've also experienced) is that a job can sometimes end while leaving "stray" processes behind, either because they couldn't be killed or for some other reason (daemon processes, perhaps?). In that case the event will be wrongly processed as an OoM event.

It could be that the real bug is that those processes are not killed when they should be, but there are situations where not even "kill -9" does the trick (e.g., processes stalled on I/O). A better mechanism for detecting a true OoM situation would be needed; the best candidate I can think of is sketched below, but regrettably I'm not confident enough in my cgroups knowledge to say whether it would be reliable.
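The idea (untested, so take it with a grain of salt) would be to read the under_oom flag from memory.oom_control after the event fires; the kernel sets it while a cgroup is in an OoM situation. Again this assumes cgroup v1, and the helper name is my own invention:

#include <fstream>
#include <string>

// Return true if the given cgroup (relative to the memory controller's
// root) currently reports "under_oom 1" in its memory.oom_control file.
bool cgroup_under_oom(const std::string& cgroup) {
    std::ifstream f("/sys/fs/cgroup/memory/" + cgroup + "/memory.oom_control");
    std::string key;
    int value;
    while (f >> key >> value) {
        if (key == "under_oom")
            return value == 1;
    }
    // File missing or unreadable: the cgroup is already gone, so this
    // cannot be a live OoM situation.
    return false;
}

My worry is that unless oom_kill_disable is set, the kernel clears under_oom again as soon as it has killed something, so the flag may already be gone by the time condor gets to read it.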

I hope somebody knows cgroups better than me and can say which would be the proper check in this case.

Best,

Joan

[1]: https://www-auth.cs.wisc.edu/lists/htcondor-users/2013-July/msg00103.shtml

On 10/26/2017 03:10 PM, Michael Di Domenico wrote:
On Thu, Oct 12, 2017 at 10:33 AM, Michael Di Domenico
<mdidomenico4@xxxxxxxxx> wrote:
I found further evidence in the StarterLog:

"job was held due to OOM event: job has encountered an out-of-memory event"

However, when I look through the system logs, the OOM killer doesn't
seem to have killed anything.

Turns out this might have been a bit of a red herring. After several
days I finally tracked down that the jobs were failing only on a few
specific hosts, and at exactly the same time every day. It turns out
there is a cron job on those machines that does 'systemctl restart
gdm.service'.

It's not clear exactly why restarting gdm kills off the jobs, and it's
also not clear why condor thinks this is an out-of-memory event. My
own supposition is that condor assumes that if a job is killed by
someone other than itself, it must have been OOM. But I don't know the
code, so I'm likely wrong.


--
Dr. Joan Josep Piles-Contreras
ZWE Scientific Computing
Max Planck Institute for Intelligent Systems
(p) +49 7071 601 1750

