[HTCondor-devel] OoM detection bug (Re: [HTCondor-users] out-of-memory event?)


Date: Thu, 26 Oct 2017 16:13:10 +0200
From: Joan Josep Piles-Contreras <jpiles@xxxxxxxxxxxxxxxx>
Subject: [HTCondor-devel] OoM detection bug (Re: [HTCondor-users] out-of-memory event?)
Sorry for the cross-posting.

Well, I think I know the reason for this bug, since I'm afraid I'm the one who contributed this piece of code [1].

The cgroup OoM event is also triggered when a cgroup is deleted, so condor uses a shortcut to decide whether it is seeing a true OoM event or a "fake" one generated by the destruction of the cgroup (that is, by a normal job completion).

The shortcut is to check whether there are still processes in the cgroup: the assumption is that if the job has finished, all of its processes will have been killed by the time condor reaches this point. As I said back then, it didn't look too clean, but it did the job.
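For reference, this is roughly how that check works. This is a simplified, self-contained sketch, not the actual HTCondor code: it assumes cgroup v1 with the memory controller mounted at /sys/fs/cgroup/memory and a made-up job cgroup called "htcondor/job_1", and it omits all error checking.

#include <sys/eventfd.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>
#include <fstream>
#include <string>

int main() {
    const std::string cg = "/sys/fs/cgroup/memory/htcondor/job_1";

    // Ask the kernel for OoM notifications on this cgroup by writing
    // "<eventfd> <fd of memory.oom_control>" into cgroup.event_control.
    int efd = eventfd(0, 0);
    int ofd = open((cg + "/memory.oom_control").c_str(), O_RDONLY);
    std::ofstream ctl(cg + "/cgroup.event_control");
    ctl << efd << " " << ofd << "\n";
    ctl.close();

    // Block until the kernel signals the event. This fires on a real OoM,
    // but also when the cgroup is torn down at normal job completion.
    uint64_t count;
    read(efd, &count, sizeof(count));

    // The shortcut: if the cgroup still contains processes, treat the
    // event as a true OoM; if it is empty, assume the job finished and
    // the event came from the cgroup's destruction.
    std::ifstream procs(cg + "/cgroup.procs");
    int pid;
    if (procs >> pid)
        std::printf("processes remain -> treated as a true OoM event\n");
    else
        std::printf("cgroup empty -> normal teardown, not an OoM\n");

    close(ofd);
    close(efd);
    return 0;
}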

Now the problem (and it's something we've also experienced) is that a job can sometimes end while leaving "stray" processes behind, either because they couldn't be killed or for some other reason (daemon processes, perhaps?). In that case the event will be wrongly processed as an OoM event.

It could be that the real bug is that those processes are not killed when they should be, but there are situations where not even "kill -9" does the trick (e.g., processes stalled on I/O). A better mechanism for detecting a true OoM situation would be needed; the best candidate I can think of is sketched below, but regrettably I'm not confident enough in my cgroups knowledge to say whether it would be reliable.
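The idea (untested, so take it with a grain of salt) would be to read the under_oom flag from memory.oom_control after the event fires; the kernel sets it while a cgroup is in an OoM situation. Again this assumes cgroup v1, and the helper name is my own invention:

#include <fstream>
#include <string>

// Return true if the given cgroup (relative to the memory controller's
// root) currently reports "under_oom 1" in its memory.oom_control file.
bool cgroup_under_oom(const std::string& cgroup) {
    std::ifstream f("/sys/fs/cgroup/memory/" + cgroup + "/memory.oom_control");
    std::string key;
    int value;
    while (f >> key >> value) {
        if (key == "under_oom")
            return value == 1;
    }
    // File missing or unreadable: the cgroup is already gone, so this
    // cannot be a live OoM situation.
    return false;
}

My worry is that unless oom_kill_disable is set, the kernel clears under_oom again as soon as it has killed something, so the flag may already be gone by the time condor gets to read it.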

I hope somebody knows cgroups better than me and can say which would be the proper check in this case.

Best,

Joan

[1]: https://www-auth.cs.wisc.edu/lists/htcondor-users/2013-July/msg00103.shtml

On 10/26/2017 03:10 PM, Michael Di Domenico wrote:
On Thu, Oct 12, 2017 at 10:33 AM, Michael Di Domenico
<mdidomenico4@xxxxxxxxx> wrote:
I found further evidence in the StarterLog:

"job was held due to OOM event: job has encountered an out-of-memory event"

However, when I look through the system logs, the OOM killer doesn't
seem to have killed anything.

Turns out this might have been a bit of a red herring. After several
days I finally tracked down that the jobs were failing only on a few
specific hosts, and at exactly the same time every day. It turns out
there is a cron job on those machines that does 'systemctl restart
gdm.service'.

It's not clear exactly why restarting gdm kills off the jobs, and it's
also not clear why condor thinks this is an out-of-memory event. My
own supposition is that condor assumes that if a job is killed by
someone other than itself, it must have been OOM. But I don't know the
code, so I'm likely wrong.


--
Dr. Joan Josep Piles-Contreras
ZWE Scientific Computing
Max Planck Institute for Intelligent Systems
(p) +49 7071 601 1750

