[Condor-devel] Improved process tracking for RHEL4/5
- Date: Sat, 28 May 2011 12:00:59 -0500
- From: Brian Bockelman <bbockelm@xxxxxxxxxxx>
- Subject: [Condor-devel] Improved process tracking for RHEL4/5
Hi all,
Our local PBS cluster has had long-standing issues with user jobs launching daemons (mostly MPI-related). After getting frustrated with the latest round of issues, I wrote a little program I call "proc_police":
svn://t2.unl.edu/brian/proc_police
http://koji.hep.caltech.edu/koji/buildinfo?buildID=565
This is based on ideas written up here by the author of upstart:
http://netsplit.com/2011/02/09/the-proc-connector-and-socket-filters/
Basically, it subscribes a socket to a kernel event feed (the proc connector) that notifies the program of every fork and exit on the system. I use these events to determine whether a process belongs to the batch system, maintain an in-memory process tree, and kill any process that re-parents to init.
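
To make the bookkeeping concrete, here is a minimal Python sketch of the in-memory tree described above. It is not the proc_police code: the class name ProcessTree and the methods on_fork/on_exit are hypothetical, and it assumes that when a tracked process exits, its surviving children are re-parented to init (pid 1). The fork and exit events themselves would come from the kernel feed, which is left out here.

```python
class ProcessTree:
    """Hypothetical tracker for descendants of one batch-system daemon."""

    INIT_PID = 1

    def __init__(self, root_pid):
        # pid -> parent pid, for tracked processes only
        self.parent = {root_pid: None}
        # pid -> set of live tracked children
        self.children = {root_pid: set()}

    def on_fork(self, parent_pid, child_pid):
        """Record a fork; return True if the child belongs to the batch system."""
        if parent_pid not in self.parent:
            return False  # unrelated process, ignore it
        self.parent[child_pid] = parent_pid
        self.children[child_pid] = set()
        self.children[parent_pid].add(child_pid)
        return True

    def on_exit(self, pid):
        """Record an exit; return the tracked pids that are now orphaned.

        Assumption: orphans are re-parented to init, so the caller
        would treat the returned pids as candidates to kill.
        """
        if pid not in self.parent:
            return []
        orphans = sorted(self.children.pop(pid))
        parent_pid = self.parent.pop(pid)
        if parent_pid in self.children:
            self.children[parent_pid].discard(pid)
        for orphan in orphans:
            self.parent[orphan] = self.INIT_PID
        return orphans
```

For example, if the batch daemon is pid 100, a job 200 forks a helper 300, and 200 then exits, on_exit(200) returns [300]. Orphans stay in the tree, so anything they fork before being killed is still tracked.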
It's a very resilient technique, but no replacement for cgroups-based tracking. A sufficiently aggressive fork-bomb should be able to overwhelm the socket buffer -- but the kernel at least informs us when this happens.
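
As a rough illustration of how the overflow shows up, assuming the feed is read from a netlink socket as in the article linked above: netlink(7) documents that a full receive buffer is reported to the application as an ENOBUFS error on receive. The helper name read_event and the OVERFLOW sentinel below are hypothetical, not part of proc_police.

```python
import errno

# Sentinel meaning "the kernel dropped events; the in-memory tree may be stale".
OVERFLOW = object()


def read_event(sock, bufsize=4096):
    """Return the next raw message, or OVERFLOW if events were lost."""
    try:
        return sock.recv(bufsize)
    except OSError as exc:
        if exc.errno == errno.ENOBUFS:
            return OVERFLOW
        raise  # any other error is not an overflow; let the caller see it
```

What to do on OVERFLOW is a design choice. One option, which is an assumption here and not necessarily what proc_police does, is to rebuild the process tree from /proc before resuming.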
Anyhow, I thought it would be worth sharing with this group - it would be a fairly good addition to the procd on Linux, especially as it only requires a 2.6 or later kernel.
Brian