HTCondor Project List Archives




[Condor-devel] Improved process tracking for RHEL4/5



Hi all,

Our local PBS cluster has had long-standing issues with user jobs launching daemons (mostly MPI-related).  After getting frustrated with the latest round of issues, I wrote a little program I call "proc_police":

svn://t2.unl.edu/brian/proc_police
http://koji.hep.caltech.edu/koji/buildinfo?buildID=565

This is based on ideas written up here by the author of upstart:

http://netsplit.com/2011/02/09/the-proc-connector-and-socket-filters/

Basically, it subscribes a netlink socket to the kernel's proc connector, an event feed that notifies the program of every fork and exit on the system.  I use these events to determine whether a process is related to the batch system, maintain an in-memory process tree, and kill any batch-related process that re-parents to init.
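
To make the bookkeeping concrete, here is a minimal sketch of the tree tracking in Python.  It is not the proc_police code: the class and method names (ProcTracker, on_fork, on_exit) are made up for illustration, events are passed in as plain pids rather than read from the kernel, and the sketch only reports which pids would be killed instead of sending a signal.

```python
INIT_PID = 1  # orphaned processes are re-parented to init


class ProcTracker:
    """Hypothetical tracker: follows descendants of batch-system pids."""

    def __init__(self, batch_roots):
        # pid -> parent pid, for batch-related processes only.
        self.parent = {pid: None for pid in batch_roots}

    def on_fork(self, parent_pid, child_pid):
        # A child is batch-related only if its parent is already tracked;
        # forks elsewhere in the OS are ignored.
        if parent_pid in self.parent:
            self.parent[child_pid] = parent_pid

    def on_exit(self, pid):
        """Return the tracked pids orphaned by this exit (kill candidates)."""
        if pid not in self.parent:
            return []
        del self.parent[pid]
        # Tracked children of the exiting process are now re-parented to init.
        orphans = sorted(c for c, p in self.parent.items() if p == pid)
        for child in orphans:
            self.parent[child] = INIT_PID
        return orphans


tracker = ProcTracker([100])      # assume pid 100 is the batch daemon
tracker.on_fork(100, 200)         # job script
tracker.on_fork(200, 300)         # daemon launched by the job
print(tracker.on_exit(200))       # job script exits; 300 is orphaned
```

A real implementation would call kill on each returned pid; the resulting exit events then cascade through any remaining descendants.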

It's a very resilient technique, but no replacement for cgroups-based tracking.  A sufficiently aggressive fork-bomb should be able to overwhelm the socket buffer -- but the kernel at least informs us when this happens.
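
As a sketch of how that notification can be handled: on a netlink socket, a receive-buffer overflow shows up as an ENOBUFS error from the next receive call (see netlink(7)).  The helper below is hypothetical (read_event is my name, not something in proc_police), and the suggestion to rescan /proc afterwards is an assumption about how one might resynchronize, not a description of what proc_police does.

```python
import errno


def read_event(sock, bufsize=4096):
    """Return raw event bytes, or None if the kernel dropped events."""
    try:
        return sock.recv(bufsize)
    except OSError as exc:
        if exc.errno == errno.ENOBUFS:
            # The socket buffer overflowed and events were lost; the
            # in-memory tree can no longer be trusted.  The caller should
            # rebuild it, for example by rescanning /proc.
            return None
        raise
```

Returning None rather than raising keeps the overflow case in the normal control flow of the event loop; any other socket error still propagates.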

Anyhow, I thought it would be worth sharing with this group: it would be a fairly good addition to the procd on Linux, especially as it only requires a 2.6 or later kernel.

Brian