RE: [Condor-users] condor_rm not killing subprocesses
- Date: Mon, 6 Jun 2005 18:28:07 +0100 (BST)
- From: Bruce Beckles <mbb10@xxxxxxxxx>
- Subject: RE: [Condor-users] condor_rm not killing subprocesses
> I'm a little confused by your note of no operating system support. I
> have indicated a reliable way of finding these processes, at least on
> Linux. I now only seek some way of having Condor use this method, even
> if it means wrapping the condor executables.
If you are using the dynamically linked Condor executables you could
always write your own replacement signalling functions, put them in a
library and use LD_PRELOAD to have your library handle Condor's kill
signals. That's a non-trivial amount of work, of course, but it would do
it.
Alternatively, you could (using the USER_JOB_WRAPPER feature) arrange for
a process to start up that ptraces the user's job's main process, and,
when it dies, sends a kill signal to all its children. That is also a bit
nasty to implement, and there are various edge-cases you need to consider.
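A ptrace-based watcher is too involved to sketch here. A lighter variant of the same idea, which is a substitute technique and not the ptrace approach itself, is a wrapper that stays alive as the job's parent, runs the job in its own session, and kills whatever remains in that process group once the main process exits. The function name run_and_reap is hypothetical, and the sketch assumes the stray children do not call setsid() or otherwise leave the group (one of the edge-cases mentioned above).

```python
import os
import signal
import subprocess


def run_and_reap(argv):
    """Run argv in a new session, then SIGKILL anything left in its group."""
    # start_new_session=True makes the child call setsid(), so the child's
    # PID is also the ID of a fresh process group that its children inherit.
    child = subprocess.Popen(argv, start_new_session=True)

    def forward(signum, _frame):
        # Pass a termination request on to the whole job group.
        try:
            os.killpg(child.pid, signum)
        except ProcessLookupError:
            pass

    previous = signal.signal(signal.SIGTERM, forward)
    try:
        status = child.wait()  # waits for the main process only
    finally:
        signal.signal(signal.SIGTERM, previous)
        try:
            # Kill any children the main process left behind.
            os.killpg(child.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # nothing left in the group
    return status
```

Note that this cannot help if the wrapper itself is killed with SIGKILL, since the cleanup code then never runs.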
A simpler solution is to use either the system cron facility or Condor's
STARTD_CRON facility to run a job once a minute that checks to see if a
Condor job is supposed to be running; if so it exits - if not, it looks
for any stray child processes and kills them. That's what we do here.
Of course, you run into problems if Condor starts another job within a
minute of the previous job finishing...
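As a rough illustration of such a periodic check (not the script we actually run), the following sketch scans /proc on Linux. It makes two assumptions that you would need to adapt: that a job is "supposed to be running" exactly when a process named condor_starter exists, and that all jobs run under one dedicated UID, so any process owned by that UID with no starter present is a stray. The function names are hypothetical.

```python
import os
import signal


def read_status(pid_dir):
    """Return (name, real_uid) from a /proc/<pid>/status file, or None."""
    name = uid = None
    try:
        with open(os.path.join(pid_dir, "status")) as fh:
            for line in fh:
                if line.startswith("Name:"):
                    name = line.split(None, 1)[1].strip()
                elif line.startswith("Uid:"):
                    uid = int(line.split()[1])  # first field is the real UID
    except (OSError, IndexError, ValueError):
        return None  # process vanished or file was unreadable
    if name is None or uid is None:
        return None
    return name, uid


def scan(proc_root="/proc"):
    """List (pid, name, uid) for every process visible under proc_root."""
    procs = []
    for entry in os.listdir(proc_root):
        if entry.isdigit():
            info = read_status(os.path.join(proc_root, entry))
            if info:
                procs.append((int(entry), info[0], info[1]))
    return procs


def stray_pids(procs, job_uid, starter_name="condor_starter"):
    """Return PIDs owned by job_uid, or [] if a starter is still present."""
    # Assumption: a running starter means a job is legitimately active.
    if any(name == starter_name for _, name, _ in procs):
        return []
    return [pid for pid, _, uid in procs if uid == job_uid]


def cleanup(job_uid, proc_root="/proc"):
    """Kill every stray process owned by job_uid."""
    for pid in stray_pids(scan(proc_root), job_uid):
        try:
            os.kill(pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # already gone
```

A cron entry would then call cleanup() once a minute with the UID of the job account (for example, the UID of a hypothetical "condorjob" user). The race described above still applies: a job that starts between the scan and the kill is not protected.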
Perhaps the simplest solution (using the USER_JOB_WRAPPER feature) is to
have a wrapper that kills any stray processes left behind by the previous
job when a new job starts. That assumes that all jobs run under the same
UID, of course, or else you have to get clever with something like sudo or
userv (*) or sud (**). But if all jobs run under the same UID, you might
as well use dedicated user accounts as I mentioned in my previous reply.
(*) http://www.chiark.greenend.org.uk/~ian/userv/
(**) http://sud.sourceforge.net/
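A minimal sketch of such a wrapper follows, assuming Linux (/proc), that Condor invokes the wrapper with the job's executable and arguments as its own arguments, and that all jobs share one UID that is used for nothing else. The helper names are hypothetical. The wrapper spares itself and its own ancestors, in case the starter runs under the same UID, kills everything else owned by its UID, and then execs the job.

```python
import os
import signal


def read_proc(pid, proc_root="/proc"):
    """Return (ppid, real_uid) for pid from its status file, or None."""
    ppid = uid = None
    try:
        with open(os.path.join(proc_root, str(pid), "status")) as fh:
            for line in fh:
                if line.startswith("PPid:"):
                    ppid = int(line.split()[1])
                elif line.startswith("Uid:"):
                    uid = int(line.split()[1])
    except (OSError, IndexError, ValueError):
        return None
    if ppid is None or uid is None:
        return None
    return ppid, uid


def snapshot(proc_root="/proc"):
    """Map pid -> (ppid, uid) for every visible process."""
    table = {}
    for entry in os.listdir(proc_root):
        if entry.isdigit():
            info = read_proc(entry, proc_root)
            if info:
                table[int(entry)] = info
    return table


def ancestors(table, pid):
    """Return pid together with all of its ancestors found in table."""
    chain = set()
    while pid in table and pid not in chain:
        chain.add(pid)
        pid = table[pid][0]
    return chain


def leftovers(table, my_pid, my_uid):
    """PIDs owned by my_uid that are neither this process nor an ancestor."""
    keep = ancestors(table, my_pid)
    return sorted(pid for pid, (_, uid) in table.items()
                  if uid == my_uid and pid not in keep)


def main(argv):
    """Kill the previous job's leftovers, then become the new job."""
    if len(argv) < 2:
        return 2  # no job command was given
    for pid in leftovers(snapshot(), os.getpid(), os.getuid()):
        try:
            os.kill(pid, signal.SIGKILL)
        except OSError:
            pass  # already gone, or not ours to kill
    os.execvp(argv[1], argv[1:])  # does not return on success
```

The wrapper script would finish by calling main() with sys.argv. Do not run this under an account that is also used interactively, since it kills every other process that account owns.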
For reference (in the 6.6 series):
USER_JOB_WRAPPER: http://www.cs.wisc.edu/condor/manual/v6.6/3_3Configuration.html#9119
STARTD_CRON: http://www.cs.wisc.edu/condor/manual/v6.6/3_3Configuration.html#8744
Hope that is of some use/interest,
-- Bruce
--
Bruce Beckles,
e-Science Specialist,
University of Cambridge Computing Service.