Re: [HTCondor-users] Completed jobs stuck on node.
- Date: Wed, 21 Aug 2013 11:33:01 -0500
- From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] Completed jobs stuck on node.
On 8/21/2013 8:27 AM, Michael Murphy wrote:
> At the end of the report it looks like pid 30774 was spawned but never
> exited. This is confirmed by ps
> $ ps -aux | grep condor_
> condor 30467 0.0 0.0 91808 5468 ? Ss Aug08 0:06
> /usr/sbin/condor_master -pidfile /var/run/condor/condor.pid
> condor 30468 0.0 0.0 92000 7668 ? Ss Aug08 0:31
> condor_startd -f
> root 30502 0.0 0.0 23280 2928 ? S Aug08 0:22
> condor_procd -A /var/run/condor/procd_pipe.STARTD -L
> /var/log/condor/ProcLog.STARTD -R 10000000 -S 60 -C 104
> condor 30770 0.0 0.0 91172 6980 ? Ss Aug08 0:00
> condor_starter -f -a slot2 192.168.1.93
> nobody 30774 353 5.9 3879252 2941908 ? SNsl Aug08 3226:37
> condor_exec.exe <name removed>
> So your suspicions were correct. How would I fix this? The program vlox
> (stuck job executable) completes normally outside of condor for the same
> batch of 20 run manually.
Some thoughts:
Maybe on a stuck job, try running condor_ssh_to_job - this would allow you to
nose around in the environment of the job, look at all files, attach
with a debugger, do whatever to figure out what is happening in real-time.
Whenever I hear "it runs outside HTCondor fine, but fails when HTCondor
runs it", I immediately think permissions, ownerships, and environment
variables. This is what is usually different between a program running
inside vs outside HTCondor. For instance, when you are testing outside
of HTCondor you are running as user "michael" (or whatever), but from
the above it looks like HTCondor is configured to run your jobs as user
"nobody". Try su-ing to nobody and see if your program works. Re
environment variables, try doing "getenv=True" in your submit file to
pick up all your environment variables.
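To follow up on the environment-variable angle, it can help to compare what
is set in your login shell against what the job actually sees. The sketch
below assumes two dumps in NAME=value form, for instance from running "env"
in each place. The function name env_only_in and the file names are made up
for illustration and are not part of HTCondor.

```shell
# env_only_in FILE_A FILE_B
# Print the names of variables that appear in FILE_A but not in FILE_B.
# Both files are assumed to hold one NAME=value entry per line.
env_only_in() {
    a=$(mktemp) && b=$(mktemp) || return 1
    cut -d= -f1 "$1" | sort -u > "$a"   # variable names from the first dump
    cut -d= -f1 "$2" | sort -u > "$b"   # variable names from the second dump
    comm -23 "$a" "$b"                  # names present only in the first dump
    rm -f "$a" "$b"
}

# Hypothetical usage:
#   env > env.interactive      (run in your own login shell)
#   env > env.job              (run from inside the job's environment)
#   env_only_in env.interactive env.job
```

Any name this prints is a variable your program has when run by hand but not
when run as a job, which makes it a candidate for the hang.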
Another thought - try condor_submit -i <submit file>. This will start
an interactive login/shell on an execute node with the exact same setup
that HTCondor uses to run your jobs (you will be user nobody, same
environment, permissions, etc). Try running your job interactively that
way and see if you discover why your program is hanging.
Maybe it encountered some error condition and is sitting around waiting
for console input? If you do not already use stdin with your job, maybe
create a file full of "yes" or whatever and specify this file with
"input=filename" to have HTCondor use this file as stdin?
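A minimal sketch of building such an answer file. The name answers.txt and
the count of 1000 lines are arbitrary assumptions, and whether "yes" is the
right reply depends on what the program actually prompts for.

```shell
# Write 1000 lines of "yes" to answers.txt (hypothetical file name).
i=0
while [ "$i" -lt 1000 ]; do
    echo yes
    i=$((i + 1))
done > answers.txt

# Then, in the submit file, point the job's stdin at it:
#   input = answers.txt
```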
Sorry if the above is not very specific, but I am not sure what else HTCondor
could do here.... if your program is sitting around not exiting, one
would hope it is writing something to either stdout or stderr. One last
idea - maybe submit some jobs and specify streaming stdout/err by placing
stream_output = True
stream_error = True
output = myjob.out
error = myjob.err
in your submit file ... the advantage to streaming the stdout/err in
realtime back to the submit machine is it will be flushed often. This
way if your program is giving some clue as to why it is not exiting, it
won't be cached in some stdio buffer someplace where you cannot see it.
Hope the above random thoughts help, please let us know what you figure out,
Todd