Re: [HTCondor-users] Completed jobs stuck on node.

Date: Wed, 21 Aug 2013 11:33:01 -0500
From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Completed jobs stuck on node.

On 8/21/2013 8:27 AM, Michael Murphy wrote:

At the end of the report it looks like pid 30774 was spawned but never
exited.  This is confirmed by ps

$ ps -aux | grep condor_
condor   30467  0.0  0.0  91808  5468 ?        Ss   Aug08   0:06 /usr/sbin/condor_master -pidfile /var/run/condor/condor.pid
condor   30468  0.0  0.0  92000  7668 ?        Ss   Aug08   0:31 condor_startd -f
root     30502  0.0  0.0  23280  2928 ?        S    Aug08   0:22 condor_procd -A /var/run/condor/procd_pipe.STARTD -L /var/log/condor/ProcLog.STARTD -R 10000000 -S 60 -C 104
condor   30770  0.0  0.0  91172  6980 ?        Ss   Aug08   0:00 condor_starter -f -a slot2 192.168.1.93
nobody   30774  353  5.9 3879252 2941908 ?     SNsl Aug08 3226:37 condor_exec.exe <name removed>

So your suspicions were correct. How would I fix this?  The program vlox
(the stuck job's executable) completes normally outside of condor when the
same batch of 20 is run manually.


Some thoughts:

Maybe on a stuck job, try running condor_ssh_to_job. This would allow you to nose around in the environment of the job, look at all files, attach with a debugger, or do whatever it takes to figure out what is happening in real time.
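
Once inside the job's environment (or directly on the execute node), a few read-only checks can show whether the process is busy or blocked. The following is only a sketch, assuming a Linux execute node with /proc mounted. The script name inspect_pid.sh is made up, and reading another user's fd directory needs the same account or root. For the case above, the PID to pass would be 30774.

```shell
#!/bin/sh
# Sketch: summarize what a process is doing, using only /proc (Linux-only assumption).
# Usage: sh inspect_pid.sh <pid>   (defaults to this shell's own PID for a dry run)
pid="${1:-$$}"

# State (R = running, S = sleeping, D = uninterruptible wait) and thread count.
grep -E '^(State|Threads):' "/proc/$pid/status"

# Open file descriptors: fd 0 shows where stdin points, fd 1 and 2 where output goes.
ls -l "/proc/$pid/fd"
```

If fd 0 points at a terminal or pipe and the state stays at S, waiting for input becomes a plausible explanation. A process that keeps accumulating CPU time is more likely looping.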

Whenever I hear "it runs outside HTCondor fine, but fails when HTCondor runs it", I immediately think permissions, ownerships, and environment variables. These are what usually differ between a program running inside vs. outside HTCondor. For instance, when you are testing outside of HTCondor you are running as user "michael" (or whatever), but from the above it looks like HTCondor is configured to run your jobs as user "nobody". Try su-ing to nobody and see if your program works. Re environment variables, try putting "getenv = True" in your submit file to pick up all your environment variables.
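
One way to test the environment theory before resubmitting is to run the program by hand with a stripped environment. The sketch below uses `env -i`. The helper name run_clean and the variable MY_SETTING are made up for illustration, and the minimal PATH is an assumption, not the exact environment HTCondor hands to a job.

```shell
#!/bin/sh
# Sketch: run a command with an almost empty environment to mimic a job
# submitted without getenv. run_clean is a hypothetical helper name.
run_clean() {
    env -i PATH=/usr/bin:/bin "$@"
}

# Demonstration with a made-up variable: visible normally, gone in the clean run.
MY_SETTING=from_login_shell
export MY_SETTING
sh -c 'echo "normal: ${MY_SETTING:-<unset>}"'
run_clean sh -c 'echo "clean:  ${MY_SETTING:-<unset>}"'
```

With the real program, something like `run_clean ./vlox` (the path is a guess), optionally prefixed with `sudo -u nobody` to match the account, would show whether a missing variable or permission is what trips it up.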

Another thought: try condor_submit -i <submit file>. This will start an interactive login shell on an execute node with the exact same setup that HTCondor uses to run your jobs (you will be user nobody, same environment, permissions, etc.). Try running your job interactively that way and see if you discover why your program is hanging.

Maybe it encountered some error condition and is sitting around waiting for console input? If you do not already use stdin with your job, maybe create a file full of "yes" or whatever and specify this file with "input = filename" to have HTCondor use it as stdin?
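
Building such a file is a one-liner. This sketch assumes the program would accept "yes" as an answer; the file name answers.txt and the count of 1000 lines are arbitrary choices.

```shell
#!/bin/sh
# Sketch: create a file of repeated "yes" answers to feed the job's stdin.
# The text, line count, and file name are arbitrary placeholders.
yes yes | head -n 1000 > answers.txt

# Sanity check: print the line count and the first line.
wc -l < answers.txt
head -n 1 answers.txt
```

The submit file would then carry the line "input = answers.txt".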

Sorry if the above is not very specific, but I am not sure what else HTCondor could do here. If your program is sitting around not exiting, one would hope it is writing something to either stdout or stderr. One last idea: maybe submit some jobs and specify streaming stdout/stderr by placing

  stream_output = True
  stream_error = True
  output = myjob.out
  error = myjob.err

in your submit file. The advantage of streaming stdout/stderr in real time back to the submit machine is that it will be flushed often. This way, if your program is giving some clue as to why it is not exiting, it won't be cached in some stdio buffer someplace where you cannot see it.
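
For reference, a complete debugging submit file along these lines might look like the sketch below, written out with a shell here-document. The file names debug.sub and myjob.log and the bare executable name vlox are assumptions; adjust them to the real job.

```shell
#!/bin/sh
# Sketch: write a submit description that streams stdout/stderr back as the job runs.
# File names are placeholders; the submit commands themselves are standard.
cat > debug.sub <<'EOF'
executable    = vlox
getenv        = True
output        = myjob.out
error         = myjob.err
log           = myjob.log
stream_output = True
stream_error  = True
queue
EOF

# Then, on the submit machine (requires HTCondor, so left commented out here):
# condor_submit debug.sub
# tail -f myjob.out myjob.err
```

Watching the two output files with tail while the job runs should show the last thing the program printed before it stopped making progress.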


Hope the above random thoughts help, please let us know what you figure out,
Todd
