[Condor-users] Trouble running multithreaded job in vanilla universe
- Date: Wed, 24 Nov 2010 07:54:39 -0800 (PST)
- From: Christopher Whelan <whelanch@xxxxxxxxxxxx>
- Subject: [Condor-users] Trouble running multithreaded job in vanilla universe
Hi all,
I'm a new Condor user after our cluster switched from SGE, and I was
hoping someone might be able to help me out with some trouble I'm having
running one of my jobs. I've run a few jobs successfully so far, but I'm
having a lot of trouble getting one of my processes to run, and I'm
wondering if it's because of the multithreading used by the application I'm
running. My Condor executable is a bash script that launches a binary
application (a short read aligner, in case any of you are in
bioinformatics). My problem is that the job appears to be picked up and
run, but terminates immediately. The job output looks the same as it would
if I executed it from the command line and then pressed ^C immediately.
I've tried executing it manually on the machine it's being run on, and it
works there. As I said before, the application is multithreaded, and I'm
wondering if maybe the top-level thread goes to sleep while it waits for its
worker threads, and Condor thinks the job is done and interrupts it?
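To make that concrete, launchapp.sh is essentially just a thin wrapper along
these lines (the binary name, options, and input file here are placeholders,
not the real command line):

#!/bin/bash
# launchapp.sh, simplified -- "aligner" and its arguments stand in for
# the real short read aligner command line
./aligner --threads 4 reads.fastq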
Any advice anyone might have would be much appreciated - even tips on where to look to
diagnose the problem would be very helpful. Details are below.
Thanks in advance,
Chris
Unfortunately I don't have access to the application's source, so I can't see
exactly what it's doing thread-wise. Here's my job description file:
Executable = launchapp.sh
Universe = vanilla
output = job_out/launchapp.out
error = job_out/launchapp.error
Log = /tmp/whelanch_condor.log
Notification = Never
Initialdir = .
Queue
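One thing I was planning to try, in case the wrapper somehow returns before
the aligner is really finished, is making the script replace itself with the
binary so it can't exit early (again, "aligner" is just a placeholder):

#!/bin/bash
# variant of launchapp.sh: exec replaces the shell with the aligner, so
# the script cannot return before the aligner does and the job's exit
# status is the aligner's own
exec ./aligner --threads 4 reads.fastq

(Launching the aligner in the background and calling wait on its PID should
amount to the same thing, if exec turns out to be awkward.)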
In my job output file I get this, which is the same message I see if I
manually kill the application right after launching it from the command
prompt:
Interrupted..11
Obtained 0 stack frames.
The StarterLog looks like this:
11/23 18:15:57 ******************************************************
11/23 18:15:57 ** condor_starter (CONDOR_STARTER) STARTING UP
11/23 18:15:57 ** /usr/sbin/condor_starter
11/23 18:15:57 ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
11/23 18:15:57 ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
11/23 18:15:57 ** $CondorVersion: 7.4.4 Oct 13 2010 BuildID: 279383 $
11/23 18:15:57 ** $CondorPlatform: X86_64-LINUX_DEBIAN50 $
11/23 18:15:57 ** PID = 3185
11/23 18:15:57 ** Log last touched 11/23 17:55:10
11/23 18:15:57 ******************************************************
11/23 18:15:57 Using config source: /etc/condor/condor_config
11/23 18:15:57 Using local config sources:
11/23 18:15:57 /l2/condor/condor_config.cluster
11/23 18:15:57 /l2/condor/condor_config.eagle1
11/23 18:15:57 DaemonCore: Command Socket at <129.95.39.41:41009>
11/23 18:15:57 Done setting resource limits
11/23 18:15:57 Communicating with shadow <129.95.39.73:41785>
11/23 18:15:57 Submitting machine is "ostrich3.csee.ogi.edu"
11/23 18:15:57 setting the orig job name in starter
11/23 18:15:57 setting the orig job iwd in starter
11/23 18:15:57 Job 24.0 set to execute immediately
11/23 18:15:57 Starting a VANILLA universe job with ID: 24.0
11/23 18:15:57 IWD: /l2/users/whelanch/scripts/.
11/23 18:15:57 Output file: /l2/users/whelanch/scripts/./job_out/launchapp.out
11/23 18:15:57 Error file: /l2/users/whelanch/scripts/./job_out/launchapp.error
11/23 18:15:57 About to exec /l2/users/whelanch/scripts/launchapp.sh
11/23 18:15:57 Create_Process succeeded, pid=3186
11/23 18:15:57 Process exited, pid=3186, status=0
11/23 18:15:57 Got SIGQUIT. Performing fast shutdown.
11/23 18:15:57 ShutdownFast all jobs.
11/23 18:15:57 **** condor_starter (condor_STARTER) pid 3185 EXITING WITH STATUS 0
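What jumps out at me there is that the starter sees my wrapper exit with
status 0 in the same second it started it, just before the SIGQUIT. To check
whether the wrapper really returns that quickly outside of Condor (for
example, if the aligner forks itself into the background), I was going to
time it by hand, something like:

time ./launchapp.sh; echo "wrapper exit status: $?"
ps -ef | grep aligner    # placeholder name again -- is the aligner still running?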
Anywhere else I should look?