Re: [Condor-users] Condor-related signal 11
- Date: Thu, 28 Feb 2008 00:36:41 -0600
- From: Jaime Frey <jfrey@xxxxxxxxxxx>
- Subject: Re: [Condor-users] Condor-related signal 11
On Feb 26, 2008, at 6:03 PM, Nickolas Fotopoulos wrote:
> I have a very mysterious problem that I suspect points to a problem in
> our Condor configuration or a bug in Condor.
> * I submit a DAG and several of the jobs come back failing with signal
>   11. The jobs' .err and .out files are empty.
> * I run locally and a job succeeds
> * I run with condor_run and a job succeeds
> * I rsh to a node that gave a signal 11 and a job succeeds
> * Attaching an strace to the process shows that it dies
>   mid-computation, not during any I/O or anything.
> So the only way to get the signal 11 is to run the job through
> dagman. I believe we're running Condor 6.9.4 with the dagman 7.0
> binaries pre-released to LIGO (this is the LIGO Nemo cluster at UWM).
> Any and all help would be appreciated.
Have you tried submitting the job directly to Condor with condor_submit,
using the same submit description file that the DAG refers to? That
should more closely resemble how the job is run under DAGMan.
Also try wrapping the job in a script that prints out the environment.
Then duplicate that exact environment on one of the execution machines
and run the job by hand.
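
A wrapper along these lines might look like the sketch below. It is only
an illustration, not something from the original thread: the file name
env_wrapper.py and the helper format_environment are hypothetical, and it
assumes a Python 3 interpreter is available on the execute nodes. It
writes the environment and working directory to stderr (so they end up in
the job's .err file) and then replaces itself with the real job, so the
job's signals and exit status are reported unchanged.

```python
#!/usr/bin/env python3
"""Hypothetical wrapper: record the environment, then run the real job."""
import os
import sys


def format_environment(env):
    """Return the mapping as sorted NAME=value lines, one per variable."""
    return "".join(f"{name}={value}\n" for name, value in sorted(env.items()))


def main(argv):
    # argv[1] is the real job executable, argv[2:] are its arguments.
    # Write to stderr so the dump lands in the job's .err file.
    sys.stderr.write(format_environment(os.environ))
    sys.stderr.write(f"cwd={os.getcwd()}\n")
    sys.stderr.flush()
    # Replace this process with the real job; nothing after this line
    # runs unless the exec itself fails (which raises OSError).
    os.execvp(argv[1], argv[1:])


# Only act when a job command was actually supplied on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv)
```

To use it, the wrapper would be named as the executable in the submit
description file and the real job and its arguments passed as the
wrapper's arguments. How the real executable reaches the execute node
depends on the pool's setup (shared filesystem or file transfer). The
sorted NAME=value output from a failing DAGMan run can then be compared
line by line against the same dump taken from a run that succeeds.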
Thanks and regards,
Jaime Frey
UW-Madison Condor Team