Re: [Condor-users] condor-g job status unknown caused dagman to exit
- Date: Tue, 20 Apr 2010 11:59:10 -0500 (CDT)
- From: "R. Kent Wenger" <wenger@xxxxxxxxxxx>
- Subject: Re: [Condor-users] condor-g job status unknown caused dagman to exit
On Tue, 20 Apr 2010, Peter Doherty wrote:
> To the Condor experts,
>
> I'm using Condor-G, and this job (see logs below) caused a large DAG to exit
> unexpectedly this morning. I understand the concept of BAD EVENTS (in this
> case, the job started executing after it was supposed to be done),
> but I want to know how to prevent this from happening.
>
> From what I can see, the job's state became unknown (what causes this? I see
> it happen a lot, actually), and then Condor abandoned the job. But then the
> job called back home, and instead of just ignoring the call, Condor heard it
> and got confused, and DAGMan exited.
>
> It seems like a bug to me.
You could try working around it by setting a non-default value for
DAGMAN_ALLOW_EVENTS (see
http://www.cs.wisc.edu/condor/manual/v7.4/3_3Configuration.html#19172
in the manual).
I'm actually not quite sure what DAGMan will do if it doesn't immediately
consider the execute event "bad" -- it might still run into problems
farther along, but it seems worth trying.
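
As a rough illustration of what a non-default value looks like: DAGMAN_ALLOW_EVENTS is documented as a bitwise OR of flags, with a default of 114. The Python sketch below only composes such a value and prints the matching configuration line. The flag names are made up for readability (the manual gives numbers, not names), and the guess that bit 4 ("job re-run after terminated event") is the one relevant to an execute event arriving after a terminal event is an assumption to check against the manual for your Condor version.

```python
from enum import IntFlag


class AllowEvents(IntFlag):
    """Bit values as described for DAGMAN_ALLOW_EVENTS; names are hypothetical."""
    ALL_EXCEPT_RERUN = 1          # allow all bad events except job re-run after terminated
    TERMINATED_ABORTED = 2        # allow terminated/aborted event combination
    RERUN_AFTER_TERMINATED = 4    # allow "job re-run after terminated event"
    GARBAGE_OR_ORPHAN = 8         # allow garbage or orphan events
    EXECUTE_BEFORE_SUBMIT = 16    # allow execute/terminate event before submit event
    DOUBLE_TERMINATED = 32        # allow two terminated events per job
    DUPLICATED = 64               # allow duplicated events in general


# The documented default (114) is the OR of these four flags.
DEFAULT = (AllowEvents.TERMINATED_ABORTED
           | AllowEvents.EXECUTE_BEFORE_SUBMIT
           | AllowEvents.DOUBLE_TERMINATED
           | AllowEvents.DUPLICATED)


def config_line(flags: AllowEvents) -> str:
    """Render the setting as it would appear in a Condor configuration file."""
    return f"DAGMAN_ALLOW_EVENTS = {int(flags)}"


if __name__ == "__main__":
    # Assumption: adding bit 4 to the default is the relevant change here.
    print(config_line(DEFAULT | AllowEvents.RERUN_AFTER_TERMINATED))
```

Note that the manual warns against tolerating the "job re-run after terminated event" case in general, because it can break the semantics of the DAG, so a value like this is a workaround to try, not a recommended setting.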
It seems like the real question is whether this is something that should
be dealt with at the DAGMan level or the Condor level -- we'll have to
discuss that on our end. If we deal with it at the DAGMan level, we'd
have to allow a new job state transition -- from completed/failed to
executing to possibly completed/succeeded.
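
To make the proposed transition concrete, here is a minimal Python sketch of a per-job state table with and without the extra failed-to-executing edge. This is not DAGMan's actual code or its real state names; the states, the replay function, and the allow_late_execute switch are all hypothetical and exist only to show what "allowing a new transition" would mean.

```python
# Hypothetical strict table: once a job has succeeded or failed, nothing else
# may happen to it, so a late execute event counts as a bad event.
STRICT = {
    "submitted": {"executing", "failed"},
    "executing": {"succeeded", "failed"},
    "succeeded": set(),
    "failed": set(),
}


def transition_table(allow_late_execute=False):
    """Return a copy of the table, optionally with the failed -> executing edge."""
    table = {state: set(targets) for state, targets in STRICT.items()}
    if allow_late_execute:
        table["failed"].add("executing")
    return table


def replay(events, allow_late_execute=False):
    """Apply a sequence of target states; raise ValueError on a disallowed one."""
    table = transition_table(allow_late_execute)
    state = "submitted"
    for event in events:
        if event not in table[state]:
            raise ValueError(f"bad event: {state} -> {event}")
        state = event
    return state


if __name__ == "__main__":
    # The sequence from this thread: job abandoned, then it starts executing
    # after all, then possibly completes successfully.
    late = ["failed", "executing", "succeeded"]
    print(replay(late, allow_late_execute=True))
```

With the strict table the same sequence raises an error at the execute event, which corresponds to the DAG exiting on a bad event.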
Kent Wenger
Condor Team