Re: [Condor-users] Strange DAGMan behaviour
- Date: Wed, 19 Oct 2005 09:35:18 -0500 (CDT)
- From: "R. Kent Wenger" <wenger@xxxxxxxxxxx>
- Subject: Re: [Condor-users] Strange DAGMan behaviour
On Tue, 18 Oct 2005, Mark Fox wrote:
> I ran into some strange DAGMan behaviour in a software system I
> maintain. The system submits a DAG that may recursively submit
> another DAG, and so on. The problem is that the execution of one of
> the DAGs eventually fails without ever having run. The first DAG
> always succeeds, but the following DAGs seem to have about a 50-50
> chance of success. Sometimes it will iterate several times, but most
> of the time, it fails on the first or second iteration. In the DAGMan
> log for the last DAG, I get this:

Several questions:

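- When you say that you are doing recursion, are you re-submitting the
same DAG file or a different DAG file? If you're re-submitting the same
DAG file, it's not surprising that you're running into problems: each
DAGMan instance names its own files (the dagman.out file, the lock file,
and so on) after the DAG file, so two instances running from the same
DAG file would end up sharing them.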
- Do you get a dagman.out file for the DAG that fails? It would help
a lot if we could see that.
- If you take one of the DAGs that fails as a subdag, and just run it on
its own, does it still sometimes fail?

> 005 (518.000.000) 10/18 15:42:55 Job terminated.
> (0) Abnormal termination (signal 9)
> (0) No core file
> Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
> Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
> Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
> Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
> 0 - Run Bytes Sent By Job
> 0 - Run Bytes Received By Job
> 0 - Total Bytes Sent By Job
> 0 - Total Bytes Received By Job

You're saying that job 518.0 is one of the condor_dagman jobs, right?
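
One way to check which job a termination event belongs to is to read the cluster and proc IDs out of the event header. The following is only a minimal sketch, assuming the log lines look like the ones quoted above; the function name killed_jobs and both regular expressions are illustrative and not part of Condor itself.

```python
import re

# Header of a user-log event as quoted above, e.g.
# "005 (518.000.000) 10/18 15:42:55 Job terminated."
# Assumption: three-digit event code, then (cluster.proc.subproc).
EVENT_RE = re.compile(r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\)")
SIGNAL_RE = re.compile(r"Abnormal termination \(signal (\d+)\)")


def killed_jobs(lines):
    """Return (cluster, proc, signal) for each 005 event that ended on a signal."""
    results = []
    current = None  # (cluster, proc) of the 005 event being read, if any
    for raw in lines:
        # Drop email quote markers ("> ") and surrounding whitespace.
        line = raw.lstrip("> ").strip()
        header = EVENT_RE.match(line)
        if header:
            code, cluster, proc, _subproc = header.groups()
            current = (int(cluster), int(proc)) if code == "005" else None
            continue
        sig = SIGNAL_RE.search(line)
        if sig and current is not None:
            results.append((current[0], current[1], int(sig.group(1))))
            current = None
    return results
```

Run against the event quoted above, this would report cluster 518, proc 0, signal 9, which can then be compared with the cluster ID that condor_submit_dag printed for the DAG in question.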

Kent Wenger
Condor Team