Kent,
Thanks for helping me dig into this.
On 10/19/05, R. Kent Wenger <wenger@xxxxxxxxxxx> wrote:
> - When you say that you are doing recursion, are you re-submitting the
> same DAG file or a different DAG file? If you're re-submitting the same
> DAG file, it's not surprising that you're running into problems.
Nope. It's a different DAG each time. I've confirmed that the only
files in common between DAGs are the scripts (i.e. "Executable =
script.pl" in the job files).
> - Do you get a dagman.out file for the DAG that fails? It would help
> a lot if we could see that.
No problem. I've attached two. The first (test.dag.1.dagman.out)
worked and the second (test.dag.2.dagman.out) failed.
> - If you take one of the DAGs that fails as a subdag, and just run it on
> its own, does it still sometimes fail?
Nope. It works all the time. Of course, one of *its* subdags will usually fail.
> > 005 (518.000.000) 10/18 15:42:55 Job terminated.
> > (0) Abnormal termination (signal 9)
> > (0) No core file
> > Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
> > Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
> > Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
> > Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
> > 0 - Run Bytes Sent By Job
> > 0 - Run Bytes Received By Job
> > 0 - Total Bytes Sent By Job
> > 0 - Total Bytes Received By Job
>
> You're saying that job 518.0 is one of the condor_dagman jobs, right?
Yes. That came from a dagman.log file, and it seems like only
condor_dagman jobs get logged there. Here's a full dagman.log from a
DAG that failed (it corresponds to test.dag.2.dagman.out above):
000 (562.000.000) 10/19 13:25:55 Job submitted from host: <136.159.220.105:48532>
...
001 (562.000.000) 10/19 13:25:55 Job executing on host: <136.159.220.105:48532>
...
005 (562.000.000) 10/19 13:25:55 Job terminated.
(0) Abnormal termination (signal 9)
(0) No core file
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
0 - Run Bytes Sent By Job
0 - Run Bytes Received By Job
0 - Total Bytes Sent By Job
0 - Total Bytes Received By Job
...
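
For reference, here is a minimal Python sketch of how a log like the one above could be scanned for jobs that were killed by a signal. It assumes the event layout matches this excerpt (a three-digit event code, the job ID in parentheses, and "..." between events). The names find_signal_terminations, EVENT_HEADER, SIGNAL_LINE and SAMPLE_LOG are made up for illustration and are not part of Condor.

```python
import re

# Assumed header layout, taken from the excerpt above:
#   005 (562.000.000) 10/19 13:25:55 Job terminated.
EVENT_HEADER = re.compile(r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\) (\S+) (\S+) (.*)$")
SIGNAL_LINE = re.compile(r"Abnormal termination \(signal (\d+)\)")

# Trimmed copy of the log from this message, used as sample input.
SAMPLE_LOG = """\
000 (562.000.000) 10/19 13:25:55 Job submitted from host: <136.159.220.105:48532>
...
001 (562.000.000) 10/19 13:25:55 Job executing on host: <136.159.220.105:48532>
...
005 (562.000.000) 10/19 13:25:55 Job terminated.
(0) Abnormal termination (signal 9)
(0) No core file
...
"""


def find_signal_terminations(log_text):
    """Return (cluster, proc, signal) for each 005 event that ended on a signal."""
    results = []
    current = None  # (event code, cluster, proc) of the event being read
    for raw_line in log_text.splitlines():
        line = raw_line.strip()
        if line == "...":
            # "..." closes the current event.
            current = None
            continue
        header = EVENT_HEADER.match(line)
        if header:
            current = (header.group(1), int(header.group(2)), int(header.group(3)))
            continue
        # Only inspect the body of "Job terminated" (005) events.
        if current and current[0] == "005":
            sig = SIGNAL_LINE.search(line)
            if sig:
                results.append((current[1], current[2], int(sig.group(1))))
    return results


if __name__ == "__main__":
    for cluster, proc, signal_number in find_signal_terminations(SAMPLE_LOG):
        print(f"job {cluster}.{proc} terminated by signal {signal_number}")
```

Run against several dagman.log files, something like this would show which condor_dagman jobs ended on a signal and which signal it was. It reports nothing for jobs that terminated normally.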
Mark
Attachment: test.dag.1.dagman.out (binary data)
Attachment: test.dag.2.dagman.out (binary data)