Kent,

Thanks for helping me dig into this.

On 10/19/05, R. Kent Wenger <wenger@xxxxxxxxxxx> wrote:
> - When you say that you are doing recursion, are you re-submitting the
> same DAG file or a different DAG file? If you're re-submitting the same
> DAG file, it's not surprising that you're running into problems.

Nope. It's a different DAG each time. I've confirmed that the only
files in common between DAGs are the scripts (i.e. "Executable = script.pl"
in the job files).

> - Do you get a dagman.out file for the DAG that fails? It would help
> a lot if we could see that.

No problem. I've attached two. The first (test.dag.1.dagman.out)
worked and the second (test.dag.2.dagman.out) failed.

> - If you take one of the DAGs that fails as a subdag, and just run it on
> its own, does it still sometimes fail?

Nope. It works all the time. Of course, one of *its* subdags will
usually fail.

> > 005 (518.000.000) 10/18 15:42:55 Job terminated.
> >     (0) Abnormal termination (signal 9)
> >     (0) No core file
> >         Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
> >         Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
> >         Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
> >         Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
> >     0 - Run Bytes Sent By Job
> >     0 - Run Bytes Received By Job
> >     0 - Total Bytes Sent By Job
> >     0 - Total Bytes Received By Job
>
> You're saying that job 518.0 is one of the condor_dagman jobs, right?

Yes. That came from a dagman.log file, and it seems like only
condor_dagman jobs get logged there. Here's a full dagman.log from a
DAG that failed (it corresponds with test.dag.2.dagman.out above):

000 (562.000.000) 10/19 13:25:55 Job submitted from host: <136.159.220.105:48532>
...
001 (562.000.000) 10/19 13:25:55 Job executing on host: <136.159.220.105:48532>
...
005 (562.000.000) 10/19 13:25:55 Job terminated.
    (0) Abnormal termination (signal 9)
    (0) No core file
        Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
        Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
    0 - Run Bytes Sent By Job
    0 - Run Bytes Received By Job
    0 - Total Bytes Sent By Job
    0 - Total Bytes Received By Job
...

Mark
Attachment: test.dag.1.dagman.out
Attachment: test.dag.2.dagman.out