On Fri, 5 Aug 2005, Alexander Dietz wrote:
I was running a DAG on my submit machine (Red Hat Enterprise Linux
AS release 3, Condor version 6.7.8), while all the jobs were to be
executed on a remote machine (Fedora Core release 3 (Heidelberg), Condor
version 6.7.8). Almost the whole DAG completed, but then DAGMan
aborted. Here are the last few lines from the dagman.out file:
...
8/4 22:04:37 ERROR "Assertion ERROR on (job->GetStatus() ==
Job::STATUS_POSTRUN || recovery)" at line 772 in file dag.C
The user proxies on both machines were still valid for a long time, and
DAGMan aborted without creating a rescue DAG. Is there possibly
a bug in the file dag.C, or what's going on?
Yes, you hit a known bug in DAGMan. The fix is in 6.7.10, which should be
coming out within a few days.