
Re: [Condor-users] condor_dagman.exe in idle after submit jobs completed



On Wed, 15 Jun 2011, Michael O'Donnell wrote:

> Thank you for your comments. It looks like I am going to have to spend a
> lot more time investigating this because it is not evident what has
> happened. Most of the jobs did complete, but something happened to the
> communication between the jobs and the condor_dagman.exe. I do not know
> the communication process yet, but I did not see any errors in the dagman
> log or anything. Basically the dagman went into recovery mode and could
> never exit this recovery loop. When it went into recovery mode it
> generated this file: dprintf_failure.DAGMAN. If I delete the file it would
> generate it again on the next recovery attempt.
>
> When I released the condor_dagman job, a recovery file was not generated.
> I then tried to rerun the dag and the following happened:
> dprintf_failure.DAGMAN was generated again
> condor_dagman job went into idle
> no dag jobs were submitted
> condor_dagman.exe would not exit without forcing it

Hmm, something else to check: is your disk full? And are file permissions set to reasonable values? (DAGMan monitors the node jobs by reading their user log files.)

Also, what are the contents of the dprintf_failure.DAGMAN file?
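
If it helps, the checks above could be scripted roughly as follows. This is
only a sketch, not a Condor tool: the function names, the 1 MB free-space
threshold, and the assumption that dprintf_failure.DAGMAN sits in the DAG's
working directory are all made up for illustration. Note also that
os.access() is only an approximation of the real permissions, particularly
on Windows.

```python
import os
import shutil


def check_dag_dir(dag_dir, min_free_bytes=1024 * 1024):
    """Return a list of problems found in dag_dir (empty list if none).

    min_free_bytes is an arbitrary illustrative threshold, not a Condor setting.
    """
    problems = []

    # Is the disk holding the DAG directory (nearly) full?
    usage = shutil.disk_usage(dag_dir)
    if usage.free < min_free_bytes:
        problems.append(f"low disk space: {usage.free} bytes free")

    # DAGMan has to write its own files into this directory.
    if not os.access(dag_dir, os.W_OK):
        problems.append(f"directory not writable: {dag_dir}")

    # DAGMan reads the node jobs' user log files, so flag unreadable files.
    for name in sorted(os.listdir(dag_dir)):
        path = os.path.join(dag_dir, name)
        if os.path.isfile(path) and not os.access(path, os.R_OK):
            problems.append(f"file not readable: {path}")

    return problems


def read_failure_file(dag_dir, name="dprintf_failure.DAGMAN"):
    """Return the text of the failure file, or None if it does not exist."""
    path = os.path.join(dag_dir, name)
    if not os.path.isfile(path):
        return None
    with open(path, "r", errors="replace") as fh:
        return fh.read()
```

Calling check_dag_dir() on the directory you submitted the DAG from, and
printing the result of read_failure_file() for the same directory, would
cover both questions.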

Kent Wenger
Condor Team