On Thu, 3 Nov 2011, Christopher Martin wrote:
So from what I can see it's like you say, it's as if the dagman can't tell that the jobs have completed successfully. The job logs do indicate completion though. I'm wondering, do the jobs all have to log to the same log file? Currently I have each job logging to its own log file. All logs for both the jobs and the dagman are written to the same directory. I've included snippets from a dagman.out that show the state of things before and after the schedd restart.
It's fine to have any combination of jobs logging to their own log files vs. jobs logging to a common log file. It's important, though, that jobs in separate DAGs not share log files (unless you're 100% sure the DAGs won't be run at the same time).
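As a rough way to check for that second case, here is a minimal Python sketch that finds log files used by more than one DAG. It assumes only the basic "JOB <node> <submit file>" line in the DAG file and a plain "log = <path>" line in each submit file; the function names are made up for illustration, and it does not expand macros, honor DIR, or resolve relative paths, so treat it as a starting point rather than a definitive check.

```python
import re
from collections import defaultdict

# "JOB <node name> <submit file>" lines in a DAG file (keyword match is case-insensitive here).
JOB_RE = re.compile(r"^\s*JOB\s+(\S+)\s+(\S+)", re.IGNORECASE)
# "log = <path>" lines in a submit file; will not match other commands such as log_xml.
LOG_RE = re.compile(r"^\s*log\s*=\s*(\S.*?)\s*$", re.IGNORECASE)


def parse_dag_jobs(dag_text):
    """Map node name -> submit file for each JOB line in the DAG file text."""
    jobs = {}
    for line in dag_text.splitlines():
        m = JOB_RE.match(line)
        if m:
            jobs[m.group(1)] = m.group(2)
    return jobs


def parse_submit_log(submit_text):
    """Return the value of the first 'log' command in the submit file text, or None."""
    for line in submit_text.splitlines():
        m = LOG_RE.match(line)
        if m:
            return m.group(1)
    return None


def logs_shared_across_dags(dag_logs):
    """Given {dag name: set of log paths}, return {log path: sorted DAG names}
    for every log path that appears under more than one DAG."""
    users = defaultdict(set)
    for dag, logs in dag_logs.items():
        for log in logs:
            users[log].add(dag)
    return {log: sorted(dags) for log, dags in users.items() if len(dags) > 1}
```

To use it, read each DAG file, look up the log path for every submit file it names, and pass the resulting per-DAG sets to logs_shared_across_dags(). Jobs inside a single DAG that share a log are not reported, since that combination is fine.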
Can you send the following files?

  * dagman.out
  * the actual dag file
  * the node job log files

If you do that, I'll take a look in more detail and see what I can figure out.
From your original email, it sounds like this problem happens consistently when your schedd restarts -- is that right? If so, that eliminates the things that would be my first guesses as to the problem (e.g., some kind of transient log file reading error).
Kent Wenger
Condor Team