On Mon, 21 Dec 2009, dawnsong wrote:
It is finally fixed. I set all the nodes to share the same log file. According to the Condor 7.3 manual, DAGMan supports separate logs for separate nodes, but it seems that having all nodes share one log file lets DAGMan run without complaining about "ERROR: failure to read job log". This confused me, since I have already upgraded to 7.4.
From your earlier email, it sounds like your log file(s) are on NFS; is that correct? If so, that's most likely the source of your problems.

When you upgraded the DAGMan version, did you re-run the DAG from scratch, or did you run it in recovery mode? Do you still have the log file that generated the error? If so, I'd like to take a look at it. I'm guessing that if the file was on NFS, you got corrupted events because of two events being written at the same time.
Kent Wenger
Condor Team