On Tue, 30 Jan 2007, Robert Mortensen wrote:
I'm having a problem with dagman on an all Windows XP pool. Basically
what happens, occasionally, is that a dagman job exits before
completing all nodes. It is then restarted and it completes the
remaining nodes, but then hangs waiting, I think, for some "phantom"
node to complete. There are three problems:
1 - dagman appears to exit for no reason, with no errors in any logs
that I can find
2 - after recovering, dagman hangs after all the nodes have been
submitted and completed
3 - the delay in dagman recovering is nearly 1 hour
...
We're looking into this.
One thing that might help would be to also have the master.dag.dagman.log
and master.dag.lib.out files if you still have them.
Also, it would help if you increased the verbosity of the DAGMan output,
and sent the resulting dagman.out file when/if this happens again.
There are two separate verbosity controls (that control different output).
Please do the following:
- Add the setting '-debug 5' on your condor_submit_dag command line.
- Set the configuration macro DAGMAN_DEBUG to D_FULLDEBUG. You can do
  this in a couple of ways:
  - Put 'DAGMAN_DEBUG = D_FULLDEBUG' into an appropriate configuration
    file.
  - Set _CONDOR_DAGMAN_DEBUG to D_FULLDEBUG in your environment before
    running condor_submit_dag.
- You can address problem number 3 (the delay in dagman recovering) by
  setting the DAGMAN_NOT_RESPONDING_TIMEOUT configuration macro to a
  value shorter than the default (which is 3600 seconds).
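
As a rough sketch, the configuration-file side of the steps above might
look like the fragment below. The timeout of 600 seconds is only an
illustrative value, not a recommendation; pick whatever suits your pool.

```
# Condor configuration fragment (e.g. in a local config file).

# Full debug output from DAGMan, as described above.
DAGMAN_DEBUG = D_FULLDEBUG

# Assumed example value: 600 seconds instead of the 3600-second default.
DAGMAN_NOT_RESPONDING_TIMEOUT = 600
```

The '-debug 5' setting is separate and still goes on the command line,
for example 'condor_submit_dag -debug 5 master.dag' (the DAG file name
master.dag is assumed here from the log file names mentioned above).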
Kent Wenger
Condor Team