Re: [condor-users] Dagman stalling with shadow exception messages?
- Date: Wed, 07 Apr 2004 11:37:09 -0500
- From: Dan Bradley <dan@xxxxxxxxxxxx>
- Subject: Re: [condor-users] Dagman stalling with shadow exception messages?
Michael S. Root wrote:
> The only workaround seems to be to delete the dag
> job from the queue and re-submit the remaining jobs (which then proceed to
> run fine).
Do you mean that you are having to manually submit each of the remaining
jobs? DAGMan should be creating a rescue DAG when you remove it from
the queue (with condor_rm). You can run the rescue DAG and DAGMan will
submit jobs that were not successfully finished in the first attempt.
Of course, the real problem is why the DAG is not completing in the
first place, but I just want to make sure everything else is sane. If
DAGMan is in some crazy state where it can't even generate the rescue
DAG, then this is an important point.
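To check what a rescue DAG would still submit, the file can be scanned for nodes that are not yet marked as finished. The following is only a minimal sketch: it assumes the rescue file marks completed nodes by appending DONE to their JOB lines, which may differ between Condor versions, and the node and submit file names in the sample are made up for illustration.

```python
def unfinished_nodes(dag_text):
    """Return the names of JOB nodes that are not marked DONE."""
    pending = []
    for line in dag_text.splitlines():
        tokens = line.split()
        # Only JOB lines declare nodes; skip PARENT/CHILD lines, comments, blanks.
        if len(tokens) < 3 or tokens[0].upper() != "JOB":
            continue
        # tokens[1] is the node name, tokens[2] the submit file, the rest are options.
        options = [t.upper() for t in tokens[3:]]
        if "DONE" not in options:
            pending.append(tokens[1])
    return pending


# Hypothetical rescue DAG contents (not taken from the original thread).
sample = """\
JOB A a.sub DONE
JOB B b.sub
JOB C c.sub DONE
JOB D d.sub
PARENT A CHILD B C
PARENT B C CHILD D
"""

print(unfinished_nodes(sample))
```

If the list is empty although jobs clearly did not finish, or no rescue file exists at all, that would point to the "crazy state" case mentioned above.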
> 4/6 21:00:02 DaemonCore: received command 443 (RELEASE_CLAIM), calling
> handler (command_handler)
> 4/6 21:00:02 Error: can't find resource with capability
> (<192.168.1.111:32771>#7698602094)
> ----------------------
> Note: That last line puzzles me. I don't know what the #7698602094 refers
> to.
This is perfectly normal (both the message and the puzzlement).
Glancing at your two log files, it looks to me like the times don't
match up, so we can't see what happened on the execution side when the
shadow lost contact with the starter.
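One quick way to see whether two log excerpts cover the same period is to compare their first and last timestamps. This is a rough sketch, assuming every log entry starts with a "month/day hour:minute:second" stamp as in the lines quoted above; the year is not in the log and has to be supplied, and the shadow-side lines in the sample are hypothetical.

```python
from datetime import datetime


def log_time_range(lines, year=2004):
    """Return (earliest, latest) timestamps found in the log lines, or None."""
    stamps = []
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue
        try:
            stamp = datetime.strptime(
                f"{year}/{parts[0]} {parts[1]}", "%Y/%m/%d %H:%M:%S"
            )
        except ValueError:
            # Wrapped continuation lines carry no timestamp; skip them.
            continue
        stamps.append(stamp)
    if not stamps:
        return None
    return min(stamps), max(stamps)


def ranges_overlap(a, b):
    """True if the two (start, end) ranges share at least one instant."""
    return a[0] <= b[1] and b[0] <= a[1]


# Execution-side lines as quoted in this thread.
execute_side = [
    "4/6 21:00:02 DaemonCore: received command 443 (RELEASE_CLAIM), calling",
    "handler (command_handler)",
    "4/6 21:00:02 Error: can't find resource with capability",
]
# Hypothetical shadow-side lines, invented only to illustrate a mismatch.
shadow_side = [
    "4/6 18:12:40 hypothetical shadow log entry",
    "4/6 18:15:03 hypothetical shadow log entry",
]

print(ranges_overlap(log_time_range(execute_side), log_time_range(shadow_side)))
```

If the ranges do not overlap, the excerpts cannot show both ends of the same failed run attempt, and a longer section of one of the logs is needed.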
Whatever may have happened to cause the run attempt to fail, this
shouldn't have caused DAGMan to get stuck, but if you are seeing a
correlation, then there may be a problem.
Is there any chance that the disk containing the job state log file(s)
was ever full?
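A check along the following lines can show how much room is left on the file system that holds a log file. It is only a sketch: the log file name is a placeholder, and the result describes the disk as it is now, so it cannot prove whether the disk was full at some earlier point.

```python
import os
import shutil


def free_fraction(log_path):
    """Return the fraction of free space on the file system holding log_path."""
    directory = os.path.dirname(os.path.abspath(log_path))
    usage = shutil.disk_usage(directory)
    if usage.total == 0:
        # Some pseudo file systems report no capacity at all.
        return 0.0
    return usage.free / usage.total


# "job.log" is a placeholder; substitute the log file named in the submit files.
fraction = free_fraction("job.log")
print(f"{fraction:.1%} of the disk is free")
```

A very small value here would make a past disk-full episode more plausible, and it would be worth checking any system monitoring records for the time DAGMan stalled.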
--Dan