Re: [Condor-users] condor-g job status unknown caused dagman to exit




On 4/20/10 12:59 PM, R. Kent Wenger wrote:
> You could try working around it by setting a non-default value for
> DAGMAN_ALLOW_EVENTS
...
> I'm actually not quite sure what DAGMan will do if it doesn't immediately
> consider the execute event "bad" -- it might still run into problems
> farther along, but it seems worth trying.

With the default setting, we've had DAGs abort because of bad events
roughly once every 30k jobs, if I had to guess.  That works out to about
one in every 5 DAGs that we submit.  We could allow all bad events, keep
track of how often they happen, and see how things go.  The main thing
we'd need to avoid is double execution of the same job -- that could
easily leave two jobs writing to the same output file on completion,
which could lead to some confusing results.
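
To make the trade-off concrete, here is a small sketch that decodes a
DAGMAN_ALLOW_EVENTS value into the kinds of bad events it would tolerate.
The bit meanings are an assumption taken from my recollection of the
HTCondor manual and may differ between versions, so check them against
the documentation for the release in use. The function names are
hypothetical helpers, not part of Condor.

```python
# Assumed bit meanings for DAGMAN_ALLOW_EVENTS, recalled from the HTCondor
# manual. Verify against the manual for your version before relying on them.
ALLOW_EVENT_BITS = {
    1: "all bad events except 'job re-run after terminated event'",
    2: "terminated/aborted event combination",
    4: "'job re-run after terminated event'",
    8: "garbage or orphan events",
    16: "execute or terminate event before the job's submit event",
    32: "two terminated events per job",
    64: "duplicated events in general",
}


def describe_allow_events(value):
    """Return descriptions of the bad-event classes a value would allow."""
    if value == 0:
        return ["no bad events allowed"]
    return [desc for bit, desc in ALLOW_EVENT_BITS.items() if value & bit]


def allows_rerun_after_terminate(value):
    """True if the value tolerates a job re-running after it terminated."""
    return bool(value & 4)


if __name__ == "__main__":
    for candidate in (1, 5, 114):
        print(candidate, allows_rerun_after_terminate(candidate))
        for line in describe_allow_events(candidate):
            print("   ", line)
```

If those bit values hold, a setting of 1 would allow every bad event
except the re-run case, while 5 would allow that as well, so 1 looks like
the closer match for avoiding double execution.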

Anyway, if it reduces the number of aborted DAGs without introducing
other negative side-effects, it seems like the best plan is to allow all
bad events.

Ian