Re: [Condor-users] DAGman duplicating jobs on schedd restart



On Thursday, 3 November, 2011 at 9:38 AM, Christopher Martin wrote:
Whenever the schedd restarts, we get duplicate jobs showing up in the queue. For example, if we have a DAG like the following:

JOB A
JOB B
JOB C
JOB D

PARENT A CHILD B
PARENT B CHILD C D

Before the schedd restart, jobs A and B have completed and jobs C and D are queued. After the schedd restarts, C and D are still queued, but B has been added back into the queue as well. Is this a peculiarity of the DAG rescue, or could it be a conflict with the DAGMan logs?

Hi Chris,

What does DAGMan say? Is there anything in the DAG log that might be helpful?

I suspect the most likely cause for the resubmission of B is that DAGMan can't determine that it completed successfully. Is there an event in the job log that indicates that B completed?
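
If it helps, here is a rough Python sketch of one way to check that by hand. It assumes the classic plain-text user log layout: each event starts with a three-digit code and a (cluster.proc.subproc) ID, events are separated by a line containing "...", the 000 submit event carries a "DAG Node: <name>" line, and the 005 terminated event carries a "Normal termination (return value N)" line. The function name node_exit_codes is made up for illustration and is not part of Condor.

```python
import re

# Assumed event header, e.g. "005 (102.000.000) 11/03 09:20:11 Job terminated."
EVENT_RE = re.compile(r"^(\d{3}) \((\d+)\.\d+\.\d+\)")
# Assumed detail line inside a 005 event.
RETVAL_RE = re.compile(r"Normal termination \(return value (-?\d+)\)")


def node_exit_codes(log_text):
    """Map each DAG node name to its logged exit code, or None if no
    normal-termination event was found for it (hypothetical helper)."""
    node_of_cluster = {}   # cluster ID -> DAG node name
    exit_codes = {}        # DAG node name -> exit code or None
    code = cluster = None  # event currently being read
    for line in log_text.splitlines():
        header = EVENT_RE.match(line)
        if header:
            code, cluster = header.groups()
            continue
        if line.strip() == "...":
            # End of the current event.
            code = cluster = None
            continue
        if code == "000" and "DAG Node:" in line:
            # Submit event: remember which node this cluster belongs to.
            name = line.split("DAG Node:", 1)[1].strip()
            node_of_cluster[cluster] = name
            exit_codes.setdefault(name, None)
        elif code == "005" and cluster in node_of_cluster:
            # Terminated event: record the return value if one is logged.
            rv = RETVAL_RE.search(line)
            if rv:
                exit_codes[node_of_cluster[cluster]] = int(rv.group(1))
    return exit_codes
```

Running this over the contents of the job log and looking at the entry for B would show whether a termination event was ever written for it. A value of None for B would fit the theory that DAGMan has no record of B finishing.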

Regards,
- Ian

---
Ian Chesal

Cycle Computing, LLC
Leader in Open Compute Solutions for Clouds, Servers, and Desktops
Enterprise Condor Support and Management Tools

http://www.cyclecomputing.com
http://www.cyclecloud.com
http://twitter.com/cyclecomputing