For quite a while, on the 6.7.x series, we have used a script to restart
the parent jobs that a child job depends on: it traverses the hierarchy
and restarts (using hold + release) the jobs that are required for the
completion of a child job. (Sometimes software license issues, disk
problems or data read/write errors make a task unusable for a while,
although restarting it after a short time makes it work and lets the
whole DAG continue.)
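
To make the traversal concrete, here is a minimal sketch of that kind of
restart logic. It is not the actual script: the `parents` map, the job
ids and both function names are made up for illustration, and the
condor_hold / condor_release command lines are only built, not executed.

```python
def ancestors_in_restart_order(parents, child):
    """Return every ancestor of `child`, each one listed after its own parents.

    `parents` is a hypothetical map: job id -> list of parent job ids.
    """
    order = []
    seen = set()

    def visit(job):
        for parent in parents.get(job, []):
            if parent not in seen:
                seen.add(parent)
                visit(parent)         # first collect the parent's own ancestors
                order.append(parent)  # then the parent itself

    visit(child)
    return order


def hold_release_commands(job_ids):
    """Build (but do not run) a hold + release command pair for each job id."""
    commands = []
    for job_id in job_ids:
        commands.append(["condor_hold", job_id])
        commands.append(["condor_release", job_id])
    return commands


if __name__ == "__main__":
    # Hypothetical hierarchy: job 102.0 needs 100.0 and 101.0; 101.0 needs 100.0.
    parents = {"102.0": ["100.0", "101.0"], "101.0": ["100.0"]}
    for command in hold_release_commands(ancestors_in_restart_order(parents, "102.0")):
        print(" ".join(command))
```

A real script would run each command (for example with subprocess) and
wait for the parents to finish before touching the child jobs; that part
is left out here.
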
The script restarts the parent jobs, waits for their completion, then
modifies the child jobs' data using qedit and restarts the child jobs
(hold and release again). This worked OK with 6.7, but with 6.8 I get a
DAG error message in the dagman.out file and *all* tasks in the DAGMan
job go into the removed state, with the reason given as
RemoveReason = "via condor_rm (by user szabolcs)":

8/17 15:53:02 BAD EVENT: job (34202.0.0) executing, total end count != 0 (1)
8/17 15:53:02 ERROR: aborting DAG because of bad event (BAD EVENT: job (34202.0.0) executing, total end count != 0 (1))
8/17 15:53:02 Aborting DAG...

This is a real problem for me. Could you tell me what happens under
the hood? How can I avoid it and get my script working again, or simply
disable this "error" checking?
Thanks in advance!
Cheers,
Szabolcs