Mailing List Archives Authenticated access	UW Madison Computer Sciences Department Computer Systems Lab


Re: [Condor-users] Restarting completed dag jobs does not work anymore with 6.8.0

Date: Thu, 17 Aug 2006 09:50:29 -0500
From: "Peter F. Couvares" <pfc@xxxxxxxxxxx>
Subject: Re: [Condor-users] Restarting completed dag jobs does not work anymore with 6.8.0

Horvátth,

I'm not sure I understand what you're doing -- but I'm not surprised it stopped working, as it's akin to brain surgery on a live, moving patient. :)

If the issue is jobs which fail sometimes due to factors outside your control, but which succeed if re-submitted, then why not use DAGMan's RETRY feature?
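
For illustration only, a minimal sketch of how RETRY might look in a DAG input file. The node names (Parent, Child), the submit file names, and the retry count of 3 are hypothetical and do not come from the original poster's setup:

```
# Hypothetical DAG input file; node names, submit files, and counts are illustrative.
JOB Parent parent.sub
JOB Child child.sub
PARENT Parent CHILD Child

# Ask DAGMan to re-run a node up to 3 times if it fails.
RETRY Parent 3
RETRY Child 3
```

With something along these lines, a node that fails transiently (for example, on a license or disk hiccup) would be retried by DAGMan itself, without touching the queue by hand.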

If that's not sufficient, please describe the problem in a little more detail. I'm optimistic there's a better solution than using condor_qedit. DAGMan's underlying implementation is obviously subject to change, so relying on a script which circumvents the supported API & semantics is going to be fragile.


-Peter


On Aug 17, 2006, at 9:02 AM, Horvátth Szabolcs wrote:

For quite a while, using the 6.7.x series, we used a script to restart parent-dependent child jobs by traversing the hierarchy and restarting the jobs (using hold + release) that were required for the completion of a child job. (Sometimes software license issues, disk problems, or data read/write errors can make a task unusable for a while, although restarting it after a short amount of time makes it work and lets the whole DAG continue.)

The script restarts the parent jobs, waits for their completion, and after completion modifies the child jobs' data using qedit and restarts the child jobs (hold and release again). This worked OK with 6.7, but with 6.8 I get a DAG error message in the dagman.out file and *all* tasks in the DAGMan job go into the removed state. The reason given is: RemoveReason = "via condor_rm (by user szabolcs)"

8/17 15:53:02 BAD EVENT: job (34202.0.0) executing, total end count != 0 (1)
8/17 15:53:02 ERROR: aborting DAG because of bad event (BAD EVENT: job (34202.0.0) executing, total end count != 0 (1))
8/17 15:53:02 Aborting DAG...

Now this is not really good for me. Could you tell me what happens under the hood? How can I avoid it and get my script working, or simply disable this "error" checking?

Thanks in advance!

Cheers,
Szabolcs


--
Peter Couvares                        University of Wisconsin-Madison
Condor Project Research               Department of Computer Sciences
pfc@xxxxxxxxxxx                       1210 W. Dayton St. Rm #4241
(608) 265-8936                        Madison, WI 53706-1685

