Re: [Condor-users] Good way to start failed jobs from large cluster?
- Date: Wed, 03 Jun 2009 20:09:20 +0200
- From: Carsten Aulbert <carsten.aulbert@xxxxxxxxxx>
- Subject: Re: [Condor-users] Good way to start failed jobs from large cluster?
Hi Matt,
Matthew Farrellee wrote:
> 1) you can probably do condor_history -format "%d\n" ProcId -constraint
> "ClusterId == ?? && [magic to identify proc as a failure]" to get your IDs
In this case it was even easier: the IDs could be read straight off
condor_q (see below).
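For the archives, the kind of query I mean looks roughly like this (a
sketch only; the cluster id 1234 and the wall-clock test are made up,
the real constraint depends on how the stuck jobs stand out):

  # print <cluster>.<proc> for suspicious jobs still in the queue;
  # condor_history accepts the same -format/-constraint options once
  # the jobs have left the queue
  condor_q -format "%d." ClusterId -format "%d\n" ProcId \
      -constraint 'ClusterId == 1234 && RemoteWallClockTime > 2*24*3600'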
>
> 2) If your failed jobs are being removed from the queue, why not use an
> OnExit policy to put them on hold when [magic to identify proc as a
> failure] is identified? This would let you avoid the resubmission; you'd
> just have to release the jobs for them to run again.
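For reference, I take it you mean something along these lines in the
submit file (a sketch; the non-zero exit code is just a placeholder for
the real failure test):

  # keep the job in the queue on hold instead of letting it leave
  # when it exits with a non-zero code (placeholder condition)
  on_exit_hold = (ExitCode =!= 0)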
In this case the user's jobs were running compiled Matlab code, and it
seems that due to a race condition (which another user won) quite a few
jobs were still running and doing stupid things (stat a directory, try
to open it, fail, sleep for 0.1 s, stat the directory again..., and so
on for two days).
Getting the IDs was thus easy enough with condor_q, and
condor_hold/condor_release helped this time. Afterwards, though, a few
jobs showed some weird patterns in their results, and those we wanted
to run again (this may or may not have been linked to the incident).
Hence the question whether this simple for loop was already close to
optimal.
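To be concrete, the simple for loop is essentially this (a sketch;
failed_ids.txt, failed_procs.txt, resubmit.sub and the ProcNum macro
are names I am making up here):

  # release jobs that were put on hold (one <cluster>.<proc> per line)
  while read id; do
      condor_release "$id"
  done < failed_ids.txt

  # ...or resubmit from scratch, one condor_submit per failed proc;
  # resubmit.sub would use $(ProcNum) to pick the right input
  for p in $(cat failed_procs.txt); do
      condor_submit -a "ProcNum = $p" resubmit.sub
  done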
OnExit would not really help here, since human intervention/analysis of
the results was needed to spot this issue in the first place.
Cheers
Carsten