Re: [Condor-users] Good way to start failed jobs from large cluster?
- Date: Wed, 03 Jun 2009 20:09:20 +0200
- From: Carsten Aulbert <carsten.aulbert@xxxxxxxxxx>
- Subject: Re: [Condor-users] Good way to start failed jobs from large cluster?
Hi Matt,
Matthew Farrellee wrote:
> 1) you can probably do condor_history -format "%d\n" ProcId -constraint
> "ClusterId == ?? && [magic to identify proc as a failure]" to get your IDs
In this case it was even easier: the IDs could be read straight off
condor_q (see below).
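For the archives, the kind of query I mean looks roughly like this (a
sketch only; the cluster id 1234 and the wall-clock test are made up,
the real constraint depends on how the stuck jobs stand out):

  # print <cluster>.<proc> for suspicious jobs still in the queue;
  # condor_history accepts the same -format/-constraint options once
  # the jobs have left the queue
  condor_q -format "%d." ClusterId -format "%d\n" ProcId \
      -constraint 'ClusterId == 1234 && RemoteWallClockTime > 2*24*3600'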
>
> 2) If your failed jobs are being removed from the queue, why not use an
> OnExit policy to put them on hold when [magic to identify proc as a
> failure] is identified? This would let you avoid the resubmission; you'd
> just have to release the jobs for them to run again.
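For reference, I take it you mean something along these lines in the
submit file (a sketch; the non-zero exit code is just a placeholder for
the real failure test):

  # keep the job in the queue on hold instead of letting it leave
  # when it exits with a non-zero code (placeholder condition)
  on_exit_hold = (ExitCode =!= 0)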
In this case the user's jobs were running compiled Matlab code, and it
seems that due to a race condition (which another user won) quite a few
jobs were still running and doing stupid things (stat a directory, try
to open it, fail, sleep for 0.1 s, stat the directory again..., and so
on for two days).
Getting the IDs was thus easy enough with condor_q, and
condor_hold/condor_release helped this time. Afterwards, though, a few
jobs showed some weird patterns in their results, and those we wanted
to run again (this may or may not have been linked to the incident).
Hence the question whether this simple for loop was already close to
optimal.
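To be concrete, the simple for loop is essentially this (a sketch;
failed_ids.txt, failed_procs.txt, resubmit.sub and the ProcNum macro
are names I am making up here):

  # release jobs that were put on hold (one <cluster>.<proc> per line)
  while read id; do
      condor_release "$id"
  done < failed_ids.txt

  # ...or resubmit from scratch, one condor_submit per failed proc;
  # resubmit.sub would use $(ProcNum) to pick the right input
  for p in $(cat failed_procs.txt); do
      condor_submit -a "ProcNum = $p" resubmit.sub
  done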
OnExit would not really help here, since human intervention/analysis of
the results was needed to spot this issue in the first place.
Cheers
Carsten