Re: [Condor-users] Good way to start failed jobs from large cluster?
- Date: Wed, 03 Jun 2009 10:17:31 -0500
- From: Matthew Farrellee <matt@xxxxxxxxxx>
- Subject: Re: [Condor-users] Good way to start failed jobs from large cluster?
Carsten Aulbert wrote:
> Hi,
>
> as an admin I'm out of condor submit file magic for some time and would
> like to know if there is an easy way to accomplish this:
>
> Imagine a user using vanilla universe and large clusters using a submit
> file like this:
>
> universe = vanilla
> Arguments = -j $(Process)
> log = /home/user/log/$(Process).log
> error = /home/user/log/$(Process).err
> executable = /home/user/bin/IWillFindIt.exe
> notification = Never
> queue 45345
>
> Now imagine this ran for a while but 134 jobs with more or less random
> numbers failed, e.g.
>
> 5.6, 5.1345, 5.8733, ...
>
> What is a good way to restart only these? So far I help myself with this:
>
> for i in `magic_which_will_output_me_process_ids_only`; do
> cat <<EOF | condor_submit
> universe = vanilla
> Arguments = -j $i
> log = /home/user/log/$i.log
> error = /home/user/log/$i.err
> executable = /home/user/bin/IWillFindIt.exe
> notification = Never
> queue
> EOF
> done
>
> Is there a better way to do this?
>
> Please note: I need to get the log and error files, as well as the argument line, right.
>
> Cheers
>
> Carsten
Two quick thoughts:
1) You can probably do condor_history -format "%d\n" ProcId -constraint
"ClusterId == ?? && [magic to identify proc as a failure]" to get your Ids.
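For example, if a nonzero exit code marks a failure and the cluster id were 5 (both of those are assumptions here, not something Carsten stated), the command might look like:

```shell
# Hypothetical example: print the proc ids of jobs in cluster 5
# that exited on their own with a nonzero exit code.
# ExitCode is only defined for jobs that exited normally, hence
# the ExitBySignal guard; adjust the failure test to your case.
condor_history -format "%d\n" ProcId \
  -constraint 'ClusterId == 5 && ExitBySignal == FALSE && ExitCode != 0'
```

The output is one proc id per line, which plugs straight into the `for i in ...` resubmission loop above.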
2) If your failed jobs are being removed from the queue, why not use an
OnExit policy to put them on hold when [magic to identify proc as a
failure] is detected? That would let you avoid the resubmission entirely;
you'd just have to release the held jobs for them to run again.
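A sketch of what such a policy could look like in the original submit file, again assuming a nonzero exit code is what identifies a failure:

```shell
universe      = vanilla
Arguments     = -j $(Process)
log           = /home/user/log/$(Process).log
error         = /home/user/log/$(Process).err
executable    = /home/user/bin/IWillFindIt.exe
notification  = Never
# Put (rather than remove) any job on hold if it exits normally
# with a nonzero exit code; held jobs keep their ClusterId/ProcId
# and can simply be released to run again.
on_exit_hold  = (ExitBySignal == FALSE) && (ExitCode != 0)
queue 45345
```

The held jobs could then be released in one go with something like condor_release -constraint 'JobStatus == 5' (5 being the Held status), with no resubmission and no risk of mismatched log/error/argument lines.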
Best,
matt