Re: [Condor-users] Good way to start failed jobs from large cluster?
- Date: Wed, 03 Jun 2009 10:17:31 -0500
- From: Matthew Farrellee <matt@xxxxxxxxxx>
- Subject: Re: [Condor-users] Good way to start failed jobs from large cluster?
Carsten Aulbert wrote:
> Hi,
>
> as an admin I'm out of condor submit file magic for some time and would
> like to know if there is an easy way to accomplish this:
>
> Imagine a user using vanilla universe and large clusters using a submit
> file like this:
>
> universe = vanilla
> Arguments = -j $(Process)
> log = /home/user/log/$(Process).log
> error = /home/user/log/$(Process).err
> executable = /home/user/bin/IWillFindIt.exe
> notification = Never
> queue 45345
>
> Now imagine this ran for a while but 134 jobs with more or less random
> numbers failed, e.g.
>
> 5.6, 5.1345, 5.8733, ...
>
> What is a good way to restart only these? So far I help myself with this:
>
> for i in `magic_which_will_output_me_process_ids_only`; do
> cat <<EOF | condor_submit
> universe = vanilla
> Arguments = -j $i
> log = /home/user/log/$i.log
> error = /home/user/log/$i.err
> executable = /home/user/bin/IWillFindIt.exe
> notification = Never
> queue
> EOF
> done
>
> Is there a better way to do this?
>
> Please note: I need to get the log and error files, as well as the argument line, right.
>
> Cheers
>
> Carsten
Two quick thoughts:
1) You can probably do condor_history -format "%d\n" ProcId -constraint
"ClusterId == ?? && [magic to identify proc as a failure]" to get your Ids.
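For example, if a nonzero exit code marks a failure and the cluster id were 5 (both of those are assumptions here, not something Carsten stated), the command might look like:

```shell
# Hypothetical example: print the proc ids of jobs in cluster 5
# that exited on their own with a nonzero exit code.
# ExitCode is only defined for jobs that exited normally, hence
# the ExitBySignal guard; adjust the failure test to your case.
condor_history -format "%d\n" ProcId \
  -constraint 'ClusterId == 5 && ExitBySignal == FALSE && ExitCode != 0'
```

The output is one proc id per line, which plugs straight into the `for i in ...` resubmission loop above.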
2) If your failed jobs are being removed from the queue, why not use an
OnExit policy to put them on hold when [magic to identify proc as a
failure] is detected? That would let you avoid the resubmission entirely;
you'd just have to release the held jobs for them to run again.
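A sketch of what such a policy could look like in the original submit file, again assuming a nonzero exit code is what identifies a failure:

```shell
universe      = vanilla
Arguments     = -j $(Process)
log           = /home/user/log/$(Process).log
error         = /home/user/log/$(Process).err
executable    = /home/user/bin/IWillFindIt.exe
notification  = Never
# Put (rather than remove) any job on hold if it exits normally
# with a nonzero exit code; held jobs keep their ClusterId/ProcId
# and can simply be released to run again.
on_exit_hold  = (ExitBySignal == FALSE) && (ExitCode != 0)
queue 45345
```

The held jobs could then be released in one go with something like condor_release -constraint 'JobStatus == 5' (5 being the Held status), with no resubmission and no risk of mismatched log/error/argument lines.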
Best,
matt