Mailing List Archives Authenticated access	UW Madison Computer Sciences Department Computer Systems Lab


Re: [Condor-users] Job Resubmission

Date: Fri, 05 Oct 2012 07:47:05 -0400
From: Matthew Farrellee <matt@xxxxxxxxxx>
Subject: Re: [Condor-users] Job Resubmission

On 10/05/2012 07:22 AM, Giles Wright wrote:

Hi - Just getting started with Condor. We're looking at running jobs on
a local network of Linux machines with the potential of adding in a
connection to an Amazon VPC in the future. Have set up a couple of
virtual condor pools at 2 separate locations and have configured
flocking between them.

My question right now is how Condor deals with resubmitting failed jobs.
For example, should a machine die during execution of a job, it seems that
the job terminates as you would expect, however we would like any failed
jobs to be resubmitted to another available machine in the pool. Should
we be looking at Condor-G (not started there yet) and condor_resubmit?

Thanks,
Giles


Giles,

Condor provides more functionality around the resubmission use case than most other schedulers. And the default policy is set up in such a way that most Condor folks don't ever think about "resubmission."

Condor will keep your job in the queue (condor_schedd managed) until the policy attached to the job says otherwise.

The default policy says a job will be run as many times as necessary for the job to terminate. So if the machine a job is running on crashes (or, more generally, becomes unavailable), the condor_schedd will automatically try to run the job on another machine.

When you start changing the default policy you can control things such as: if a job should be removed after a period of time, even if it is running or only if it hasn't started running; if a job should run multiple times even if it terminated cleanly; if a termination with an error should make the job run again, be held in the queue for inspection, or be removed from the queue; if a job held for inspection should be held forever or for a specific amount of time; if a job should only start running at a specific time in the future, or be run at repeated intervals.
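
As a rough illustration of the kind of policy described above, here is a minimal sketch of a submit description file. The executable name, file names, and time thresholds are made up for the example, and the expressions are only one possible way to express each policy; check the condor_submit manual page for your version before relying on them.

```
# Hypothetical job; my_job.sh and the output file names are placeholders.
universe   = vanilla
executable = my_job.sh
output     = job.$(Cluster).$(Process).out
error      = job.$(Cluster).$(Process).err
log        = job.$(Cluster).log

# Leave the queue only after a clean exit (exit code 0, not killed by a
# signal). Any other termination returns the job to idle so it runs again.
on_exit_remove = (ExitBySignal == False) && (ExitCode == 0)

# Alternative: hold a job that exits with an error so it can be inspected.
# on_exit_hold = (ExitBySignal == True) || (ExitCode != 0)

# Release a held job (JobStatus 5) once it has been held for 30 minutes.
# The 1800-second threshold is an arbitrary example value.
periodic_release = (JobStatus == 5) && ((time() - EnteredCurrentStatus) > 1800)

# Remove a job that is still idle (JobStatus 1) a day after submission.
# The 86400-second threshold is an arbitrary example value.
periodic_remove = (JobStatus == 1) && ((time() - QDate) > 86400)

queue
```

Setting on_exit_remove = False would make a job run again even after a clean exit. Starting at a specific time or at repeated intervals is handled by separate submit commands (deferral_time and the cron_* family), which are not shown here.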


Best,


matt
