On 10/05/2012 07:22 AM, Giles Wright wrote:
Hi - Just getting started with Condor. We're looking at running jobs on a local network of Linux machines with the potential of adding in a connection to an Amazon VPC in the future. Have set up a couple of virtual Condor pools at 2 separate locations and have configured flocking between them. My question right now is how Condor deals with resubmitting failed jobs. For example, should a machine die during execution of a job, it seems that the job terminates as you would expect. However, we would like any failed jobs to be resubmitted to another available machine in the pool. Should we be looking at Condor-G (not started there yet) and condor_resubmit? Thanks, Giles
Giles,

Condor provides more functionality around the resubmission use case than most other schedulers. And the default policy is set up in such a way that most Condor folks don't ever think about "resubmission."
Condor will keep your job in the queue (condor_schedd managed) until the policy attached to the job says otherwise.
The default policy says a job will be run as many times as necessary for the job to terminate. So if the machine a job is running on crashes (or, more generally, becomes unavailable), the condor_schedd will automatically try to run the job on another machine.
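For illustration, a bare-bones submit file that relies entirely on the default policy might look like the sketch below. The file names are placeholders made up for this example. Note that it contains no resubmission settings at all; the rerun-until-it-terminates behavior comes from the defaults.

```
# Hypothetical minimal submit file; my_job.sh and the log/output names are placeholders.
universe   = vanilla
executable = my_job.sh
output     = my_job.out
error      = my_job.err
log        = my_job.log

# No policy expressions are set, so the default applies: the job stays in
# the queue and is rerun elsewhere if its machine goes away, until it terminates.
queue
```

The user log (my_job.log here) is where you would see the job being evicted and started again on another machine.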
When you start changing the default policy you can control things such as:

 - if a job should be removed after a period of time, even if it is running or only if it hasn't started running;
 - if a job should run multiple times even if it terminated cleanly;
 - if a termination w/ an error should make the job run again, be held in the queue for inspection, or be removed from the queue;
 - if a job held for inspection should be held forever or for a specific amount of time;
 - if a job should only start running at a specific time in the future, or be run at repeated intervals.
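As a rough sketch of how some of those policies can be expressed, here is a submit file using the periodic_* and on_exit_* commands. The executable name and every threshold are assumptions picked for illustration, not recommendations, so check the condor_submit manual page for your version before relying on any of it.

```
# Hypothetical submit file; the executable name and all time limits are made up.
universe   = vanilla
executable = my_job.sh
log        = my_job.log

# Remove the job if it is still idle (JobStatus == 1) a day after submission.
periodic_remove = (JobStatus == 1) && (time() - QDate > 24 * 60 * 60)

# Hold the job for inspection if it was killed by a signal.
on_exit_hold = (ExitBySignal == True)

# Otherwise, leave the queue only after a clean exit with code 0;
# any other exit puts the job back to idle so it runs again.
on_exit_remove = (ExitBySignal == False) && (ExitCode == 0)

# Release a held job (JobStatus == 5) after 30 minutes instead of holding it forever.
periodic_release = (JobStatus == 5) && (time() - EnteredCurrentStatus > 30 * 60)

# Alternatives for timed jobs, left commented out so they do not interact
# with the policy above:
#   start no earlier than a given Unix timestamp (placeholder value)
# deferral_time = 1349740800
#   or run at 02:00 every day (this also needs on_exit_remove = False
#   so the job stays in the queue between runs)
# cron_minute = 0
# cron_hour   = 2

queue
```

The expressions are evaluated against the job's ClassAd, so condor_q -long on a queued job shows the attributes (JobStatus, QDate, ExitCode and so on) that are available to them.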
Best, matt