Re: [HTCondor-users] [Condor-users] Job Resubmit
- Date: Mon, 30 Jun 2014 13:31:52 -0500
- From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] [Condor-users] Job Resubmit
On 6/30/2014 2:24 AM, Sunshine wrote:
> I submit some jobs.
> A few of jobs took 2 hours to complete, but I think the time should be 20m and some similar jobs indeed finished within 20minutes.
> I think something wrong with my jobs or clusters..
>
>
> My question: how do I let a job restart after a specific time?
> For example, if a job didn't finish within 5 minutes, then let the job resubmit, or restart on a different machine?
>
See the submit file below, which I wrote as an example of how to do the above.
It is hopefully self-explanatory, especially if you look up the expressions
I used in the condor_submit manual page; see
http://research.cs.wisc.edu/htcondor/manual/current/condor_submit.html
Feel free to ask any questions.
Hope this helps,
Todd
# Fill in executable and max expected runtime in minutes.
# If the job runs longer than expected, it will go on hold,
# and then will be restarted on a different machine. After
# three runs on three different machines, the job will
# stay on hold.
#
executable = foo
expected_runtime_minutes = 5
#
# Should not need to change the below...
#
job_machine_attrs = Machine
job_machine_attrs_history_length = 4
requirements = target.machine =!= MachineAttrMachine1 && \
               target.machine =!= MachineAttrMachine2 && \
               target.machine =!= MachineAttrMachine3
periodic_hold = JobStatus == 2 && \
                CurrentTime - EnteredCurrentStatus > 60 * $(expected_runtime_minutes)
periodic_hold_subcode = 1
periodic_release = HoldReasonCode == 3 && HoldReasonSubCode == 1 && \
                   JobRunCount < 3
periodic_hold_reason = ifthenelse(JobRunCount<3,"Ran too long, will retry","Ran too long")
queue
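
For readers who want to trace the hold/release policy by hand, here is a rough plain-Python model of the two expressions above. It is only a sketch: HTCondor evaluates these as ClassAd expressions itself, the function names are made up for illustration, and the model assumes JobRunCount goes up by one each time the job starts. Under those assumptions, it suggests that a job which always overruns is started three times in total before it stays on hold.

```python
# Plain-Python model of periodic_hold / periodic_release from the submit file.
# Illustration only; the function names here are hypothetical, not HTCondor APIs.

RUNNING = 2             # JobStatus value for a running job
PERIODIC_HOLD_CODE = 3  # HoldReasonCode tested in periodic_release
MAX_RUNS = 3            # mirrors "JobRunCount < 3" in periodic_release


def should_hold(job_status, current_time, entered_current_status,
                expected_runtime_minutes):
    """Mirror of periodic_hold: running for longer than the expected runtime."""
    elapsed = current_time - entered_current_status
    return job_status == RUNNING and elapsed > 60 * expected_runtime_minutes


def should_release(hold_reason_code, hold_reason_subcode, job_run_count):
    """Mirror of periodic_release: only retry holds caused by this policy."""
    return (hold_reason_code == PERIODIC_HOLD_CODE
            and hold_reason_subcode == 1
            and job_run_count < MAX_RUNS)


def count_runs_if_always_slow():
    """Count how often a job that always overruns gets started.

    Assumption: JobRunCount is incremented once per job start.
    """
    job_run_count = 0
    while True:
        job_run_count += 1  # the job starts on some machine
        # The job overruns, so periodic_hold puts it on hold with subcode 1.
        if not should_release(PERIODIC_HOLD_CODE, 1, job_run_count):
            return job_run_count


if __name__ == "__main__":
    # 400 seconds elapsed against a 5 minute (300 second) limit
    print(should_hold(RUNNING, 1000, 600, 5))
    print(count_runs_if_always_slow())
```

Note that the subcode check means a job put on hold for any other reason (for example by hand with condor_hold) is not released by this policy.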