I submitted some jobs.
A few of the jobs took 2 hours to complete, but I expected them to take about 20 minutes, and some similar jobs did finish within 20 minutes.
I think something is wrong with my jobs or the cluster.
My question: how do I make a job restart after a specific amount of time?
For example, if a job doesn't finish within 5 minutes, can it be resubmitted or restarted on a different machine?
For example, I submit 100 jobs and 99 of them finish within 20 minutes, but one job takes 2 hours; I want to resubmit that one job.
I used the following:
periodic_remove = (CurrentTime - EnteredCurrentStatus > 60*20)
Then I check the log and resubmit the removed job by hand.
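A possible alternative that would avoid the manual resubmit step is to hold the slow job and release it again instead of removing it. This is only a minimal, untested sketch. It assumes that a HoldReasonCode of 3 identifies a hold triggered by periodic_hold, and that the NumJobStarts attribute is maintained on my pool. The retry limit of 3 and the hold reason text are arbitrary choices of mine.

```
# Sketch only, not tested on my cluster.
# Hold a job that has been in the running state (JobStatus == 2)
# for more than 20 minutes.
periodic_hold = (JobStatus == 2) && (time() - EnteredCurrentStatus > 20 * 60)
periodic_hold_reason = "Ran longer than 20 minutes"

# Release jobs held by the rule above so they are matched and started again.
# Assumption: HoldReasonCode == 3 means the hold came from periodic_hold.
# The limit of 3 starts is arbitrary and only there to avoid an endless loop.
periodic_release = (HoldReasonCode == 3) && (NumJobStarts < 3)
```

As far as I can tell, releasing a job this way does not guarantee that it lands on a different machine, so that part of my question remains open.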
Any better ideas?