[Condor-users] Distinguish preemption from crash
- Date: Mon, 11 May 2009 22:37:05 +0200
- From: Horvátth Szabolcs <szabolcs@xxxxxxxxxxxxx>
- Subject: [Condor-users] Distinguish preemption from crash
Hi,
We have some jobs that fail once in a while because of temporary
memory / disk / network issues. Restarting the jobs usually solves the
problem, but sometimes a job has an issue that makes it crash every
time, so restarting it an unlimited number of times is just a waste of
processors.
Before we enabled preemption we capped the number of restarts (by
using an on exit hold expression after n restarts). Enabling preemption
caused lots of problems: judging by the job run count classad variable,
there is no difference between a preemption and a software problem, so
preemptions made jobs reach this restart limit quite fast.
What would you suggest doing to get around this problem? Can I somehow
subtract the number of preemptions from the job run count?
Or should I add a custom attribute to count just the software crashes
based on the return values?
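To make the second idea concrete, here is a minimal Python sketch of the
counting logic only. It is not Condor configuration and uses no Condor
API: the RunEnd record, its fields and the max_crashes threshold are all
hypothetical names made up for illustration. It assumes each run can be
classified as either an eviction or a real exit with a return value.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class RunEnd:
    """How one execution attempt ended (hypothetical record, not a Condor structure)."""
    evicted: bool               # True if the run ended because the job was preempted
    exit_code: Optional[int]    # None when the job was evicted or killed by a signal
    by_signal: bool = False     # True if the job itself died from a signal


def is_crash(run: RunEnd) -> bool:
    """A run counts as a crash only if the job really exited and did so abnormally."""
    if run.evicted:
        return False            # preemption is not the job's fault
    return run.by_signal or run.exit_code != 0


def count_crashes(runs: Iterable[RunEnd]) -> int:
    """Number of runs that ended in a software crash, ignoring preemptions."""
    return sum(1 for run in runs if is_crash(run))


def should_hold(runs: Iterable[RunEnd], max_crashes: int = 3) -> bool:
    """Hold the job once the crash count (not the run count) reaches the limit."""
    return count_crashes(runs) >= max_crashes


# Example history: two preemptions, one non-zero exit, one signal, one success.
history = [
    RunEnd(evicted=True, exit_code=None),
    RunEnd(evicted=False, exit_code=1),
    RunEnd(evicted=True, exit_code=None),
    RunEnd(evicted=False, exit_code=None, by_signal=True),
    RunEnd(evicted=False, exit_code=0),
]
```

With this history the job has run five times but crashed only twice, so
a limit of three crashes would not put it on hold yet, while a limit of
three runs would have been exceeded by the preemptions alone.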
Cheers,
Szabolcs