Subject: [HTCondor-users] Excluding execute nodes after multiple job failures
Hi all,
I'm wondering whether there is a way to exclude a node from the pool of available nodes once a certain number of submitted jobs have failed on it within a given time. I've run into this a few times, either because a node was missing some packages or because of some other problem with the node. In these cases, jobs submitted to the offending node fail and are then immediately re-submitted to the same node. This can easily result in a large number of jobs being marked as failed after using up all their retries.
Thanks, Duncan
--
==========================
Duncan Meacher, PhD
Postdoctoral Researcher
Institute for Gravitation and the Cosmos
Department of Physics
Pennsylvania State University
104 Davey Lab #040
University Park, PA 16802
Tel: +1 814 865 3243
==========================