[HTCondor-users] How to handle a node where jobs are failing
- Date: Fri, 06 May 2022 16:12:50 +0200
- From: Martin Sajdl <masaj.xxx@xxxxxxxxx>
- Subject: [HTCondor-users] How to handle a node where jobs are failing
Hi All,
I need help with the following issue:
a PC/node was newly connected to our pool and had a high rank for
many tasks (many of them were assigned to this node even though other
nodes were free as well). However, the node was misconfigured by mistake,
so none of the tasks were able to run there. As a result, almost all
the tasks were sent to that node again and again, and failed every
time.
My question is: is there a configuration option that makes it possible,
e.g., to disconnect a node from the pool when N consecutive jobs have
failed there, or to make the pool assign a task to a different node
after it has failed on the first one?
My idea is to decrease the rank of the node for the tasks: set a
ClassAd attribute every time a task fails and use this value in the RANK
expression. But maybe there is a better way I do not know about.
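
To make the idea concrete, here is a rough sketch of the logic in plain
Python. It is not HTCondor configuration and uses no HTCondor API; the
class name, the penalty value and the failure threshold are all made up
for illustration. It only shows the two behaviours asked about: lowering
a node's rank after each failure, and skipping the node entirely after N
consecutive failures.

```python
class FailureAwareRanker:
    """Hypothetical helper: tracks consecutive job failures per node."""

    def __init__(self, penalty_per_failure=30.0, max_consecutive_failures=3):
        # Both values are illustrative assumptions, not HTCondor defaults.
        self.penalty_per_failure = penalty_per_failure
        self.max_consecutive_failures = max_consecutive_failures
        self._failures = {}  # node name -> consecutive failure count

    def record_result(self, node, succeeded):
        # A success resets the streak; a failure extends it.
        if succeeded:
            self._failures[node] = 0
        else:
            self._failures[node] = self._failures.get(node, 0) + 1

    def is_excluded(self, node):
        # "Disconnect" the node once it has failed N times in a row.
        return self._failures.get(node, 0) >= self.max_consecutive_failures

    def effective_rank(self, node, base_rank):
        # Lower the rank by a fixed penalty for each consecutive failure.
        return base_rank - self.penalty_per_failure * self._failures.get(node, 0)

    def pick_node(self, base_ranks):
        # Choose the non-excluded node with the highest effective rank,
        # or None if every node is currently excluded.
        candidates = {
            node: self.effective_rank(node, rank)
            for node, rank in base_ranks.items()
            if not self.is_excluded(node)
        }
        if not candidates:
            return None
        return max(candidates, key=candidates.get)


ranker = FailureAwareRanker()
base_ranks = {"new-node": 100.0, "old-node": 50.0}
first_choice = ranker.pick_node(base_ranks)       # preferred while healthy
ranker.record_result("new-node", succeeded=False)
ranker.record_result("new-node", succeeded=False)
after_two_failures = ranker.pick_node(base_ranks)  # penalty now outweighs rank
```

With these example numbers the misconfigured node loses its preference
after two failures and is skipped completely after three, until a
successful run resets its counter.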
Thank you in advance!
Masaj