
Re: [HTCondor-users] How to handle a node where jobs are failing



The startd keeps some statistics about the number of jobs it ran and how long they ran for:

JobDurationAvg = 27.8598385
JobDurationCount = 2
JobDurationMax = 35.355968
JobDurationMin = 20.363709
JobStarts = 2
RecentJobDurationAvg = 27.8598385
RecentJobDurationCount = 2
RecentJobDurationMax = 35.355968
RecentJobDurationMin = 20.363709
RecentJobStarts = 2

The Recent numbers are for the past 20 minutes. The non-Recent numbers are over the lifetime of the condor_startd daemon.
The JobStarts numbers are always published in the slot ads. To publish the JobDuration numbers, add this to your execute machines' configuration:
STATISTICS_TO_PUBLISH_LIST = JobDuration
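
For example, after a condor_reconfig on the execute machine you can check that the attributes show up in a slot ad (the slot name below is just a placeholder):

    condor_status -l slot1@exec-node.example.com | grep JobDuration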

These values could be referenced in the START expression, or periodically queried by a monitoring script that sends an alert.
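
As a sketch of the first option (the thresholds and the exact condition here are only illustrative, not something HTCondor provides out of the box), a slot could stop matching new jobs once its recent jobs have all finished suspiciously quickly, since a very short average duration usually means the jobs failed right away:

    STATISTICS_TO_PUBLISH_LIST = JobDuration
    # Stop matching once 5 or more recent jobs have started but the average
    # recent duration is under 60 seconds; the =?= UNDEFINED guard keeps the
    # expression defined before any job duration has been recorded.
    START = $(START) && ( RecentJobStarts < 5 || RecentJobDurationAvg =?= UNDEFINED || RecentJobDurationAvg > 60 )

A monitoring script could instead run a similar query from the central manager and alert on any machine that matches, for example:

    condor_status -constraint 'RecentJobStarts >= 5 && RecentJobDurationAvg < 60' \
                  -af Machine RecentJobStarts RecentJobDurationAvg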

 - Jaime

> On May 6, 2022, at 9:12 AM, Martin Sajdl <masaj.xxx@xxxxxxxxx> wrote:
> 
> Hi All,
> 
> I need a help with the following issue:
> there was a PC/node newly connected to our pool which had a high rank for many tasks (many of them were assigned to this node even though other nodes were free as well), but the node was misconfigured by mistake, so none of the tasks were able to run there... This combination caused almost all the tasks to be tried on that node again and again with no success.
> My question is: is there a configuration option that makes it possible to, e.g., disconnect a node from the pool after N subsequent jobs have failed there, or to have the pool try to assign a task to a different node when it fails on the first one?
> My idea is to decrease the rank of the node for the tasks - set a ClassAd attribute every time a task fails and use that value in the RANK formula... But maybe there is a better way I do not know about.
> 
> Thank you in advance!
> 
> Masaj
>