
[HTCondor-users] Best Practice for Reporting Execute Node Status and Configuration Issues



Dear HTCondor Community,

I'm reaching out to ask about best practices for reporting and validating the health of execute nodes, particularly with respect to their configuration and readiness to run jobs.

We recently encountered an issue where NFS mount points on some execute nodes became unavailable. As a result, job processes hung in the D (uninterruptible sleep) state while blocked on the dead mounts. Unfortunately, HTCondor still considered these nodes healthy and continued to schedule jobs on them.

This experience raised a few questions for our team:

1. What is the recommended way to validate execute nodes before they are allowed to run jobs?
2. Is there a mechanism within HTCondor to prevent jobs from being scheduled on nodes that appear healthy but are actually misconfigured or partially unavailable?
3. While reviewing the documentation, I found that it's possible to use STARTD_CRON jobs to periodically check node configuration and report the results back as ClassAd attributes. Could such checks be used to automatically drain a node when a failure is detected, thereby preventing new jobs from being scheduled? (A sketch of what we have in mind follows below.)
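
To make question 3 concrete, here is the rough shape of the setup we are considering. This is only a sketch: the job name MOUNTCHECK, the script path, and the attribute name NFS_Mounts_OK are placeholders of ours, not anything HTCondor prescribes.

    # condor_config on the execute node (names below are our placeholders)
    # Run a mount check every few minutes and publish its output as
    # attributes in the slot ClassAds.
    STARTD_CRON_JOBLIST = $(STARTD_CRON_JOBLIST) MOUNTCHECK
    STARTD_CRON_MOUNTCHECK_EXECUTABLE = /usr/local/libexec/condor/check_mounts.sh
    STARTD_CRON_MOUNTCHECK_PERIOD = 5m
    STARTD_CRON_MOUNTCHECK_MODE = Periodic

    # Refuse to match new jobs whenever the check reports a problem
    # (or has not reported yet); running jobs are not affected.
    START = $(START) && (NFS_Mounts_OK =?= True)

The check script would simply test each required mount point and print a single ClassAd assignment such as "NFS_Mounts_OK = True" (or False) to stdout. Is gating START on an attribute like this the recommended pattern, or would triggering an explicit condor_drain from the check be the better approach?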

Alternatively, is there a more effective way to perform these checks at job submission, so that jobs are only dispatched to nodes that are guaranteed to be able to run them successfully?
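
For completeness, the submit-side variant we imagined looks like the following, again using our placeholder attribute name:

    # In the job submit description; NFS_Mounts_OK is the placeholder
    # attribute published by the startd cron job sketched above.
    requirements = (TARGET.NFS_Mounts_OK =?= True)

The drawback we see is that every submitter has to remember to add this clause, which is why an execute-node-side solution seems preferable, but we may be missing a cleaner mechanism.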

For our specific use case, we need to ensure that certain network shares are mounted and accessible before jobs are dispatched to a node.

Any guidance on recommended approaches would be greatly appreciated.

Thank you for your time and assistance.
Gabriel Saudade