Dear HTCondor Community,
I'm reaching out to ask about best practices for reporting and
validating the health of execute nodes, particularly with respect
to their configuration and readiness to run jobs.
We recently encountered an issue where NFS mount points on some
execute nodes became unavailable. As a result, jobs entered a D
(uninterruptible sleep) state because the mounts could not be
accessed. Unfortunately, HTCondor still considered these nodes
healthy and continued to schedule jobs on them.
This experience raised a few questions for our team:
1. What is the recommended way to validate execute nodes before
   they are allowed to run jobs?
2. Is there a mechanism within HTCondor to prevent jobs from being
   scheduled on nodes that appear healthy but are actually
   misconfigured or partially unavailable?
While reviewing the documentation, I found that it's possible to
use STARTD_CRON jobs to periodically check node configuration and
report the results back as ClassAd attributes. Could such checks
be used to automatically drain a node when a failure is detected,
thereby preventing new jobs from being scheduled?
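For concreteness, here is roughly what we had in mind on the
execute node side (a minimal sketch; the job name NFSCHECK, the
attribute name NFSMountsHealthy, and the script path are
placeholders we made up; the check script itself is sketched
further below):

    STARTD_CRON_JOBLIST = $(STARTD_CRON_JOBLIST) NFSCHECK
    STARTD_CRON_NFSCHECK_EXECUTABLE = /usr/local/libexec/nfs_check.py
    STARTD_CRON_NFSCHECK_PERIOD = 5m
    STARTD_CRON_NFSCHECK_MODE = Periodic

    # Refuse new matches unless the check has run and reported
    # success; =?= makes an Undefined attribute (no check yet)
    # count as unhealthy.
    START = ($(START)) && (NFSMountsHealthy =?= True)

As far as we understand, a False START only prevents new matches
and leaves already-running jobs alone. Is that effectively
equivalent to draining for this purpose, or would we still want
condor_drain (or a preemption policy) to deal with jobs that are
already stuck on a dead mount?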
Alternatively, is there a more effective way to perform these
checks at submission time, so that jobs are only dispatched to
nodes that are actually able to run them?
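For example, would it be reasonable to have users (or our submit
wrappers) add a requirement against the same machine attribute
(again using the made-up name from above)?

    # hypothetical submit file fragment
    executable   = my_job.sh
    requirements = (TARGET.NFSMountsHealthy =?= True)
    queue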
For our specific use case, we need to ensure that certain network
shares are mounted and accessible before jobs are dispatched to a
node.
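The check itself could be something like the following Python
sketch (mount paths and the attribute name are placeholders). It
probes each mount in a child process, so that a hard-hung NFS
mount cannot wedge the check script itself, and prints the result
in the attribute format that startd cron expects:

    #!/usr/bin/env python3
    import subprocess
    import time

    MOUNTS = ["/nfs/data", "/nfs/home"]  # placeholder mount points

    def mount_ok(path, timeout=10):
        # Run stat in a child process; on a dead hard mount, stat
        # can block in uninterruptible sleep, so we must not call
        # it directly in this process.
        proc = subprocess.Popen(
            ["stat", "--", path],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        deadline = time.time() + timeout
        while time.time() < deadline:
            if proc.poll() is not None:
                return proc.returncode == 0
            time.sleep(0.2)
        # The kill may not take effect if the child is stuck in D
        # state, but it does not block; we simply report the mount
        # as unhealthy and move on.
        proc.kill()
        return False

    def main():
        ok = all(mount_ok(p) for p in MOUNTS)
        # Print ClassAd attributes for the startd to publish; the
        # lone "-" line marks the end of the ad.
        print("NFSMountsHealthy = {}".format("True" if ok else "False"))
        print("-")

    if __name__ == "__main__":
        main()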
Any guidance on recommended approaches would be greatly
appreciated.
Thank you for your time and assistance.
Gabriel Saudade