Re: [HTCondor-users] misconfigured node
On Sat, May 24, 2014 at 7:45 AM, Rita <rmorgan466@xxxxxxxxx> wrote:
> I know a user can set up a "black hole" policy, but I was wondering if there is
> something I can do on the startd side to avoid black holes. Would it be
> possible to run a test to see if the black hole problem is occurring?
>
It's certainly possible, but I don't know how practical it would be.
Assuming you know the root cause of the black hole state (for
example, I've seen it happen when NFS mounts hang on the execute
node), you could write a test that runs as a startd cron job. The START
expression on the execute node could then take the result into account.
For example, if you know the cause is bad NFS mounts, your test can
run the mount command and, if it doesn't return within N seconds,
publish NODE_CHECK_MOUNTS = False. A basic START expression for a
dedicated execute node would then be:

START = NODE_CHECK_MOUNTS

(The attribute is referenced by name here; $(NODE_CHECK_MOUNTS) would
expand a configuration macro rather than the value the cron job
publishes into the machine ad.)
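
As a rough illustration of the mount check described above, here is a minimal sketch of a startd cron script in Python. The timeout value, the job name MOUNTCHECK, and the script path in the comments are assumptions made for the example, not anything from a real deployment. The START line in the comments uses =?= so the slot stays closed until the first result has been published, which is one possible choice rather than a requirement.

```python
#!/usr/bin/env python3
"""Sketch of a startd cron check that publishes NODE_CHECK_MOUNTS.

Hypothetical wiring in the execute node's configuration (the job name
MOUNTCHECK and the path are made up for this example):

    STARTD_CRON_JOBLIST = $(STARTD_CRON_JOBLIST) MOUNTCHECK
    STARTD_CRON_MOUNTCHECK_EXECUTABLE = /path/to/check_mounts.py
    STARTD_CRON_MOUNTCHECK_PERIOD = 5m
    START = (NODE_CHECK_MOUNTS =?= True)
"""
import subprocess

# The "N seconds" from the description above; 10 is an arbitrary assumption.
TIMEOUT_SECONDS = 10


def mounts_respond(command=("mount",), timeout=TIMEOUT_SECONDS):
    """Return True if `command` exits with status 0 within `timeout` seconds."""
    try:
        proc = subprocess.Popen(
            command,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except OSError:
        # The command could not be started at all; treat the node as unhealthy.
        return False
    try:
        return proc.wait(timeout=timeout) == 0
    except subprocess.TimeoutExpired:
        # Best-effort kill. We deliberately do not wait for the child here,
        # because a process stuck on a hung mount may never exit.
        proc.kill()
        return False


def classad_line(healthy):
    """Format the attribute line the startd reads from the script's stdout."""
    return "NODE_CHECK_MOUNTS = {}".format("True" if healthy else "False")


if __name__ == "__main__":
    print(classad_line(mounts_respond()))
```

Whether running mount is enough to detect a hang will depend on the site; a check that touches a file on the mounted filesystem could be passed in as the command instead.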
I'd be interested in hearing what leads to a black hole state. I
started a wiki page
(https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=BlackHoleConditions)
for people to document these conditions as they find them.
For those unaware of the black hole policy referred to above, see:
https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=AvoidingBlackHoles
Thanks,
BC
--
Ben Cotton
main: 888.292.5320
Cycle Computing
Leader in Utility HPC Software
http://www.cyclecomputing.com
twitter: @cyclecomputing