Re: [HTCondor-users] blacklisted local host Collector
- Date: Thu, 26 Mar 2015 10:41:15 -0500
- From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] blacklisted local host Collector
On 3/26/2015 9:29 AM, Richard Crozier wrote:
> Hello,
>
> I'm running a personal condor pool on a machine with 64 nodes. Sometimes
> running
>
>   condor_status -total -debug
>
> reports
>
>   03/26/15 13:40:25 Collector 127.0.0.1 blacklisted; skipping
>
> I gather from other mailing list posts this means the localhost will be
> skipped for an hour?
>
> Can anyone suggest how to prevent this, or why it's happening? Can I
> shorten the blacklisting time, or reset the blacklisting (condor_restart
> doesn't seem to do it)?
If an HTCondor tool or daemon is attempting to query a collector and a)
that connection attempt failed, and b) it took an abnormally long period
of time to fail, then that tool or daemon will not attempt to connect
with that collector for a default of one hour. You can control the time
via the config knob DEAD_COLLECTOR_MAX_AVOIDANCE_TIME (cut-n-paste info
from section 3.3 of the HTCondor Manual is at the bottom of this email).
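If you want a shorter avoidance window, a minimal sketch (the 300-second
value is only an example; the knob takes seconds and defaults to 3600):

  # condor_config (or a file in config.d); value is in seconds
  DEAD_COLLECTOR_MAX_AVOIDANCE_TIME = 300

and then run condor_reconfig so the running daemons pick up the change.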
As to why it is happening, that is a bigger mystery. Does it happen all
the time or only on occasion? It would appear that the collector is
failing to accept the incoming connection from condor_status fast
enough. Maybe the CollectorLog can provide some clues? Random guesses:
maybe the collector process is blocked on I/O for many seconds trying to
write (perhaps to the CollectorLog) to a volume that is NFS mounted and
currently down; or perhaps the collector is being hammered by many
simultaneous instances of condor_status running in the background; or
perhaps the collector process is CPU starved because 64 jobs are running
on the same box (in which case I'd suggest setting
JOB_RENICE_INCREMENT = 10 in condor_config so that jobs run at a lower
priority than the HTCondor system services themselves).
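For that last case, a minimal sketch of the renice setting (the value 10
is the one suggested above; set it on the machine running the jobs and
reconfig so newly started jobs pick it up):

  # condor_config: run job processes at a nice level 10 higher (i.e. at
  # lower priority) than the HTCondor daemons themselves
  JOB_RENICE_INCREMENT = 10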
> I'm using the information returned by
> condor_status -total in a program to determine whether I should launch
> new jobs or not.
Why not just queue up thousands of jobs at once and be done with it? I.e.,
do a "queue 10000" in your submit file. Or if you have hundreds of
thousands/millions of jobs, you could submit them as a simple DAGMan job
and let DAGMan throttle the submissions. FWIW, DAGMan throttles
submissions not by looking at condor_status, but instead by looking at
how many jobs are idle. When too few jobs are idle, it submits new
jobs... when too many jobs are idle, it stops submitting new jobs. This
algorithm works under more situations and is simpler than looking at
machine resources and trying to figure out how many more jobs to submit.
Just food for thought.
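To make that concrete, a hypothetical submit description file (the
executable and file names are made up for illustration):

  universe   = vanilla
  executable = my_job
  arguments  = $(Process)
  output     = out.$(Process)
  error      = err.$(Process)
  log        = jobs.log
  queue 10000

If you go the DAGMan route instead, something like
"condor_submit_dag -maxidle 50 jobs.dag" caps how many node jobs may sit
idle at once (50 is just an example value).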
Hope the above helps,
Todd
From the HTCondor Manual ---
DEAD_COLLECTOR_MAX_AVOIDANCE_TIME
Defines the interval of time (in seconds) between checks for a
failed primary condor_collector daemon. If connections to the dead
primary condor_collector take very little time to fail, new attempts to
query the primary condor_collector may be more frequent than the
specified maximum avoidance time. The default value equals one hour.
This variable has relevance to flocked jobs, as it defines the maximum
time they may be reporting to the primary condor_collector without the
condor_negotiator noticing.
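A quick way to see what value your pool is actually using is to ask
condor_config_val, e.g.:

  condor_config_val DEAD_COLLECTOR_MAX_AVOIDANCE_TIME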