Re: [HTCondor-users] SharedPortServer: server was busy
- Date: Mon, 15 Feb 2016 14:40:38 -0600
- From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] SharedPortServer: server was busy
On 2/15/2016 8:45 AM, Vladimir Brik wrote:
> Hello.
> SharedPortLog file on our central manager has a lot of entries like:
> SharedPortServer: server was busy, failed to connect to collector as
> requested by <172.16.223.61:40500>: Resource temporarily unavailable
> (err=11)
> Sometimes, I see hundreds of such messages generated per second every
> few minutes.
> Is the problem that the collector doesn't respond quickly enough, or
> that shared_port can't handle the volume of connections, or something else?
It is the first case you mention - the problem is that the shared_port
tried to forward the connection to the collector, but the collector's
listen queue is full because the collector is not responsive enough.
> Are there any configuration tweaks I could try to alleviate this?
What version of HTCondor are you running? (It is always a good idea to
let us know.)
A while back we fixed a bug where the collector would periodically
pause when it was configured to use shared_port. I think this was
ultimately fixed in v8.4.4+ in the stable series or v8.5.2+ in the
developer series. If this is the problem, then simply upgrading should
fix it, or (if you cannot upgrade for some reason) you can turn off
shared port via USE_SHARED_PORT=False. This would be my first guess,
especially if your collector seemed to be doing just fine before you
started using it in conjunction with shared_port.
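As a minimal sketch of the workaround above, assuming you keep local
overrides in a file such as condor_config.local on the central manager
(the file name and location are an assumption; use whatever your site
uses), the setting would look like this:

```
# Workaround only: disable shared_port on this host until you can upgrade.
# The file location (condor_config.local) is an assumption, not something
# stated in the original report.
USE_SHARED_PORT = False
```

You can check what value the configuration resolves to with
condor_config_val USE_SHARED_PORT, and the installed version with
condor_version.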
But another possibility is that your collector is simply overloaded.
Some possible problems, with pithy solutions:
Q: Do you use strong authentication (SSL, GSI, etc.) to your collector,
especially if you have execute nodes spread out over wide-area
connections (i.e. high-latency networks)?
A: Consider horizontally scaling the collector as described here:
https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToConfigCollectors

Q: Do you have a lot (thousands) of slots behind private networks and
thus need to use CCB?
A: Consider running additional instances of the condor_collector just to
handle CCB requests, separate from your central manager collector.

Q: Do you have a lot of users or monitoring scripts constantly running
condor_status?
A: Consider increasing the COLLECTOR_QUERY_WORKERS setting in your
central manager condor_config to gain increased collector query
performance at the cost of greater memory usage.
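
To illustrate the last two suggestions, here is a rough sketch. The
worker count, the host name ccb.example.org, and the port are
illustrative assumptions, not recommendations; tune them for your pool.

```
# On the central manager: allow more concurrent collector query workers.
# The value 8 is an arbitrary example; more workers cost more memory.
COLLECTOR_QUERY_WORKERS = 8

# On execute nodes behind a private network: point CCB at a separate
# collector instance instead of the central manager collector.
# Host and port are hypothetical placeholders.
CCB_ADDRESS = ccb.example.org:9618
```

The two settings belong on different machines, so they would normally
live in different configuration files.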
Hope the above helps,
Todd