
Re: [Condor-users] Computers missing from Condor pool



Yes, you would need to increase the per-process limit on file 
descriptors.  You can do that in the condor init script, for example.
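
As a rough illustration of what raising the per-process limit involves, here is a minimal Python sketch that raises the soft limit on open file descriptors for the current process, capped at the hard limit. The function name raise_nofile_limit is made up for this example and is not part of Condor. In a shell init script the equivalent step would be a `ulimit -n` call before the daemons are started, since child processes inherit the limit.

```python
import resource


def raise_nofile_limit(wanted):
    """Raise the soft RLIMIT_NOFILE toward `wanted`, never above the hard limit.

    Hypothetical helper for illustration only. Returns the (soft, hard)
    pair that is in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

    # An unprivileged process may raise its soft limit only up to the hard limit.
    if hard == resource.RLIM_INFINITY:
        target = wanted
    else:
        target = min(wanted, hard)

    # Only ever raise the limit; leave it alone if it is already high enough.
    if soft != resource.RLIM_INFINITY and target > soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))

    return resource.getrlimit(resource.RLIMIT_NOFILE)
```

Raising the hard limit itself generally requires root, which an init script normally has.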
I am not aware of anybody who has tested the scalability of the 
collector with ~10,000 open TCP connections.  Up until Condor 6.9.3, 
Condor would fail with more than 1024, so don't try it in 6.8.  In the 
worst case, if there are scalability problems in the collector due to 
number of open connections, you would be able to work around the problem 
by having a bank of N collectors with each execute machine configured to 
report to just one.  Then (in 7.0.1) you can configure these collectors 
to forward ClassAds to a single collector that is used for matchmaking 
purposes.  This forwarding happens via UDP, but since it would be Linux 
to Linux, you shouldn't suffer from the Windows UDP problem.
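
To make the "bank of N collectors" idea concrete, here is a small Python sketch of one possible way to decide which collector each execute machine reports to: hash the machine's hostname and take it modulo N. This is only an assumption about how the assignment could be done, not something Condor does for you; the function name and the example.org hostnames are hypothetical, and the Condor configuration that actually points a machine at its collector is not shown.

```python
import hashlib


def pick_collector(execute_host, collectors):
    """Map an execute machine to exactly one collector in the bank.

    Uses a stable hash of the hostname so the same machine always
    lands on the same collector, across runs and across machines.
    """
    digest = hashlib.sha1(execute_host.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(collectors)
    return collectors[index]


# Hypothetical bank of three collectors.
bank = [
    "collector1.example.org",
    "collector2.example.org",
    "collector3.example.org",
]
assigned = pick_collector("node0001.example.org", bank)
```

A hash-based assignment keeps the load on each collector roughly even without maintaining a lookup table, at the cost of reshuffling most machines whenever N changes.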
Of course, the real solution is for Condor to work around the Windows 
UDP problem if at all possible.  I hope this will be addressed soon.
--Dan

Rob de Graaf wrote:

Hi Erik,

Thank you for your reply. I've been wary of changing to TCP because of the warnings in condor_config and the manual, as well as the effect it might have on network / system load, but I'm willing to explore this option further.
From the manual, I understand I need to set COLLECTOR_SOCKET_CACHE_SIZE 
to the number of machines in the pool, multiplied by the number of 
daemons per machine, and the collector process will need to be able to 
manage at least that many file descriptors. In our case, this means the 
collector would need at least 10,000 file descriptors.
The default OS-wide limit on file descriptors seems high enough at 
206,151, but the default per-process limit on file descriptors in Linux 
seems to be 1024, so to enable TCP updates I'd have to increase that by 
a factor of 10. Is that a safe thing to do?
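
The sizing rule described above (machines in the pool multiplied by daemons per machine) can be sketched in Python as follows. The pool figures in the example are illustrative only, chosen so the product matches the 10,000 mentioned above; the `reserved` allowance for log files and other descriptors is an assumption of this sketch, not a number from the manual.

```python
def socket_cache_size(machines, daemons_per_machine):
    """Value suggested for COLLECTOR_SOCKET_CACHE_SIZE by the rule above."""
    return machines * daemons_per_machine


def fits_in_limit(cache_size, fd_limit, reserved=64):
    """Check whether the collector's per-process descriptor limit is enough.

    `reserved` is a guessed allowance for descriptors the collector
    needs besides the cached sockets (logs, listening sockets, ...).
    """
    return cache_size + reserved <= fd_limit


# Illustrative numbers only: 5000 machines with 2 daemons each.
needed = socket_cache_size(5000, 2)      # 10000
ok_default = fits_in_limit(needed, 1024)   # False: the default limit is too low
ok_raised = fits_in_limit(needed, 16384)   # True
```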
Regards,

Rob de Graaf

Erik Paulson wrote:
On Tue, Feb 26, 2008 at 03:34:39PM +0100, Rob de Graaf wrote:
The suggested fix, adding a delay by setting the D_NETWORK debug flag, has been applied on all computers and has had some effect: the average pool size has gone up, but not by as much as we had hoped. Ping sweeps still reveal many more live machines that do not appear in the pool, which leads us to believe there is still some other problem.
We've looked at master and startd log files but we haven't been able to 
find anything seriously wrong, and we're running out of ideas.
What could be causing computers to sometimes be missing from our pool, 
and what else can we do to find them?
     

Turn on TCP updates to the collector, instead of UDP.

-Erik

_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users

The archives can be found at: https://lists.cs.wisc.edu/archive/condor-users/