Hi John,

Thanks for the detailed response! It looks like we will move to a Linux-based CM indeed, but the suggestion re: child collectors sounds very promising and will hopefully at least tide us over until the migration is complete.

Is there a rough magnitude for how many connections a Linux-based CM can support?

Kind regards,
Peet Whittaker
Discipline Lead for DevOps | Principal Software Developer

From: John M Knoeller <johnkn@xxxxxxxxxxx>

This is a known limitation of the Windows collector: it has a maximum of 1014 connections (1024 - 10). By default, each execute node will use 2 connections, one for the condor_master daemon and one for the condor_startd daemon.
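
As a rough check of the numbers above, the arithmetic can be sketched in a few lines of Python. The constant and function names are hypothetical and chosen for illustration only; the values 1024, 10, and 2 come from the message above, and the sketch ignores any connections used by tools or other daemons.

```python
# Hypothetical names; the values are taken from the message above.
FD_SET_SIZE = 1024          # the "1024" in (1024 - 10)
HEADROOM = 10               # the "10" in (1024 - 10)
CONNECTIONS_PER_NODE = 2    # condor_master + condor_startd


def max_execute_nodes(limit: int, connections_per_node: int) -> int:
    """Largest whole number of execute nodes that fits under the limit."""
    return limit // connections_per_node


limit = FD_SET_SIZE - HEADROOM
print(limit)                                           # 1014
print(max_execute_nodes(limit, CONNECTIONS_PER_NODE))  # 507
print(max_execute_nodes(limit, 1))                     # 1014
```

With two connections per node the ceiling is 507 nodes, which is consistent with the crashes starting at around 500 nodes. With one connection per node it is 1014.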
You can work around this by adding more collectors. In the simplest case, you can send the condor_master ads to a different collector than the condor_startd ads; this will allow your pool to grow to about 1000 execute nodes. The more general case is a tree of collectors, with child collectors forwarding ads to a top-level collector. There are instructions on how to configure that here:
https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToConfigCollectors

Or you can switch to using a Linux collector/negotiator, which has a much higher connection limit.
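
For the tree-of-collectors option, a rough sizing estimate might look like the sketch below. It assumes (this is not stated in the thread) that each child collector is itself a Windows collector subject to the same 1014-connection limit, that every execute node holds 2 connections to its child collector, and that no other connections count against the limit. The function name and parameters are hypothetical.

```python
import math


def child_collectors_needed(execute_nodes: int,
                            per_child_limit: int = 1014,
                            connections_per_node: int = 2) -> int:
    """Estimate how many child collectors a pool of this size would need.

    Assumption: each child collector has the same connection limit and
    each execute node uses `connections_per_node` connections to it.
    """
    nodes_per_child = per_child_limit // connections_per_node
    return math.ceil(execute_nodes / nodes_per_child)


print(child_collectors_needed(500))   # 1
print(child_collectors_needed(2000))  # 4
```

In practice it would be sensible to leave some headroom below the limit, since queries from tools also open connections to a collector.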

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Peet Whittaker

Hi,

We’re running a vanilla universe Condor pool on AWS that automatically scales up and down based on the job queue. The pool consists of a Windows-based central manager (running the schedd, collector, negotiator and credd) and Windows-based execute nodes.

Generally everything works well. However, once the number of nodes exceeds ~500 (~3000 slots), the collector daemon starts repeatedly crashing every 10 mins (it’s quite regular).

...
04/28/22 21:34:00 Got QUERY_SCHEDD_ADS
04/28/22 21:34:00 (Sending 1 ads in response to query)
04/28/22 21:34:00 Query info: matched=1; skipped=0; query_time=0.000041; send_time=0.000105; type=Scheduler; requirements={((stricmp(Name,"ABC.XYZ.com") == 0))}; locate=1; limit=0; from=TOOL; peer=<10.0.0.252:51634>; projection={MyAddress AddressV1 CondorVersion CondorPlatform Name Machine}
04/28/22 21:34:01 MasterAd : Inserting ** "< EC2AMAZ-IO96AHI.XYZ.com >"
04/28/22 21:34:01 WARNING: cannot register TCP update socket from <10.1.1.238:50279>: file descriptor safety level exceeded: limit 1014, registered socket count 1014, fd 5364
04/28/22 21:34:12 StartdAd : Inserting ** "< slot4@xxxxxxxxxxxxxxxxxxxxxxx , 10.1.1.238 >"
04/28/22 21:34:12 StartdPvtAd : Inserting ** "< slot4@xxxxxxxxxxxxxxxxxxxxxxx , 10.1.1.238 >"
04/28/22 21:34:12 WARNING: cannot register TCP update socket from <10.1.1.238:50291>: file descriptor safety level exceeded: limit 1014, registered socket count 1014, fd 5220
04/28/22 21:34:20 MasterAd : Inserting ** "< EC2AMAZ-KOU1A4V.XYZ.com >"
04/28/22 21:34:20 WARNING: cannot register TCP update socket from <10.1.4.192:59370>: file descriptor safety level exceeded: limit 1014, registered socket count 1015, fd 5368
04/28/22 21:34:20 MasterAd : Inserting ** "< EC2AMAZ-G6I727N.XYZ.com >"
04/28/22 21:34:20 WARNING: cannot register TCP update socket from <10.1.0.50:56786>: file descriptor safety level exceeded: limit 1014, registered socket count 1016, fd 5348
04/28/22 21:34:20 ERROR "Selector::add_fd(): read fd_set is full" at line 261 in file C:\condor\execute\dir_6408\sources\src\condor_utils\selector.cpp
04/28/22 21:34:30 ******************************************************
04/28/22 21:34:30 ** condor_collector.exe (CONDOR_COLLECTOR) STARTING UP
...

Restarting the central manager doesn’t help. The central manager also doesn’t seem to be under any particular memory or CPU pressure.

Any pointers/ideas on how to fix this would be greatly appreciated!

Relevant Condor version info:

$CondorVersion: 8.8.12 Nov 24 2020 BuildID: 524104 $
$CondorPlatform: x86_64_Windows10 $

Kind regards,
Peet Whittaker
Discipline Lead for DevOps | Principal Software Developer
JBA Consulting, 1 Broughton Park, Old Lane North, Broughton, Skipton, North Yorkshire, BD23 3FD.
Telephone: +441756699500