Re: [HTCondor-users] Collector daemon crashing on Windows due to file descriptor limit

Mailing List Archives Authenticated access	UW Madison Computer Sciences Department Computer Systems Lab

Hi John,

Thanks for the detailed response!

It looks like we will move to a Linux-based CM indeed, but the suggestion re: child collectors sounds very promising and will hopefully at least tide us over until the migration is complete.

Is there a rough magnitude for how many connections a Linux-based CM can support?

Kind regards,

Peet Whittaker

Discipline Lead for DevOps | Principal Software Developer

From: John M Knoeller <johnkn@xxxxxxxxxxx>
Sent: 01 June 2022 16:11
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Collector daemon crashing on Windows due to file descriptor limit

This is a known limitation of the Windows collector: it supports at most 1014 connections (1024 - 10).

By default, each execute node will use 2 connections, one for the condor_master daemon, and one for the condor_startd daemon.

You can work around this by adding more collectors. In the simplest case, you can send the condor_master ads to a different collector than the condor_startd ads; this will allow your pool to grow to about 1000 execute nodes.

The more general case is a tree of collectors, with child collectors forwarding ads to a top-level collector. Instructions for configuring that are here: https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToConfigCollectors

Or you can switch to using a Linux collector/negotiator, which has a much higher connection limit.
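As a rough illustration of the numbers above, here is a minimal Python sketch of the arithmetic. It assumes the 1014-connection limit and the 2-connections-per-node default quoted in this message; the function names are hypothetical, and it ignores connections used by tools, the schedd, or child-to-parent forwarding, so real pools will hit the limit somewhat earlier.

```python
import math

# Windows collector connection limit quoted above (1024 - 10).
FD_LIMIT = 1014


def max_execute_nodes(connections_per_node: int, limit: int = FD_LIMIT) -> int:
    """Upper bound on execute nodes a single collector can serve."""
    return limit // connections_per_node


def child_collectors_needed(nodes: int, connections_per_node: int = 2,
                            limit: int = FD_LIMIT) -> int:
    """Minimum number of child collectors so that none exceeds the limit.

    Assumes nodes are spread evenly across the child collectors.
    """
    return math.ceil(nodes * connections_per_node / limit)


# Default: condor_master and condor_startd both report to the same collector.
print(max_execute_nodes(2))           # 507
# condor_master ads sent to a different collector than condor_startd ads.
print(max_execute_nodes(1))           # 1014
# Tree of collectors for a hypothetical 2000-node pool.
print(child_collectors_needed(2000))  # 4
```

The first figure lines up with the crashes starting at roughly 500 nodes, and the second with the "about 1000 execute nodes" estimate.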

The

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Peet Whittaker
Sent: Tuesday, May 31, 2022 2:53 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] Collector daemon crashing on Windows due to file descriptor limit

Hi,

We’re running a vanilla universe Condor pool on AWS that automatically scales up and down based on the job queue.

The pool consists of a Windows-based central manager (running the schedd, collector, negotiator and credd) and Windows-based execute nodes.

Generally everything works well. However, once the number of nodes exceeds ~500 (~3000 slots), the collector daemon starts repeatedly crashing every 10 mins (it’s quite regular).

...

04/28/22 21:34:00 Got QUERY_SCHEDD_ADS

04/28/22 21:34:00 (Sending 1 ads in response to query)

04/28/22 21:34:00 Query info: matched=1; skipped=0; query_time=0.000041; send_time=0.000105; type=Scheduler; requirements={((stricmp(Name,"ABC.XYZ.com") == 0))}; locate=1; limit=0; from=TOOL; peer=<10.0.0.252:51634>; projection={MyAddress AddressV1 CondorVersion CondorPlatform Name Machine}

04/28/22 21:34:01 MasterAd : Inserting ** "< EC2AMAZ-IO96AHI.XYZ.com >"

04/28/22 21:34:01 WARNING: cannot register TCP update socket from <10.1.1.238:50279>: file descriptor safety level exceeded: limit 1014, registered socket count 1014, fd 5364

04/28/22 21:34:12 StartdAd : Inserting ** "< slot4@xxxxxxxxxxxxxxxxxxxxxxx , 10.1.1.238 >"

04/28/22 21:34:12 StartdPvtAd : Inserting ** "< slot4@xxxxxxxxxxxxxxxxxxxxxxx , 10.1.1.238 >"

04/28/22 21:34:12 WARNING: cannot register TCP update socket from <10.1.1.238:50291>: file descriptor safety level exceeded: limit 1014, registered socket count 1014, fd 5220

04/28/22 21:34:20 MasterAd : Inserting ** "< EC2AMAZ-KOU1A4V.XYZ.com >"

04/28/22 21:34:20 WARNING: cannot register TCP update socket from <10.1.4.192:59370>: file descriptor safety level exceeded: limit 1014, registered socket count 1015, fd 5368

04/28/22 21:34:20 MasterAd : Inserting ** "< EC2AMAZ-G6I727N.XYZ.com >"

04/28/22 21:34:20 WARNING: cannot register TCP update socket from <10.1.0.50:56786>: file descriptor safety level exceeded: limit 1014, registered socket count 1016, fd 5348

04/28/22 21:34:20 ERROR "Selector::add_fd(): read fd_set is full" at line 261 in file C:\condor\execute\dir_6408\sources\src\condor_utils\selector.cpp

04/28/22 21:34:30 ******************************************************

04/28/22 21:34:30 ** condor_collector.exe (CONDOR_COLLECTOR) STARTING UP

...
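For anyone wanting to watch how close a collector is to the limit, the WARNING lines above can be parsed for the reported limit and registered socket count. This is only a sketch: the regular expression is written against the message format in this excerpt, which may differ in other HTCondor versions, and the function name is hypothetical.

```python
import re

# Matches the WARNING text shown in the log excerpt above.
PATTERN = re.compile(
    r"file descriptor safety level exceeded: "
    r"limit (\d+), registered socket count (\d+)"
)


def socket_counts(log_lines):
    """Return a (limit, registered_count) tuple for each matching line."""
    results = []
    for line in log_lines:
        match = PATTERN.search(line)
        if match:
            results.append((int(match.group(1)), int(match.group(2))))
    return results


sample = [
    "04/28/22 21:34:01 WARNING: cannot register TCP update socket from "
    "<10.1.1.238:50279>: file descriptor safety level exceeded: "
    "limit 1014, registered socket count 1014, fd 5364",
    '04/28/22 21:34:20 MasterAd : Inserting ** "< EC2AMAZ-G6I727N.XYZ.com >"',
    "04/28/22 21:34:20 WARNING: cannot register TCP update socket from "
    "<10.1.0.50:56786>: file descriptor safety level exceeded: "
    "limit 1014, registered socket count 1016, fd 5348",
]

print(socket_counts(sample))  # [(1014, 1014), (1014, 1016)]
```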

Restarting the central manager doesn’t help. The central manager also doesn’t seem to be under any particular memory or CPU pressure.

Any pointers/ideas on how to fix this would be greatly appreciated!

Relevant Condor version info:

$CondorVersion: 8.8.12 Nov 24 2020 BuildID: 524104 $

$CondorPlatform: x86_64_Windows10 $

Kind regards,

Peet Whittaker

Discipline Lead for DevOps | Principal Software Developer

JBA Consulting, 1 Broughton Park, Old Lane North, Broughton, Skipton, North Yorkshire, BD23 3FD. Telephone: +441756699500


