Mailing List Archives Authenticated access	UW Madison Computer Sciences Department Computer Systems Lab


[HTCondor-users] Collector load-balancing

Date: Tue, 21 Mar 2017 10:08:46 -0500
From: Vladimir Brik <vladimir.brik@xxxxxxxxxxxxxxxx>
Subject: [HTCondor-users] Collector load-balancing

Hello.

I'd like to ask for advice/feedback on scaling the collector of our glidein pool.

Some background: the number of worker nodes in the pool varies widely, but probably never exceeds 12k slots. Slot lifetime can be pretty short, so there is a lot of turnover. Many worker nodes are behind NATs and firewalls, so CCB is used. A pool password is used for authentication. Network latency is probably an issue. Lastly, our central manager is a VM with 8 virtual CPUs, and it uses shared_port.

Periodically, we've been observing spikes in the number of log entries about timeouts, disconnects, CCB and shared_port failures, and job restarts, so we've concluded that we need to run multiple collectors.

The plan is to follow https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToConfigCollectors to create 5 additional collectors with custom socket names (since we use shared_port) and to configure glidein worker nodes to pick COLLECTOR_HOST and CCB_ADDRESS randomly among those 5 additional collectors.
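
As a rough sketch of the random selection described above (not a tested configuration), the per-glidein choice could look something like the Python below. The host name, port 9618, the socket names collector1 through collector5, and the host:port?sock=name address form are all assumptions made for illustration; the actual values would come from whatever the additional collectors are configured with. Whether both settings should point at the same collector is left as a switch, since that is still an open question.

```python
import random

# Hypothetical values: the host name, port, and socket names below are
# placeholders, not taken from any real pool configuration.
CENTRAL_MANAGER = "cm.example.org"
SHARED_PORT = 9618
SOCKET_NAMES = [f"collector{i}" for i in range(1, 6)]  # 5 additional collectors


def collector_address(sock_name, host=CENTRAL_MANAGER, port=SHARED_PORT):
    """Build an address of the assumed form host:port?sock=name."""
    return f"{host}:{port}?sock={sock_name}"


def pick_glidein_config(rng=random, same_collector=True):
    """Choose COLLECTOR_HOST and CCB_ADDRESS at random for one glidein.

    With same_collector=True both settings use the same collector process;
    otherwise each is drawn independently from the list.
    """
    collector_sock = rng.choice(SOCKET_NAMES)
    ccb_sock = collector_sock if same_collector else rng.choice(SOCKET_NAMES)
    return {
        "COLLECTOR_HOST": collector_address(collector_sock),
        "CCB_ADDRESS": collector_address(ccb_sock),
    }


if __name__ == "__main__":
    # Print the two settings as they might appear in a glidein's config.
    for key, value in pick_glidein_config().items():
        print(f"{key} = {value}")
```

Run at glidein startup, this would emit one COLLECTOR_HOST line and one CCB_ADDRESS line, spreading glideins roughly evenly across the five collectors over many startups.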

Does anybody see potential issues with this? Or maybe there is a better approach? Is there anything to be careful of?

Would it be advantageous if glideins used the same collector process for both COLLECTOR_HOST and CCB_ADDRESS? Or maybe it would be advantageous to use some collectors exclusively for CCB_ADDRESS and other collectors exclusively for COLLECTOR_HOST?

I heard a few mentions of people running CCB on separate servers, but I am not sure why. Are there advantages to this if the central manager has idle cores and isn't running out of ports?



Thanks very much,

Vlad

Follow-Ups:
- Re: [HTCondor-users] Collector load-balancing
  - From: John M Knoeller
