[HTCondor-users] Collector load-balancing
Hello.
I'd like to ask for advice/feedback on scaling the collector of our
glidein pool.
Some background: the number of worker nodes in the pool varies widely,
but probably never exceeds 12k slots. Slot lifetime can be pretty short,
so there is a lot of turnover. Many worker nodes are behind NATs and
firewalls, so CCB is used. A pool password is used for authentication.
Network latency is probably an issue. Lastly, our central manager is a
VM with 8 virtual CPUs, and it uses shared_port.
Periodically, we've been observing spikes in the number of log entries
about timeouts, disconnects, CCB and shared_port failures, and job
restarts, so we've concluded that we need to run multiple collectors.
The plan is to follow
https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToConfigCollectors
to create 5 additional collectors with custom socket names (since we use
shared_port) and to configure glidein worker nodes to pick
COLLECTOR_HOST and CCB_ADDRESS randomly among those 5 additional collectors.
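As a rough illustration of the random selection described above (a
sketch only, not a tested glidein setup): the host name, port, and
socket names below are placeholders, and the same_for_ccb flag is a
hypothetical switch between using one collector for both settings and
picking the two independently.

```python
import random

# Placeholder values, not taken from the real pool.
CM_HOST = "cm.example.org"   # hypothetical central manager host
SHARED_PORT = 9618           # assumed shared_port port
SOCK_NAMES = [f"collector{i}" for i in range(1, 6)]  # 5 assumed socket names


def address(sock_name):
    """Build a host:port?sock=name address for one collector."""
    return f"{CM_HOST}:{SHARED_PORT}?sock={sock_name}"


def pick_collectors(rng, same_for_ccb=True):
    """Return (COLLECTOR_HOST, CCB_ADDRESS) values for one glidein.

    With same_for_ccb=True both settings point at the same collector;
    otherwise the CCB collector is drawn independently.
    """
    collector_sock = rng.choice(SOCK_NAMES)
    ccb_sock = collector_sock if same_for_ccb else rng.choice(SOCK_NAMES)
    return address(collector_sock), address(ccb_sock)


def config_lines(rng, same_for_ccb=True):
    """Format the picked addresses as two configuration lines."""
    collector_host, ccb_address = pick_collectors(rng, same_for_ccb)
    return [
        f"COLLECTOR_HOST = {collector_host}",
        f"CCB_ADDRESS = {ccb_address}",
    ]


if __name__ == "__main__":
    for line in config_lines(random.Random()):
        print(line)
```

Each glidein would run something like this once at startup and write
the two lines into its local configuration, so that the load spreads
roughly evenly across the 5 collectors.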
Does anybody see potential issues with this? Is there a better
approach, or anything to be careful of?
Would it be advantageous if glideins used the same collector process for
both COLLECTOR_HOST and CCB_ADDRESS? Or would it be better to dedicate
some collectors exclusively to CCB_ADDRESS and others exclusively to
COLLECTOR_HOST?
I heard a few mentions of people running CCB on separate servers, but I
am not sure why. Are there advantages to this if the central manager has
idle cores and isn't running out of ports?
Thanks very much,
Vlad