
Re: [HTCondor-users] OPINIONS WANTED: Are there any blatant downsides I am missing to the following Condor configuration

On 6/25/20 6:21 PM, Wesley Taylor wrote:
Hey!

I am architecting our final HTCondor configuration over here. I have an idea I am unsure about, and I would like to ask some experienced users for their opinion.

Background: we have a small, relatively homogeneous cluster (with no special universes) and fewer than 10 users. Since each user has their own workstation separate from our cluster, I thought the following configuration would suit our needs, but I want to make sure there isn't a huge disadvantage I am missing:

1. Set the Central Manager to be highly available to the point of tolerating N cluster machine failures
2. Put a Submit on each of the users' workstations (I am a little worried about the resource usage of condor_shadow and condor_schedd; my users are already running into RAM consumption issues over time as it is)
3. Place an Execute on each of the cluster machines, which would lead to the central manager being on a machine that is also executing jobs (a rough config sketch follows below)
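
In config terms, I am picturing roughly this, using HTCondor's role metaknobs (hostnames are placeholders for ours):

    # on the central manager
    CONDOR_HOST = cm.example.org
    use ROLE : CentralManager

    # on each user workstation
    CONDOR_HOST = cm.example.org
    use ROLE : Submit

    # on each cluster machine
    CONDOR_HOST = cm.example.org
    use ROLE : Execute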

Hello Wes:

If you want my expert opinion, I can clearly and without reservation say "It depends".

Keep in mind that the Central Manager holds only soft state; it can be down for a while without any running job getting interrupted. The schedds can even reuse matched machines that the crashed negotiator handed out before the crash, and can thus complete existing jobs and start new ones while the CM is down. I would say that most sites don't employ heroic methods to get high availability on their central managers. The sites that do tend to be ones whose execute machines are distributed across the planet and rely on trans-oceanic networks that sometimes become unavailable for long stretches.
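
That said, if you do decide you want a hot spare, condor_had is the stock mechanism. A rough sketch of the knobs involved, shared by both central manager machines (hostnames are placeholders; the high-availability chapter of the manual has the full recipe):

    CENTRAL_MANAGER1 = cm1.example.org
    CENTRAL_MANAGER2 = cm2.example.org
    COLLECTOR_HOST = $(CENTRAL_MANAGER1),$(CENTRAL_MANAGER2)

    HAD_PORT = 51450
    HAD_LIST = $(CENTRAL_MANAGER1):$(HAD_PORT), $(CENTRAL_MANAGER2):$(HAD_PORT)
    HAD_USE_PRIMARY = true        # prefer the first machine in HAD_LIST

    REPLICATION_PORT = 51451
    REPLICATION_LIST = $(CENTRAL_MANAGER1):$(REPLICATION_PORT), $(CENTRAL_MANAGER2):$(REPLICATION_PORT)
    HAD_USE_REPLICATION = true    # keep accounting state in sync between CMs

    # HAD decides which master actually runs the negotiator
    DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, HAD, REPLICATION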

Generally speaking, putting the schedds closer to their users is a great idea -- it enables you to "submit locally, run globally". While the central manager holds only soft state, the most performance-sensitive component in a Condor system is typically the schedd, so having more of them is a good idea. You don't mention how many jobs any one schedd is expected to run, but we work very hard to keep the shadows lightweight, and a healthy modern desktop can support tens of thousands of running shadows. Perhaps deploying HTCondor will let your users move their memory-heavy jobs off their local machines and out onto your cluster?

Note that if the schedd, or the machine it runs on, crashes and stays down for an extended period, the worker nodes running that schedd's jobs will detect this and evict those jobs. When the schedd restarts, Condor will restart those jobs from the beginning. If the schedd comes back quickly enough, it will reconnect to the running jobs and nothing will be evicted.
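
The window for that reconnect is the job lease, which defaults to 40 minutes and can be stretched from the submit file if your schedd machines might take longer to come back. A minimal sketch (executable name is hypothetical):

    # my_job.sub -- an ordinary submit file
    executable = my_job
    # seconds the execute machine keeps the claim alive while waiting
    # for the schedd to reconnect before evicting the job
    job_lease_duration = 7200
    queue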

Depending on the size of the cluster, we would typically recommend not setting up the central manager to also execute jobs. If you run with host-based security (the default), this can lead to security problems, because jobs then execute on a host the rest of the pool trusts. For a very small cluster with a handful of worker nodes, the few extra cores gained by allowing execution on the central manager may be worthwhile, but as you add worker nodes you really want a dedicated machine for your CM.
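
For completeness, turning the CM into a worker node is just a matter of also running a startd there, something like the following sketch -- which also shows why the security caveat bites:

    # central manager that also executes jobs (small pools only)
    DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, STARTD
    # caveat: with host-based security, arbitrary user jobs now run on a
    # host the rest of the pool already trusts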

Good luck,

-greg