
Re: [HTCondor-users] OPINIONS WANTED: Are there any blatant downsides I am missing to the following Condor configuration

On 6/25/20 6:21 PM, Wesley Taylor wrote:
Hey!

I am architecting our final HTCondor configuration over here. I have an idea I am unsure about, and I would like to ask some experienced users for their opinion.

Background: we have a small, relatively homogeneous cluster (with no special universes) and fewer than 10 users. Since each user has their own workstation separate from our cluster, I thought the following configuration would suit our needs, but I want to make sure there isn't a huge disadvantage I am missing:

1. Set the Central Manager to be highly available to the point of tolerating N cluster machine failures
2. Put a Submit on each of the users' workstations (I am a little worried about the resource usage of condor_shadow and condor_schedd; my users are already running into RAM consumption issues over time as it is)
3. Place an Execute on each of the cluster machines, which would lead to the central manager being on a machine that is also executing jobs (a rough config sketch follows below)
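
In config terms, I am picturing roughly this, using HTCondor's role metaknobs (hostnames are placeholders for ours):

    # on the central manager
    CONDOR_HOST = cm.example.org
    use ROLE : CentralManager

    # on each user workstation
    CONDOR_HOST = cm.example.org
    use ROLE : Submit

    # on each cluster machine
    CONDOR_HOST = cm.example.org
    use ROLE : Execute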

Hello Wes:

If you want my expert opinion, I can clearly and without reservation say "It depends".

Keep in mind that the Central Manager holds only soft state; it can be down for a while without any running job getting interrupted. The schedds can even reuse matched machines that the crashed negotiator handed out before the crash, and can thus complete existing jobs and start new ones while the CM is down. I would say that most sites don't employ heroic methods to get high availability on their central managers. The sites that do tend to be ones whose execute machines are distributed across the planet and rely on trans-oceanic networks that sometimes become unavailable for long stretches.
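
That said, if you do decide you want a hot spare, condor_had is the stock mechanism. A rough sketch of the knobs involved, shared by both central manager machines (hostnames are placeholders; the high-availability chapter of the manual has the full recipe):

    CENTRAL_MANAGER1 = cm1.example.org
    CENTRAL_MANAGER2 = cm2.example.org
    COLLECTOR_HOST = $(CENTRAL_MANAGER1),$(CENTRAL_MANAGER2)

    HAD_PORT = 51450
    HAD_LIST = $(CENTRAL_MANAGER1):$(HAD_PORT), $(CENTRAL_MANAGER2):$(HAD_PORT)
    HAD_USE_PRIMARY = true        # prefer the first machine in HAD_LIST

    REPLICATION_PORT = 51451
    REPLICATION_LIST = $(CENTRAL_MANAGER1):$(REPLICATION_PORT), $(CENTRAL_MANAGER2):$(REPLICATION_PORT)
    HAD_USE_REPLICATION = true    # keep accounting state in sync between CMs

    # HAD decides which master actually runs the negotiator
    DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, HAD, REPLICATION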

Generally speaking, putting the schedds closer to their users is a great idea -- it enables you to "submit locally, run globally". While the central manager holds only soft state, the most performance-sensitive component in a Condor system is typically the schedd, so having more of them is a good idea. You don't mention how many jobs any one schedd is expected to run, but we work very hard to keep the shadows lightweight, and a healthy modern desktop can support tens of thousands of running shadows. Perhaps deploying HTCondor will let your users move their memory-heavy jobs off their local machines and out onto your cluster?

Note that if the schedd, or the machine it runs on, crashes and stays down for an extended period, the worker nodes running that schedd's jobs will detect this and evict those jobs. When the schedd restarts, Condor will restart those jobs from the beginning. If the schedd comes back quickly enough, it will reconnect to the running jobs and nothing will be evicted.
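
The window for that reconnect is the job lease, which defaults to 40 minutes and can be stretched from the submit file if your schedd machines might take longer to come back. A minimal sketch (executable name is hypothetical):

    # my_job.sub -- an ordinary submit file
    executable = my_job
    # seconds the execute machine keeps the claim alive while waiting
    # for the schedd to reconnect before evicting the job
    job_lease_duration = 7200
    queue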

Depending on the size of the cluster, we would typically recommend not setting up the central manager to also execute jobs. If you run with host-based security (the default), this can lead to security problems, because jobs then execute on a host the rest of the pool trusts. For a very small cluster with a handful of worker nodes, the few extra cores gained by allowing execution on the central manager may be worthwhile, but as you add worker nodes you really want a dedicated machine for your CM.
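
For completeness, turning the CM into a worker node is just a matter of also running a startd there, something like the following sketch -- which also shows why the security caveat bites:

    # central manager that also executes jobs (small pools only)
    DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, STARTD
    # caveat: with host-based security, arbitrary user jobs now run on a
    # host the rest of the pool already trusts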

Good luck,

-greg