On 5/22/2017 6:45 AM, lejeczek wrote:
hi fellas
I've only started looking at htcondor, not having a good
understanding of it yet I wonder - htcondor has that
concept of "central manager" and I wonder if this makes
it a valid candidate for HA setup?
Does anybody have any experience with/thoughts on
htcondor as HA and could share it here?
many thanks
L.
Hi,
First off, understand that if your installation's central
manager dies, currently running jobs will continue to run
and even new jobs will continue to get scheduled in many
cases (i.e. new jobs will still get scheduled to claimed
slots). Even in production pools, most sites have no
problem with rebooting their central manager or even
taking it down for an hour or two - while the central
manager is down, users may notice that condor_status stops
working, but practically all other common tools continue
to work (condor_submit, condor_q, condor_rm, etc). Thus
many pools never bother with an HA solution for the
central manager.
If you are still concerned, the HTCondor central manager
is actually very lightweight and holds very little state
(just user priorities), so it is well suited to a
high availability (HA) setup. You essentially have two
choices:
1. HTCondor can be configured to have two central managers
(hot/hot), and automatically fail over as needed. See the
section in the HTCondor Manual titled "High Availability
of the Central Manager" at
http://research.cs.wisc.edu/htcondor/manual/v8.6/3_13High_Availability.html#SECTION004132200000000000000
2. If you already run your services in a managed
virtualized setup (Mesos+Marathon, OpenStack, vSphere,
Hyper-V, etc) that supports failover, you could set up your
HTCondor central manager for HA leveraging those
environments, i.e. the same way you would set up a redundant
email server, for instance.
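To give a rough idea of what option 1 involves, here is a
minimal, incomplete configuration sketch for two central
managers. The hostnames and port numbers are placeholders I
made up for illustration, and the knob names are as I
recall them from the manual section linked above. The
manual's sample configuration also sets daemon arguments,
timeouts, and security settings that are left out here, so
please check it for your HTCondor version before using any
of this.

```
# Hypothetical hostnames; substitute your own.
CENTRAL_MANAGER1 = cm1.example.com
CENTRAL_MANAGER2 = cm2.example.com

# Every machine in the pool should know about both collectors.
COLLECTOR_HOST = $(CENTRAL_MANAGER1),$(CENTRAL_MANAGER2)

# The remaining settings apply to the two central managers only.
# Port numbers are placeholders.
HAD_PORT = 51450
REPLICATION_PORT = 41450

# The list must be identical (same order) on both central managers.
HAD_LIST = $(CENTRAL_MANAGER1):$(HAD_PORT),$(CENTRAL_MANAGER2):$(HAD_PORT)
REPLICATION_LIST = $(CENTRAL_MANAGER1):$(REPLICATION_PORT),$(CENTRAL_MANAGER2):$(REPLICATION_PORT)

# Prefer the first host in HAD_LIST whenever it is up.
HAD_USE_PRIMARY = true

# Copy the negotiator state (user priorities) between the two hosts.
HAD_USE_REPLICATION = true

# Let the HAD daemon decide which host runs the negotiator.
MASTER_NEGOTIATOR_CONTROLLER = HAD

DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, HAD, REPLICATION
```

With something like this in place, both machines run a
collector, while only one of them at a time runs the
negotiator.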
Hope the above helps
Todd