[Condor-users] CM Failover with submits from CM



I have an interesting problem. I believe I detailed the setup of my organization's cluster/network in an earlier post, but I will repeat it here:

Nine compute machines (running STARTD and SCHEDD) sit on a private subnet and, by design, cannot be reached from the rest of my organization. They are connected to a switch that is also connected to the primary and secondary central managers (CMs). Each CM has two NICs: one with an IP on the private subnet and one with an IP on my organization's public network.

Problem: I tried submitting a job from my workstation (on my organization's public network), but the CMs tell it to talk to the STARTD at a private address, which the workstation cannot reach. I told my boss about this and asked how he wanted to proceed, and he wants to allow job submissions only from the CMs. This works, BUT he also wants failover capability. I don't foresee this working well: when a CM goes down, the submit machine goes down with it, even though the CM functionality itself fails over.

What options do I have? My boss is adamant that the nodes stay on a private subnet and that we have CM failover capability. I don't think a separate submit machine that straddles the private and public networks (like the current CMs) will work, because my boss wants no single point of failure in the system.