Hi again,

two other settings seem to be set to "non-standard" values, apart from the allow list: LOCAL_DIR is set to /var/condor instead of /var, and CONDOR_IDS is set to 1000.1000 (user r-admin). While that worked in the previous iteration of the cluster, jobs now switch back and forth between running and idle.

SchedLog says:
"SetEffectiveOwner security violation: setting owner to r-admin when active owner is "condor""

ShadowLog says:
"SetEffectiveOwner(r-admin) failed with errno=13: Permission denied."

How do I find out which folders/files are affected and maybe have wrong permissions? There are no files owned by "condor" on the network share /clients (where the spool and FS_REMOTE dirs now reside) and none relevant on the local hard drive. The only files found by "find / -xdev -user condor" are the standard condor directories under /var, which are all empty (due to LOCAL_DIR being set to /var/condor).

Kind regards
Christian Hennen

-----Original Message-----
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Hennen, Christian
Sent: Wednesday, 14 October 2020 17:58
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] HTCondor high availability

Hi Todd,

exactly. While obviously security is important and has nothing to do with the HA setup itself, it was a surprise to me to have to configure security for the communication between the masters. That's mainly because I "inherited" this cluster and the original config contained * in the allow list, so I never experienced these types of issues. Securing the HTCondor part of the cluster is now added to my list of planned security changes :) For now, since the cluster is completely separated from the rest of the network, working job processing and high availability of all services were more of a priority.

After changing SEC_DEFAULT_AUTHENTICATION_METHODS to FS,FS_REMOTE condor_q now works as expected and jobs can be submitted and started.
They change to Idle after a while, but maybe that's not related to the HTCondor config.

Kind regards
Christian Hennen

-----Original Message-----
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Todd L Miller
Sent: Friday, 9 October 2020 22:21
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] HTCondor high availability

> Do I need to configure any other authentication methods in addition to
> all servers using LDAP via PAM ?

Yes, of course. Security between different nodes has nothing to do with how users log in.

> I tried to set the variable as you suggested, to no avail. Master2 now
> says it can't connect to master1 ("Failed to fetch ads")

From your description, master1 is the original "master" node. I don't know if HAD will work for machines that are both submit nodes and central managers, but for now let's assume that it will.

Note that the HA instructions do NOT address security at all; that's deliberate, because security is complicated and nothing in HA changes anything about how your security should work, except the addition of another server. It's a bit more of a surprise to you, perhaps, because you didn't separate your central manager from your submit server (and thus FS worked for all your client-to-daemon connections).

From your serverfault question, it looks like you basically don't have any security at all -- your ALLOW lists include *, so the problem must be in authentication, not authorization. Note that condor_q, by default in recent HTCondor versions, requires authentication so that it only returns the jobs of the user who ran the command. Try running 'condor_q -allusers'; I think that will use a different command that doesn't require authentication.

For this purpose, given that you know that the two masters share a filesystem and user IDs, FS_REMOTE is not a bad choice.
You'll need to set SEC_DEFAULT_AUTHENTICATION_METHODS on master1 and master2 to include FS and FS_REMOTE; I would remove KERBEROS (since you're not using it). Both master1 and master2 need to set FS_REMOTE_DIR to the same value. Be sure to restart HTCondor on both machines after you've done that (I can't keep straight which configuration changes only require a reconfig).

Try running condor_q again; it should work. If it doesn't, try running

_CONDOR_TOOL_DEBUG=D_FULLDEBUG condor_q -debug

and we'll see what we can see.

- ToddM

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/
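
For reference, the authentication settings discussed in this thread could be sketched as the configuration fragment below. This is only an illustration, not the poster's actual config: the directory path is a hypothetical placeholder (the thread only says the FS_REMOTE directory lives on the /clients share), and the same fragment would have to be present on both master1 and master2.

```
# Sketch only; assumes both masters share a filesystem and user IDs,
# as described in the thread.
SEC_DEFAULT_AUTHENTICATION_METHODS = FS, FS_REMOTE

# Hypothetical path on the shared /clients filesystem; the real location
# is not given in the thread. It must be identical on master1 and master2.
FS_REMOTE_DIR = /clients/condor/fs_remote
```

As noted above, HTCondor should be restarted on both machines after changing these values.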