
Re: [HTCondor-users] HTCondor high availability



Hi!

The problem appears to be unrelated to folder permissions or other system-level authentication mechanisms. Instead, it seems to lie in HTCondor's own authentication mechanism:

04/22/21 23:50:25 (fd:19) (pid:2200) (D_COMMAND) Calling HandleReq <handle_q> (0) for command 1112 (QMGMT_WRITE_CMD) from condor@child <192.168.1.21:22171>
04/22/21 23:50:25 (fd:19) (pid:2200) (D_SYSCALLS) Got request #10030
04/22/21 23:50:25 (fd:19) (pid:2200) (D_ALWAYS) SetEffectiveOwner security violation: setting owner to testuser when active owner is "condor"

Condor can't change the user to 'testuser' while performing the final update to the job queue. The complete log output, obtained from SchedLog at the highest verbosity, is below. I presume condor@child and condor@parent are internal names for processes. But shouldn't they authenticate via "FS" or "FS_REMOTE"? Is anything missing from the configuration?

Does anybody have an idea how to fix this? Any help would be greatly appreciated!

Kind regards
Christian Hennen

04/22/21 23:50:25 (fd:19) (pid:2200) (D_SECURITY) DC_AUTHENTICATE: received DC_AUTHENTICATE from <192.168.1.21:22171>
04/22/21 23:50:25 (fd:19) (pid:2200) (D_SECURITY) DC_AUTHENTICATE: received following ClassAd:
Enact = "YES"
ParentUniqueID = "master1:2200:1619090561"
Encryption = "NO"
Authentication = "NO"
SessionLease = 3600
OutgoingNegotiation = "REQUIRED"
CryptoMethods = "3DES"
Integrity = "NO"
AuthMethods = "FS, FS_REMOTE"
Sid = "be414c0297be797745f4ebd5b3fae3e8af918d8b562a0af0"
UseSession = "YES"
TriedAuthentication = true
ServerCommandSock = "<192.168.1.21:9618?addrs=192.168.1.21-9618&noUDP&sock=2200_d953_5964>"
User = "condor@parent"
ServerPid = 23388
Subsystem = "SHADOW"
SessionDuration = "86400"
Command = 1112
ValidCommands = "60000,60008,60026,60017,510,71003,60004,60012,60021,421,478,480,486,488,489,487,499,502,464,1112,481,509,511,74000,507,60007,457,60020,443,441,6,12,515,516,519,1111,471"
RemoteVersion = "$CondorVersion: 8.6.8 Jan 10 2020 BuildID: Debian-8.6.8~dfsg.1-2ubuntu1 Debian-8.6.8~dfsg.1-2ubuntu1 $"
04/22/21 23:50:25 (fd:19) (pid:2200) (D_SECURITY) DC_AUTHENTICATE: resuming session id be414c0297be797745f4ebd5b3fae3e8af918d8b562a0af0:
04/22/21 23:50:25 (fd:19) (pid:2200) (D_SECURITY) DC_AUTHENTICATE: Cached Session:
ValidCommands = "60000,60008,60026,60017,510,71003,60004,60012,60021,421,478,480,486,488,489,487,499,502,464,1112,481,509,511,74000,507,60007,457,60020,443,441,6,12,515,516,519,1111,471"
SessionDuration = "86400"
Subsystem = "SCHEDD"
ServerPid = 2200
Enact = "YES"
ParentUniqueID = "master1:2123:1619090549"
Encryption = "NO"
Authentication = "NO"
SessionLease = 3600
OutgoingNegotiation = "REQUIRED"
CryptoMethods = "3DES"
Integrity = "NO"
AuthMethods = "FS, FS_REMOTE"
Sid = "be414c0297be797745f4ebd5b3fae3e8af918d8b562a0af0"
UseSession = "YES"
TriedAuthentication = true
User = "condor@child"
04/22/21 23:50:25 (fd:19) (pid:2200) (D_SECURITY) DC_AUTHENTICATE: not authenticating.
04/22/21 23:50:25 (fd:19) (pid:2200) (D_DAEMONCORE) DAEMONCORE: EnableCrypto()
04/22/21 23:50:25 (fd:19) (pid:2200) (D_DAEMONCORE) DAEMONCORE: VerifyCommand()
04/22/21 23:50:25 (fd:19) (pid:2200) (D_SECURITY) DC_AUTHENTICATE: Success.
04/22/21 23:50:25 (fd:19) (pid:2200) (D_ALWAYS) PERMISSION GRANTED to condor@child from host 192.168.1.21 for command 1112 (QMGMT_WRITE_CMD), access level WRITE: reason: WRITE authorization has been made automatic for condor@child
04/22/21 23:50:25 (fd:19) (pid:2200) (D_AUDIT) Command=QMGMT_WRITE_CMD, peer=<192.168.1.21:22171>
04/22/21 23:50:25 (fd:19) (pid:2200) (D_AUDIT) AuthMethod=FS, FS_REMOTE, AuthId=(null), CondorId=condor@child
04/22/21 23:50:25 (fd:19) (pid:2200) (D_DAEMONCORE) DAEMONCORE: SendResponse()
04/22/21 23:50:25 (fd:19) (pid:2200) (D_DAEMONCORE) DAEMONCORE: SendResponse() : NOT m_new_session
04/22/21 23:50:25 (fd:19) (pid:2200) (D_DAEMONCORE) DAEMONCORE: ExecCommand(m_req == 1112, m_real_cmd == 1112, m_auth_cmd == 1112)
04/22/21 23:50:25 (fd:19) (pid:2200) (D_DAEMONCORE:2) selector 0x7ffdee0f8510 resetting
04/22/21 23:50:25 (fd:19) (pid:2200) (D_DAEMONCORE:2) selector 0x7ffdee0f8370 resetting
04/22/21 23:50:25 (fd:19) (pid:2200) (D_NETWORK) condor_read(fd=16 <192.168.1.21:22171>,,size=5,timeout=20,flags=0,non_blocking=1)
04/22/21 23:50:25 (fd:19) (pid:2200) (D_DAEMONCORE:2) selector 0x7ffdee0f8330 resetting
04/22/21 23:50:25 (fd:19) (pid:2200) (D_NETWORK) condor_read(fd=16 <192.168.1.21:22171>,,size=16,timeout=20,flags=0,non_blocking=1)
04/22/21 23:50:25 (fd:19) (pid:2200) (D_COMMAND) Calling HandleReq <handle_q> (0) for command 1112 (QMGMT_WRITE_CMD) from condor@child <192.168.1.21:22171>
04/22/21 23:50:25 (fd:19) (pid:2200) (D_SYSCALLS) Got request #10030
04/22/21 23:50:25 (fd:19) (pid:2200) (D_ALWAYS) SetEffectiveOwner security violation: setting owner to testuser when active owner is "condor"
04/22/21 23:50:25 (fd:19) (pid:2200) (D_NETWORK) condor_write(fd=16 <192.168.1.21:22171>,,size=21,timeout=20,flags=0,non_blocking=0)
04/22/21 23:50:25 (fd:19) (pid:2200) (D_DAEMONCORE:2) selector 0x7ffdee0f8170 resetting
04/22/21 23:50:25 (fd:19) (pid:2200) (D_DAEMONCORE:2) selector 0x7ffdee0f8170 adding fd 16 (socket:[413300])
04/22/21 23:50:25 (fd:19) (pid:2200) (D_DAEMONCORE:2) selector 0x7ffdee0f8170 adding fd 16 (socket:[413300])
04/22/21 23:50:25 (fd:19) (pid:2200) (D_DAEMONCORE:2) selector 0x7ffdee0f8170 adding fd 16 (socket:[413300])
04/22/21 23:50:25 (fd:19) (pid:2200) (D_DAEMONCORE:2) selector 0x7ffdee0f8170 adding fd 16 (socket:[413300])
04/22/21 23:50:25 (fd:19) (pid:2200) (D_THREADS) Entering thread safe start [select] in selector.cpp:322 unknown()
04/22/21 23:50:25 (fd:19) (pid:2200) (D_THREADS) Leaving thread safe start [select] in selector.cpp:322 unknown()
04/22/21 23:50:25 (fd:19) (pid:2200) (D_THREADS) Entering thread safe stop [select] in selector.cpp:336 unknown()
04/22/21 23:50:25 (fd:19) (pid:2200) (D_THREADS) Leaving thread safe stop [select] in selector.cpp:336 unknown()
04/22/21 23:50:25 (fd:19) (pid:2200) (D_THREADS) Entering thread safe start [send] in condor_rw.cpp:509 unknown()
04/22/21 23:50:25 (fd:19) (pid:2200) (D_THREADS) Leaving thread safe start [send] in condor_rw.cpp:509 unknown()
04/22/21 23:50:25 (fd:19) (pid:2200) (D_THREADS) Entering thread safe stop [send] in condor_rw.cpp:521 unknown()
04/22/21 23:50:25 (fd:19) (pid:2200) (D_THREADS) Leaving thread safe stop [send] in condor_rw.cpp:521 unknown()
04/22/21 23:50:25 (fd:19) (pid:2200) (D_SYSCALLS) 	SetEffectiveOwner
04/22/21 23:50:25 (fd:19) (pid:2200) (D_SYSCALLS) 	authenticated user = 'condor@child'
04/22/21 23:50:25 (fd:19) (pid:2200) (D_SYSCALLS) 	requested owner = 'testuser'
04/22/21 23:50:25 (fd:19) (pid:2200) (D_SYSCALLS) 	rval -1, errno 13
04/22/21 23:50:25 (fd:19) (pid:2200) (D_DAEMONCORE:2) selector 0x7ffdee0f81e0 resetting
04/22/21 23:50:25 (fd:19) (pid:2200) (D_NETWORK) condor_read(fd=16 <192.168.1.21:22171>,,size=5,timeout=20,flags=0,non_blocking=0)
04/22/21 23:50:25 (fd:19) (pid:2200) (D_DAEMONCORE:2) selector 0x7ffdee0f81e0 adding fd 16 (socket:[413300])
04/22/21 23:50:25 (fd:19) (pid:2200) (D_NETWORK) condor_read(): fd=16
04/22/21 23:50:25 (fd:19) (pid:2200) (D_THREADS) Entering thread safe start [select] in selector.cpp:322 unknown()
04/22/21 23:50:25 (fd:19) (pid:2200) (D_THREADS) Leaving thread safe start [select] in selector.cpp:322 unknown()
04/22/21 23:50:25 (fd:19) (pid:2200) (D_THREADS) Entering thread safe stop [select] in selector.cpp:336 unknown()
04/22/21 23:50:25 (fd:19) (pid:2200) (D_THREADS) Leaving thread safe stop [select] in selector.cpp:336 unknown()
04/22/21 23:50:25 (fd:19) (pid:2200) (D_NETWORK) condor_read(): select returned 1
04/22/21 23:50:25 (fd:19) (pid:2200) (D_THREADS) Entering thread safe start [recv] in condor_rw.cpp:227 unknown()
04/22/21 23:50:25 (fd:19) (pid:2200) (D_THREADS) Leaving thread safe start [recv] in condor_rw.cpp:227 unknown()
04/22/21 23:50:25 (fd:19) (pid:2200) (D_THREADS) Entering thread safe stop [recv] in condor_rw.cpp:239 unknown()
04/22/21 23:50:25 (fd:19) (pid:2200) (D_THREADS) Leaving thread safe stop [recv] in condor_rw.cpp:239 unknown()
04/22/21 23:50:25 (fd:19) (pid:2200) (D_ALWAYS:2) condor_read(): Socket closed when trying to read 5 bytes from <192.168.1.21:22171>
04/22/21 23:50:25 (fd:19) (pid:2200) (D_ALWAYS:2) IO: EOF reading packet header
04/22/21 23:50:25 (fd:19) (pid:2200) (D_NETWORK) Stream::get(int) failed to read padding
04/22/21 23:50:25 (fd:19) (pid:2200) (D_ALWAYS:2) QMGR Connection closed
04/22/21 23:50:25 (fd:19) (pid:2200) (D_COMMAND) Return from HandleReq <handle_q> (handler: 0.001590s, sec: 0.002s, payload: 0.000s)
04/22/21 23:50:25 (fd:19) (pid:2200) (D_NETWORK) CLOSE TCP <192.168.1.21:30017> fd=16
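
The violation lines in the log above follow a fixed pattern, so they can be pulled out of a verbose SchedLog mechanically. Below is a minimal Python sketch that assumes the line format shown above. The names VIOLATION, Violation and find_violations are illustrative only and are not part of HTCondor.

```python
import re
from typing import Iterable, Iterator, NamedTuple

# Pattern for the D_ALWAYS violation line as it appears in the excerpt above.
# Assumption: other HTCondor versions may word or prefix this line differently.
VIOLATION = re.compile(
    r'^(?P<date>\d{2}/\d{2}/\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) .*?'
    r'SetEffectiveOwner security violation: '
    r'setting owner to (?P<requested>\S+) '
    r'when active owner is "(?P<active>[^"]*)"'
)


class Violation(NamedTuple):
    timestamp: str   # date and time as printed in the log
    requested: str   # owner the client asked to switch to
    active: str      # owner the schedd associated with the connection


def find_violations(lines: Iterable[str]) -> Iterator[Violation]:
    """Yield one Violation per matching log line, skipping everything else."""
    for line in lines:
        match = VIOLATION.search(line)
        if match:
            yield Violation(
                f"{match['date']} {match['time']}",
                match['requested'],
                match['active'],
            )
```

Passing an open log file to find_violations (the path depends on the local LOG setting) lists every requested/active owner pair. That makes it easy to see whether the same two identities are involved each time.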

> -----Original Message-----
> From: Hennen, Christian
> Sent: Thursday, 14 January 2021 18:07
> To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
> Subject: Re: [HTCondor-users] HTCondor high availability
> 
> Hi!
> 
> Currently jobs are being submitted and executed as they should, but they
> don't complete because of the problems described in my last e-mail. I
> checked every log and every folder for irregularities and compared the
> running configuration to the declarations in the config files. I did
> discover that CONDOR_HOST had been set incorrectly because 00debconf was
> overwriting our own value, but that had nothing to do with the problem at
> hand.
> 
> Does anyone have an idea why we get these messages about permission errors
> even though there is no user "condor" in our cluster? The problem is really
> annoying, since we have to remove jobs manually after checking their logs
> for completion.
> 
> Kind regards
> Christian Hennen
> 
> -----Original Message-----
> From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of
> Hennen, Christian
> Sent: Monday, 19 October 2020 12:33
> To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
> Subject: Re: [HTCondor-users] HTCondor high availability
> 
> Hi again,
> 
> two other settings seem to be set to "non-standard" values, apart from the
> allow list. LOCAL_DIR is set to /var/condor instead of /var and CONDOR_IDS
> is set to 1000.1000 (user r-admin).
> 
> While that worked in the previous iteration of the cluster, jobs now switch
> back and forth from running to idle and vice versa.
> SchedLog says "SetEffectiveOwner security violation: setting owner to
> r-admin when active owner is "condor""
> ShadowLog says "SetEffectiveOwner(r-admin) failed with errno=13: Permission
> denied."
> 
> How do I find out which folders/files are affected and maybe have wrong
> permissions?
> 
> There are no files owned by "condor" on the network share /clients (where
> the spool and FS_REMOTE dirs now reside) and none relevant on the local hard
> drive. The only files found by "find / -xdev -user condor" are the standard
> condor directories under /var, which are all empty (due to LOCAL_DIR being
> set to /var/condor).
> 
> 
> Kind regards
> Christian Hennen
> 
> -----Original Message-----
> From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of
> Hennen, Christian
> Sent: Wednesday, 14 October 2020 17:58
> To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
> Subject: Re: [HTCondor-users] HTCondor high availability
> 
> Hi Todd,
> 
> exactly. While security is obviously important and has nothing to do with
> the HA setup itself, I was surprised to have to configure security for the
> communication between the masters. That's mainly because I "inherited" this
> cluster and the original config contained * in the allow list, so I never
> experienced these types of issues. Securing the HTCondor part of the cluster
> is now on my list of planned security changes :)
> For now, since the cluster is completely separated from the rest of the
> network, working job processing and high availability of all services were
> more of a priority.
> 
> After changing SEC_DEFAULT_AUTHENTICATION_METHODS to FS,FS_REMOTE, condor_q
> now works as expected and jobs can be submitted and started. They change to
> Idle after a while, but maybe that's not related to the HTCondor config.
> 
> Kind regards
> Christian Hennen
> 
> 
> -----Original Message-----
> From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of
> Todd L Miller
> Sent: Friday, 9 October 2020 22:21
> To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
> Subject: Re: [HTCondor-users] HTCondor high availability
> 
> > Do I need to configure any other authentication methods in addition to
> > all servers using LDAP via PAM ?
> 
>  	Yes, of course.  Security between different nodes has nothing to
> do with how users log in.
> 
> > I tried to set the variable as you suggested, to no avail. Master2 now
> > says it can't connect to master1 ("Failed to fetch ads")
> 
>  	From your description, master1 is the original "master" node.  I
> don't know if HAD will work for machines that are both submit nodes and
> central managers, but for now let's assume that it will.  Note that HA
> instructions do NOT address security at all; that's deliberate, because
> security is complicated and nothing in HA changes anything about how your
> security should work, except the addition of another server.  It's a bit
> more of a surprise to you, perhaps, because you didn't separate your central
> manager from your submit server (and thus FS worked for all your
> client-to-daemon connections).
> 
>  	From your serverfault question, it looks like you basically don't
> have any security at all -- your ALLOW lists include *, so the problem must
> be in authentication, not authorization.
> 
>  	Note that condor_q, by default in recent HTCondor versions, requires
> authentication so that it only returns the jobs of the user who ran the
> command.  Try running 'condor_q -all-users'; I think that will use a
> different command that doesn't require authentication.
> 
>  	For this purpose, given that you know that the two masters share a
> filesystem and user IDs, FS_REMOTE is not a bad choice.  You'll need to set
> SEC_DEFAULT_AUTHENTICATION_METHODS on master1 and master2 to include FS and
> FS_REMOTE; I would remove KERBEROS (since you're not using it).
> Both master1 and master2 need to set FS_REMOTE_DIR to the same value.  Be
> sure to restart HTCondor on both machines after you've done that (I can't
> keep straight which configuration changes only require a reconfig).  Try
> running condor_q again; it should work.  If it doesn't, try running
> 
> _CONDOR_TOOL_DEBUG=D_FULLDEBUG condor_q -debug
> 
> and we'll see what we can see.
> 
> - ToddM
