Re: [HTCondor-users] condor_submit_dag fails with authentication failure

Date: Fri, 08 Apr 2022 23:16:55 +0300

From: Marco van Zwetselaar <zwets@xxxxxxxxxx>

Subject: Re: [HTCondor-users] condor_submit_dag fails with authentication failure

On 08/04/2022 12:34, Marco van Zwetselaar wrote:

Dear all,

I have tracked this down to what looks like a bug related to this entry in the 9.7.0 changelog:

- DAGMan submits jobs directly (does not shell out to condor_submit)

The process can't open /etc/krb5.keytab because it is running as the submitting user, not the daemon, whereas it wants to authenticate to the local schedd as daemon (and hence needs the host credentials from the keytab).
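
One way to confirm the permission problem is to try opening the keytab as the submitting user. The sketch below is only a stand-alone diagnostic, not HTCondor code; the helper name keytab_read_error is made up for illustration.

```cpp
#include <cassert>
#include <cerrno>
#include <cstring>
#include <string>

#include <fcntl.h>   // open()
#include <unistd.h>  // close()

// Hypothetical helper: tries to open `path` read-only with the privileges
// of the calling process. Returns an empty string on success, otherwise
// the system error message (for example "Permission denied").
std::string keytab_read_error(const std::string &path) {
    int fd = open(path.c_str(), O_RDONLY);
    if (fd >= 0) {
        close(fd);
        return "";
    }
    return std::strerror(errno);
}
```

Called with "/etc/krb5.keytab" from a process running as an ordinary user, this would typically report a permission error, because the keytab is normally readable by root only.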

This is the code in src/condor_io/condor_auth_kerberos.cpp (comments elided) where the process appears to take the wrong turn:

    int Condor_Auth_Kerberos :: authenticate(const char * /* remoteHost */, CondorError* /* errstack */, bool /*non_blocking*/)
    {
        if ( mySock_->isClient() )
        {
            if (init_kerberos_context() && init_server_info())
            {
                if (isDaemon() || get_mySubSystem()->isDaemon() ) {
                    status = init_daemon();
                } else {
                    status = init_user();
                }

The process goes into init_daemon(), whereas for regular job submissions it goes into init_user().
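
To make the branch concrete, here is a minimal stand-alone model of the condition quoted above. It is a sketch under assumptions, not the HTCondor implementation: the struct, the enum, and pick_credentials are hypothetical names, and which of the two flags is actually set inside the DAGMan process has not been verified here.

```cpp
#include <cassert>

// Hypothetical stand-ins for the two inputs of the quoted condition.
struct AuthContext {
    bool socket_is_daemon;     // models isDaemon()
    bool subsystem_is_daemon;  // models get_mySubSystem()->isDaemon()
};

// Which credentials the client side will try to use.
enum class CredentialSource {
    HostKeytab,  // the init_daemon() path: host principal from the keytab
    UserTicket   // the init_user() path: assumed to use the user's own ticket
};

// Mirrors the if/else in the excerpt: either flag being true selects the
// daemon path, regardless of which user the process is running as.
CredentialSource pick_credentials(const AuthContext &ctx) {
    if (ctx.socket_is_daemon || ctx.subsystem_is_daemon) {
        return CredentialSource::HostKeytab;
    }
    return CredentialSource::UserTicket;
}
```

In this model, a process that is flagged as a daemon subsystem but runs as an unprivileged user still selects HostKeytab, which matches the failure seen in the logs: the keytab path is chosen, but the keytab cannot be read.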

I'm happy to work on fixing this (I have some familiarity with that file, having won the bug of the week in 2020 for a 16-year-old one[1]), but it seems to be a bit of a deeper issue: what _should_ the code be running as?

Cheers,
Marco

[1] https://github.com/htcondor/htcondor/pull/99

On 07/04/2022 04:33, Marco van Zwetselaar wrote:

Dear all,

Submitting jobs on my cluster works fine, but submitted DAGs fail with what looks like an authentication failure.

I've tested with the simplest possible DAG ("JOB test test.job") and the simplest possible job ("executable = /bin/hostname"), submitted by the same user, on the same machine. The job goes well, but the DAG fails with this in the logs:

In test.dag.dagman.out:

04/07/22 03:14:20 Submitting HTCondor Node test job(s)...
04/07/22 03:14:20 Submitting node test from file test.job using direct job submission
04/07/22 03:14:20 AUTH_ERROR: Generic preauthentication failure
04/07/22 03:14:20 SECMAN: required authentication with local schedd failed, so aborting command QMGMT_WRITE_CMD.
04/07/22 03:14:20 Can't connect to queue manager: AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using KERBEROS

And the SchedLog has:

04/07/22 03:14:20 (pid:1072) DC_AUTHENTICATE: authentication of <x.x.x.x:43467> did not result in a valid mapped user name, which is required for this command (1112 QMGMT_WRITE_CMD), so aborting.
04/07/22 03:14:20 (pid:1072) DC_AUTHENTICATE: reason for authentication failure: AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using KERBEROS

Our authentication is all Kerberos (FreeIPA) and works well across the cluster. The x.x.x.x is the IP of the local machine. The user's credentials are OK: 'condor_submit test.job' right after the failure runs fine.

When I run with debug logging, I see this in the lead up to the above:

04/07/22 03:46:03 (D_SECURITY) SECMAN: new session, doing initial authentication.
04/07/22 03:46:03 (fd:7) (pid:287909) (D_SECURITY) SECMAN: Auth methods: KERBEROS
04/07/22 03:46:03 (D_SECURITY) AUTHENTICATE: setting timeout for <x.x.x.x:9618?addrs=x.x.x.x-9618&alias=crick.my.domain&noUDP&sock=schedd_968_8bc4> to 20.
04/07/22 03:46:03 (D_SECURITY) HANDSHAKE: in handshake(my_methods = 'KERBEROS')
04/07/22 03:46:03 (D_SECURITY) HANDSHAKE: handshake() - i am the client
04/07/22 03:46:03 (D_SECURITY) HANDSHAKE: server replied (method = 64)
04/07/22 03:46:03 (D_SECURITY) KERBEROS: get remote server principal for "host/crick.my.domain"
04/07/22 03:46:03 (D_SECURITY) KERBEROS: krb5_unparse_name: host/crick.my.domain@xxxxxxxxx
04/07/22 03:46:03 (D_SECURITY) KERBEROS: no user yet determined, will grab up to slash
04/07/22 03:46:03 (D_SECURITY) KERBEROS: picked user: host
04/07/22 03:46:03 (D_SECURITY) KERBEROS: remapping 'host' to 'condor'
04/07/22 03:46:03 (D_SECURITY) unable to open map file (null), errno 22
04/07/22 03:46:03 (D_SECURITY) Client is condor@xxxxxxxxx
04/07/22 03:46:03 (D_SECURITY) init_daemon: client principal is 'host/crick.my.domain@xxxxxxxxx'
04/07/22 03:46:03 (D_SECURITY) init_daemon: Using default keytab FILE:/etc/krb5.keytab
04/07/22 03:46:03 (D_SECURITY) init_daemon: Trying to get tgt credential for service host/crick.my.domain@xxxxxxxxx
04/07/22 03:46:03 (D_PRIV) PRIV_CONDOR --> PRIV_ROOT at /var/lib/condor/execute/slot1/dir_93903/userdir/.tmpWTI97r/condor-9.7.0/src/condor_io/condor_auth_kerberos.cpp:632
04/07/22 03:46:03 (D_PRIV) PRIV_ROOT --> PRIV_CONDOR at /var/lib/condor/execute/slot1/dir_93903/userdir/.tmpWTI97r/condor-9.7.0/src/condor_io/condor_auth_kerberos.cpp:634
04/07/22 03:46:03 (D_ALWAYS) AUTH_ERROR: Generic preauthentication failure
04/07/22 03:46:03 (D_SECURITY) AUTHENTICATE: method 64 (KERBEROS) failed.
04/07/22 03:46:03 (D_SECURITY) HANDSHAKE: in handshake(my_methods = '')
04/07/22 03:46:03 (D_SECURITY) HANDSHAKE: handshake() - i am the client
04/07/22 03:46:03 (D_SECURITY) HANDSHAKE: sending (methods == 0) to server
04/07/22 03:46:03 (D_SECURITY) HANDSHAKE: server replied (method = 0)
04/07/22 03:46:03 (D_ALWAYS) SECMAN: required authentication with local schedd failed, so aborting command QMGMT_WRITE_CMD.
04/07/22 03:46:03 (fd:6) (pid:287909) (D_ALWAYS) WARNING: failed to connect to queue manager (AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using KERBEROS)

I'm not sure what the mechanics are supposed to be, but it looks like the local machine (rather than the user) credentials are being used to authenticate with the local schedd, and this somehow doesn't work? Could it be that this code is running as non-root so the krb5.keytab is inaccessible?

Cheers
Marco

--

Marco van Zwetselaar

Bioinformatician

Kilimanjaro Clinical Research Institute

P.O. Box 2236 | Moshi, Kilimanjaro | Tanzania

http://www.kcri.ac.tz/ | zwets@xxxxxxxxxx | +255 782 334124
