Dear all,
I have tracked this down to what looks like a bug related to this
entry in the 9.7.0 changelog:
    - DAGMan submits jobs directly (does not shell out to condor_submit)
The process can't open /etc/krb5.keytab because it is running as
the submitting user, not the daemon, whereas it wants to
authenticate to the local schedd as daemon (and hence needs the
host credentials from the keytab).
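A quick way to confirm the permission side of this is to test whether the current user can open the keytab at all. The following is a minimal sketch, not HTCondor code: can_read_file is a hypothetical helper, and the assumption is that the keytab is a regular file that only root can read, as is typical.

```cpp
#include <fstream>
#include <string>

// Hypothetical helper (not part of HTCondor): returns true if the file at
// `path` can be opened for reading by the user this process runs as.
bool can_read_file(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    return in.is_open();
}
```

Calling can_read_file("/etc/krb5.keytab") as the submitting user would be expected to return false on a host where the keytab is readable by root only, which would be consistent with the failure described above.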
This is the code in src/condor_io/condor_auth_kerberos.cpp (comments
elided) where the process appears to take the wrong turn:
    int Condor_Auth_Kerberos::authenticate(const char * /* remoteHost */,
            CondorError* /* errstack */, bool /*non_blocking*/)
    {
        if ( mySock_->isClient() )
        {
            if (init_kerberos_context() && init_server_info())
            {
                if (isDaemon() || get_mySubSystem()->isDaemon() ) {
                    status = init_daemon();
                } else {
                    status = init_user();
                }
The process goes into init_daemon(), whereas for regular job
submissions it goes into init_user().
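To make the branch easier to reason about, here is a toy model of the decision in the excerpt. It is only a sketch: AuthContext, InitPath and pick_init_path are names invented for illustration, and the two booleans merely stand in for isDaemon() and get_mySubSystem()->isDaemon(). I have not checked which of the two is true in the DAGMan case.

```cpp
// Toy stand-ins for the two conditions tested in the excerpt.
// These are NOT the HTCondor types.
struct AuthContext {
    bool connection_is_daemon;  // models isDaemon()
    bool subsystem_is_daemon;   // models get_mySubSystem()->isDaemon()
};

enum class InitPath { Daemon, User };

// Mirrors the if/else in the excerpt: either condition being true
// selects the init_daemon() path, otherwise the init_user() path.
InitPath pick_init_path(const AuthContext& ctx) {
    if (ctx.connection_is_daemon || ctx.subsystem_is_daemon) {
        return InitPath::Daemon;
    }
    return InitPath::User;
}
```

The point of the model is that a process running as an ordinary user still ends up on the daemon path if its subsystem alone is classed as a daemon, regardless of which uid it actually runs as.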
I'm happy to work on fixing this (I have some familiarity with
that file, having won the bug of the week in 2020 for a 16-year-old
one[1]), but it seems to be a bit of a deeper issue: what _should_
the code be running as?
Cheers,
Marco
[1] https://github.com/htcondor/htcondor/pull/99
On 07/04/2022 04:33, Marco van Zwetselaar wrote:
Dear all,
Submitting jobs on my cluster works fine, but submitted DAGs
fail with what looks like an authentication failure.
I've tested with the simplest possible DAG ("JOB test test.job")
and the simplest possible job ("executable = /bin/hostname"),
submitted by the same user, on the same machine. The job goes
well, but the DAG fails with this in the logs:
In test.dag.dagman.out:
    04/07/22 03:14:20 Submitting HTCondor Node test job(s)...
    04/07/22 03:14:20 Submitting node test from file test.job using direct job submission
    04/07/22 03:14:20 AUTH_ERROR: Generic preauthentication failure
    04/07/22 03:14:20 SECMAN: required authentication with local schedd failed, so aborting command QMGMT_WRITE_CMD.
    04/07/22 03:14:20 Can't connect to queue manager: AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using KERBEROS
And the SchedLog has:
    04/07/22 03:14:20 (pid:1072) DC_AUTHENTICATE: authentication of <x.x.x.x:43467> did not result in a valid mapped user name, which is required for this command (1112 QMGMT_WRITE_CMD), so aborting.
    04/07/22 03:14:20 (pid:1072) DC_AUTHENTICATE: reason for authentication failure: AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using KERBEROS
Our authentication is all Kerberos (FreeIPA) and works well
across the cluster. The x.x.x.x is the IP of the local machine.
The user's credentials are OK: 'condor_submit test.job' right
after the failure runs fine.
When I run with debug logging, I see this in the lead up to the
above:
    04/07/22 03:46:03 (D_SECURITY) SECMAN: new session, doing initial authentication.
    04/07/22 03:46:03 (fd:7) (pid:287909) (D_SECURITY) SECMAN: Auth methods: KERBEROS
    04/07/22 03:46:03 (D_SECURITY) AUTHENTICATE: setting timeout for <x.x.x.x:9618?addrs=x.x.x.x-9618&alias=crick.my.domain&noUDP&sock=schedd_968_8bc4> to 20.
    04/07/22 03:46:03 (D_SECURITY) HANDSHAKE: in handshake(my_methods = 'KERBEROS')
    04/07/22 03:46:03 (D_SECURITY) HANDSHAKE: handshake() - i am the client
    04/07/22 03:46:03 (D_SECURITY) HANDSHAKE: server replied (method = 64)
    04/07/22 03:46:03 (D_SECURITY) KERBEROS: get remote server principal for "host/crick.my.domain"
    04/07/22 03:46:03 (D_SECURITY) KERBEROS: krb5_unparse_name: host/crick.my.domain@xxxxxxxxx
    04/07/22 03:46:03 (D_SECURITY) KERBEROS: no user yet determined, will grab up to slash
    04/07/22 03:46:03 (D_SECURITY) KERBEROS: picked user: host
    04/07/22 03:46:03 (D_SECURITY) KERBEROS: remapping 'host' to 'condor'
    04/07/22 03:46:03 (D_SECURITY) unable to open map file (null), errno 22
    04/07/22 03:46:03 (D_SECURITY) Client is condor@xxxxxxxxx
    04/07/22 03:46:03 (D_SECURITY) init_daemon: client principal is 'host/crick.my.domain@xxxxxxxxx'
    04/07/22 03:46:03 (D_SECURITY) init_daemon: Using default keytab FILE:/etc/krb5.keytab
    04/07/22 03:46:03 (D_SECURITY) init_daemon: Trying to get tgt credential for service host/crick.my.domain@xxxxxxxxx
    04/07/22 03:46:03 (D_PRIV) PRIV_CONDOR --> PRIV_ROOT at /var/lib/condor/execute/slot1/dir_93903/userdir/.tmpWTI97r/condor-9.7.0/src/condor_io/condor_auth_kerberos.cpp:632
    04/07/22 03:46:03 (D_PRIV) PRIV_ROOT --> PRIV_CONDOR at /var/lib/condor/execute/slot1/dir_93903/userdir/.tmpWTI97r/condor-9.7.0/src/condor_io/condor_auth_kerberos.cpp:634
    04/07/22 03:46:03 (D_ALWAYS) AUTH_ERROR: Generic preauthentication failure
    04/07/22 03:46:03 (D_SECURITY) AUTHENTICATE: method 64 (KERBEROS) failed.
    04/07/22 03:46:03 (D_SECURITY) HANDSHAKE: in handshake(my_methods = '')
    04/07/22 03:46:03 (D_SECURITY) HANDSHAKE: handshake() - i am the client
    04/07/22 03:46:03 (D_SECURITY) HANDSHAKE: sending (methods == 0) to server
    04/07/22 03:46:03 (D_SECURITY) HANDSHAKE: server replied (method = 0)
    04/07/22 03:46:03 (D_ALWAYS) SECMAN: required authentication with local schedd failed, so aborting command QMGMT_WRITE_CMD.
    04/07/22 03:46:03 (fd:6) (pid:287909) (D_ALWAYS) WARNING: failed to connect to queue manager (AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using KERBEROS)
I'm not sure what the mechanics are supposed to be, but it looks
like the local machine (rather than the user) credentials are
being used to authenticate with the local schedd, and this
somehow doesn't work? Could it be that this code is running as
non-root so the krb5.keytab is inaccessible?
Cheers
Marco
--
Marco van Zwetselaar
Bioinformatician
Kilimanjaro Clinical Research Institute
P.O. Box 2236 | Moshi, Kilimanjaro | Tanzania