Judging by your findings and the fact that setting
DAGMAN_USE_DIRECT_SUBMIT = false in the config file avoids the
failure, it appears to be an issue with the transition from DAGs
submitting jobs by shelling out to condor_submit to direct job
submission to the schedd, but I can't say with certainty.
I will do some digging around. It may take me a little while, as I
am a newer team member, but in the meantime can you send me the
nodes.log file?
Cheers,
Cole Bollig
For those running into this issue, the workaround is to set
the config variable
    DAGMAN_USE_DIRECT_SUBMIT = False
This reverts to the pre-9.7.0 behaviour where DAGMan submits
through condor_submit.
Marco
On 08/04/2022 12:34, Marco van Zwetselaar wrote:
Dear all,
I have tracked this down to what looks like a bug related to
this in the 9.7.0 Changelog:
- DAGMan submits jobs directly (does
not shell out to condor_submit)
The process can't open /etc/krb5.keytab because it is running
as the submitting user, not the daemon, whereas it wants to
authenticate to the local schedd as daemon (and hence needs
the host credentials from the keytab).
This is the code in src/condor_io/condor_auth_kerberos
(comments elided) where the process appears to take the wrong
turn:
    int Condor_Auth_Kerberos::authenticate(const char * /* remoteHost */,
            CondorError* /* errstack */, bool /*non_blocking*/)
    {
        if ( mySock_->isClient() )
        {
            if (init_kerberos_context() && init_server_info())
            {
                if (isDaemon() || get_mySubSystem()->isDaemon() ) {
                    status = init_daemon();
                } else {
                    status = init_user();
                }
The process goes into init_daemon(), whereas for regular job
submissions it goes into init_user().
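
To make the branch above easier to reason about, here is a minimal standalone sketch of the same decision. It is a simplified model, not HTCondor code: ClientContext, CredentialSource and pick_credential_source are hypothetical names introduced only for illustration. The mapping of init_daemon() to the host keytab follows the "init_daemon: Using default keytab" line in the debug log; that init_user() relies on the submitting user's own Kerberos credentials is an assumption.

```cpp
#include <cassert>

// Hypothetical stand-in for the two conditions tested in the excerpt.
struct ClientContext {
    bool auth_is_daemon;       // models isDaemon()
    bool subsystem_is_daemon;  // models get_mySubSystem()->isDaemon()
};

// Which credentials the client side would try to use (illustrative only).
enum class CredentialSource {
    HostKeytab,       // init_daemon(): host credentials from the keytab
    UserCredentials   // init_user(): assumed to use the user's own credentials
};

// Mirrors the branch in the excerpt: if either flag is true, the daemon
// path is taken; only when both are false is the user path taken.
CredentialSource pick_credential_source(const ClientContext& ctx) {
    if (ctx.auth_is_daemon || ctx.subsystem_is_daemon) {
        return CredentialSource::HostKeytab;
    }
    return CredentialSource::UserCredentials;
}
```

Read this way, a process that reports itself as a daemon through either condition will look for the host keytab even when it runs as an ordinary user, which would match the behaviour seen with direct submission.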
I'm happy to work on fixing this (I have some familiarity with
that file, having won the bug of the week in 2020 for a
16-year-old one[1]), but it seems to be a bit of a deeper issue:
what _should_ the code be running as?
Cheers,
Marco
[1] https://github.com/htcondor/htcondor/pull/99
On 07/04/2022 04:33, Marco van Zwetselaar wrote:
Dear all,
Submitting jobs on my cluster works fine, but submitted DAGs
fail with what looks like an authentication failure.
I've tested with the simplest possible DAG ("JOB test
test.job") and the simplest possible job ("executable =
/bin/hostname"), submitted by the same user, on the same
machine. The job goes well, but the DAG fails with this in
the logs:
In test.dag.dagman.out:
04/07/22 03:14:20 Submitting HTCondor
Node test job(s)...
04/07/22 03:14:20 Submitting node test from file test.job
using direct job submission
04/07/22 03:14:20 AUTH_ERROR: Generic preauthentication
failure
04/07/22 03:14:20 SECMAN: required authentication with
local schedd failed, so aborting command QMGMT_WRITE_CMD.
04/07/22 03:14:20 Can't connect to queue manager:
AUTHENTICATE:1003:Failed to authenticate with any
method|AUTHENTICATE:1004:Failed to authenticate using
KERBEROS
And the SchedLog has:
04/07/22 03:14:20 (pid:1072)
DC_AUTHENTICATE: authentication of <x.x.x.x:43467>
did not result in a valid mapped user name, which is
required for this command (1112 QMGMT_WRITE_CMD), so
aborting.
04/07/22 03:14:20 (pid:1072) DC_AUTHENTICATE: reason for
authentication failure: AUTHENTICATE:1003:Failed to
authenticate with any method|AUTHENTICATE:1004:Failed to
authenticate using KERBEROS
Our authentication is all Kerberos (FreeIPA) and works well
across the cluster. The x.x.x.x is the IP of the local
machine. The user's credentials are OK: 'condor_submit
test.job' right after the failure runs fine.
When I run with debug logging, I see this in the lead up to
the above:
04/07/22 03:46:03 (D_SECURITY)
SECMAN: new session, doing initial authentication.
04/07/22 03:46:03 (fd:7) (pid:287909) (D_SECURITY) SECMAN:
Auth methods: KERBEROS
04/07/22 03:46:03 (D_SECURITY) AUTHENTICATE: setting
timeout for
<x.x.x.x:9618?addrs=x.x.x.x-9618&alias=crick.my.domain&noUDP&sock=schedd_968_8bc4>
to 20.
04/07/22 03:46:03 (D_SECURITY) HANDSHAKE: in
handshake(my_methods = 'KERBEROS')
04/07/22 03:46:03 (D_SECURITY) HANDSHAKE: handshake() - i
am the client
04/07/22 03:46:03 (D_SECURITY) HANDSHAKE: server replied
(method = 64)
04/07/22 03:46:03 (D_SECURITY) KERBEROS: get remote server
principal for "host/crick.my.domain"
04/07/22 03:46:03 (D_SECURITY) KERBEROS:
krb5_unparse_name:
host/crick.my.domain@xxxxxxxxx
04/07/22 03:46:03 (D_SECURITY) KERBEROS: no user yet
determined, will grab up to slash
04/07/22 03:46:03 (D_SECURITY) KERBEROS: picked user: host
04/07/22 03:46:03 (D_SECURITY) KERBEROS: remapping 'host'
to 'condor'
04/07/22 03:46:03 (D_SECURITY) unable to open map file
(null), errno 22
04/07/22 03:46:03 (D_SECURITY) Client is
condor@xxxxxxxxx
04/07/22 03:46:03 (D_SECURITY) init_daemon: client
principal is 'host/crick.my.domain@xxxxxxxxx'
04/07/22 03:46:03 (D_SECURITY) init_daemon: Using default
keytab
FILE:/etc/krb5.keytab
04/07/22 03:46:03 (D_SECURITY) init_daemon: Trying to get
tgt credential for service
host/crick.my.domain@xxxxxxxxx
04/07/22 03:46:03 (D_PRIV) PRIV_CONDOR --> PRIV_ROOT at
/var/lib/condor/execute/slot1/dir_93903/userdir/.tmpWTI97r/condor-9.7.0/src/condor_io/condor_auth_kerberos.cpp:632
04/07/22 03:46:03 (D_PRIV) PRIV_ROOT --> PRIV_CONDOR at
/var/lib/condor/execute/slot1/dir_93903/userdir/.tmpWTI97r/condor-9.7.0/src/condor_io/condor_auth_kerberos.cpp:634
04/07/22 03:46:03 (D_ALWAYS) AUTH_ERROR: Generic
preauthentication failure
04/07/22 03:46:03 (D_SECURITY) AUTHENTICATE: method 64
(KERBEROS) failed.
04/07/22 03:46:03 (D_SECURITY) HANDSHAKE: in
handshake(my_methods = '')
04/07/22 03:46:03 (D_SECURITY) HANDSHAKE: handshake() - i
am the client
04/07/22 03:46:03 (D_SECURITY) HANDSHAKE: sending (methods
== 0) to server
04/07/22 03:46:03 (D_SECURITY) HANDSHAKE: server replied
(method = 0)
04/07/22 03:46:03 (D_ALWAYS) SECMAN: required
authentication with local schedd failed, so aborting
command QMGMT_WRITE_CMD.
04/07/22 03:46:03 (fd:6) (pid:287909) (D_ALWAYS) WARNING:
failed to connect to queue manager
(AUTHENTICATE:1003:Failed to authenticate with any
method|AUTHENTICATE:1004:Failed to authenticate using
KERBEROS)
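
The user-selection steps visible in the log above ("will grab up to slash", "picked user: host", "remapping 'host' to 'condor'") can be sketched as follows. This is a rough model of what the log lines describe, not the HTCondor implementation: local_user_from_principal is a hypothetical helper, the fallback for principals without a slash is an assumption, and EXAMPLE.ORG in the usage note stands in for the redacted realm.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper modelling the log lines; not HTCondor code.
std::string local_user_from_principal(const std::string& principal) {
    // "will grab up to slash": take the first component of the principal.
    std::string::size_type end = principal.find('/');
    if (end == std::string::npos) {
        // Assumption: with no slash, use everything before the realm.
        end = principal.find('@');
    }
    // substr(0, npos) returns the whole string if neither was found.
    std::string user = principal.substr(0, end);

    // "remapping 'host' to 'condor'"
    if (user == "host") {
        user = "condor";
    }
    return user;
}
```

For example, local_user_from_principal("host/crick.my.domain@EXAMPLE.ORG") yields "condor", consistent with the "Client is condor@..." line in the log.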
I'm not sure what the mechanics are supposed to be, but it
looks like the local machine (rather than the user)
credentials are being used to authenticate with the local
schedd, and this somehow doesn't work? Could it be that this
code is running as non-root so the krb5.keytab is
inaccessible?
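
One way to test that hypothesis is to check whether the keytab can be opened by the account the DAG runs under. The snippet below is a minimal sketch; can_read_file is a hypothetical helper, and the keytab path is the one reported in the debug log.

```cpp
#include <cassert>
#include <fstream>
#include <string>

// Returns true if the current process can open `path` for reading.
// The open is attempted with the process's effective user, so calling
// this as the submitting user shows what that user is allowed to read.
bool can_read_file(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    return in.is_open();
}
```

Calling can_read_file("/etc/krb5.keytab") as the submitting user and again as root would show whether file permissions explain the failure. The keytab is typically readable by root only, in which case the first call would be expected to return false.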
Cheers
Marco
--
Marco van Zwetselaar
Bioinformatician
Kilimanjaro Clinical Research Institute
P.O. Box 2236 | Moshi, Kilimanjaro | Tanzania