Hi Maarten,
Sorry to keep asking for more information, but could you add D_SECURITY to SCHEDD_DEBUG in the configuration (this needs a condor_reconfig) and share the Schedd log (or the part relevant to the attempted job submission) after reproducing both the failure case and the success case once?
Feel free to send the log directly to me if you want to keep it out of the public eye.
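In case it helps, here is a minimal sketch of that configuration change. It assumes the usual /etc/condor/config.d drop-in directory; the file name 99-schedd-debug.conf and the helper function name are made up for illustration, so adjust them to your setup.

```shell
# Write a drop-in config fragment that appends D_SECURITY to SCHEDD_DEBUG.
# The default directory and the file name are illustrative assumptions.
write_schedd_debug_fragment() {
    confdir=${1:-/etc/condor/config.d}
    printf '%s\n' 'SCHEDD_DEBUG = $(SCHEDD_DEBUG) D_SECURITY' \
        > "$confdir/99-schedd-debug.conf"
}

# Typical use, as root on the schedd host:
#   write_schedd_debug_fragment
#   condor_reconfig
#   condor_config_val SCHEDD_DEBUG    # to check the resulting value
```

Removing the fragment and running condor_reconfig again should restore the previous debug level.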
-Cole Bollig
From: Maarten Litmaath <Maarten.Litmaath@xxxxxxx>
Sent: Thursday, February 13, 2025 1:14 PM
To: htcondor-users@xxxxxxxxxxx <htcondor-users@xxxxxxxxxxx>
Cc: Cole Bollig <cabollig@xxxxxxxx>
Subject: Re: Re: [HTCondor-users] v24.0.4 condor_submit only works sometimes
Hi Cole,
the first test submission with the correct debug option failed as follows:
======================================================================
Submitting job(s)02/13/25 19:25:26 CRED: NO MODULES REQUESTED
02/13/25 19:25:26 CREDMON: skipping the storage of any LOCAL credential with CredD.
02/13/25 19:25:26 SECMAN: command 1112 QMGMT_WRITE_CMD to local schedd from TCP port 18611 (blocking).
02/13/25 19:25:26 Can't open directory "/etc/condor/passwords.d" as PRIV_ROOT, errno: 13 (Permission denied)
02/13/25 19:25:26 Failed to determine available TOKEN keys: CRED:1:Cannot open /etc/condor/passwords.d: Permission denied (errno=13)
02/13/25 19:25:26 Can't open directory "/etc/condor/passwords.d" as PRIV_ROOT, errno: 13 (Permission denied)
02/13/25 19:25:26 Failed to determine available TOKEN keys: CRED:1:Cannot open /etc/condor/passwords.d: Permission denied (errno=13)
02/13/25 19:25:26 SECMAN: new session, doing initial authentication.
02/13/25 19:25:26 SECMAN: Auth methods: FS,TOKEN,KERBEROS,SCITOKENS
02/13/25 19:25:26 AUTHENTICATE: setting timeout for <188.184.72.210:9618?addrs=188.184.72.210-9618+[2001-1458-d00-4e--100-35b]-9618&alias=htc24s-ce.cern.ch&noUDP&sock=schedd_26081_f9a3> to 20.
02/13/25 19:25:26 HANDSHAKE: in handshake(my_methods = 'FS,TOKEN,KERBEROS,SCITOKENS')
02/13/25 19:25:26 HANDSHAKE: handshake() - i am the client
02/13/25 19:25:26 HANDSHAKE: sending (methods == 6212) to server
02/13/25 19:25:26 HANDSHAKE: server replied (method = 4)
02/13/25 19:25:26 AUTHENTICATE_FS: used dir /tmp/FS_XXXVsoQeP, status: 1
02/13/25 19:25:26 Authentication was a Success.
02/13/25 19:25:26 AUTHENTICATION: setting default map to (null)
02/13/25 19:25:26 AUTHENTICATION: post-map: current FQU is '(null)'
02/13/25 19:25:26 AUTHENTICATE: Exchanging keys with remote side.
02/13/25 19:25:26 AUTHENTICATE: Result of end of authenticate is 1.
02/13/25 19:25:26 SECMAN: generating AES key for session with local schedd...
02/13/25 19:25:26 SECMAN: successfully enabled encryption!
02/13/25 19:25:26 SECMAN: successfully enabled message authenticator!
02/13/25 19:25:26 SESSION: client duplicated AES to BLOWFISH key for UDP.
02/13/25 19:25:26 SECMAN: added session htc24s-ce:26121:1739471126:396 to cache for 60 seconds (3600s lease).
02/13/25 19:25:26 SECMAN: startCommand succeeded.
.
ERROR: Failed to commit job submission into the queue.
ERROR: Failed to create new User record for condor@xxxxxxxx
======================================================================
So far, a few hundred more test job submissions have all succeeded...
Here are the diffs between the failed job and the next one,
with all timestamps made the same to allow "diff" to work:
======================================================================
3c3
< 02/13/25 19:..:.. SECMAN: command 1112 QMGMT_WRITE_CMD to local schedd from TCP port 18611 (blocking).
---
> 02/13/25 19:..:.. SECMAN: command 1112 QMGMT_WRITE_CMD to local schedd from TCP port 9677 (blocking).
15c15
< 02/13/25 19:..:.. AUTHENTICATE_FS: used dir /tmp/FS_XXXVsoQeP, status: 1
---
> 02/13/25 19:..:.. AUTHENTICATE_FS: used dir /tmp/FS_XXXUdfcMH, status: 1
25c25
< 02/13/25 19:..:.. SECMAN: added session htc24s-ce:26121:1739471126:396 to cache for 60 seconds (3600s lease).
---
> 02/13/25 19:..:.. SECMAN: added session htc24s-ce:26121:1739471250:397 to cache for 60 seconds (3600s lease).
28,29c28,34
< ERROR: Failed to commit job submission into the queue.
< ERROR: Failed to create new User record for condor@xxxxxxxx
---
> 1 job(s) submitted to cluster 139.
> 02/13/25 19:..:.. SECMAN: command 421 RESCHEDULE to local schedd from TCP port 22815 (blocking).
> 02/13/25 19:..:.. SECMAN: using session htc24s-ce:26121:1739471250:397 for {<188.184.72.210:9618?addrs=188.184.72.210-9618+[2001-1458-d00-4e--100-35b]-9618&alias=htc24s-ce.cern.ch&noUDP&sock=schedd_26081_f9a3>,<421>}.
> 02/13/25 19:..:.. SECMAN: resume, NOT reauthenticating.
> 02/13/25 19:..:.. SECMAN: successfully enabled encryption!
> 02/13/25 19:..:.. SECMAN: successfully enabled message authenticator!
> 02/13/25 19:..:.. SECMAN: startCommand succeeded.
======================================================================
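For anyone wanting to repeat the comparison, the timestamp masking can be sketched roughly as follows. This is only an assumption about one way to do it, not necessarily how the diff above was produced; the function name and the log file names are hypothetical.

```shell
# Replace the minutes and seconds of every HTCondor-style timestamp
# (MM/DD/YY HH:MM:SS) with dots, so two runs can be compared with diff.
mask_ts() {
    sed -E 's#([0-9]{2}/[0-9]{2}/[0-9]{2} [0-9]{2}):[0-9]{2}:[0-9]{2}#\1:..:..#g'
}

# Example (bash; failed.log and ok.log are hypothetical file names):
#   diff <(mask_ts < failed.log) <(mask_ts < ok.log)
```

The hour is kept on purpose, so the two runs have to fall within the same hour for the timestamps to compare equal.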
I am trying with D_SECURITY:2 now...
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Cole Bollig via HTCondor-users <htcondor-users@xxxxxxxxxxx>
Sent: Thursday, February 13, 2025 6:00 PM
To: htcondor-users@xxxxxxxxxxx <htcondor-users@xxxxxxxxxxx>
Cc: Cole Bollig <cabollig@xxxxxxxx>
Subject: Re: [HTCondor-users] v24.0.4 condor_submit only works sometimes
Hi Maarten,
I messed up the flag for turning on debugging for condor_submit. It is supposed to be -debug:D_ALWAYS,D_SECURITY. The subtle difference is that the flag and the levels are separated by a colon rather than a space. With that in mind, just setting D_SECURITY should produce the helpful output.
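To capture the failing attempt together with the preceding successful ones, a loop along these lines might be handy. It is only a sketch: the helper name and the output directory are made up, and unlike the earlier test loop it does not sleep between attempts.

```shell
# Run a command up to MAX_TRIES times, keeping each attempt's output in its
# own file, and stop at the first non-zero exit status.
# Usage: run_until_failure MAX_TRIES OUTDIR COMMAND [ARGS...]
run_until_failure() {
    max=$1; outdir=$2; shift 2
    mkdir -p "$outdir" || return 2
    i=1
    while [ "$i" -le "$max" ]; do
        if ! "$@" > "$outdir/attempt-$i.log" 2>&1; then
            echo "attempt $i failed; output kept in $outdir/attempt-$i.log"
            return 1
        fi
        i=$((i + 1))
    done
    echo "all $max attempts succeeded"
    return 0
}

# Example (the output directory is a hypothetical choice):
#   run_until_failure 100 /tmp/submit-logs \
#       condor_submit -debug:D_ALWAYS,D_SECURITY my-test.jdl
```

The last two attempt files are then the success case and the failure case to compare.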
-Cole Bollig
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Cole Bollig via HTCondor-users <htcondor-users@xxxxxxxxxxx>
Sent: Thursday, February 13, 2025 8:15 AM
To: htcondor-users@xxxxxxxxxxx <htcondor-users@xxxxxxxxxxx>
Cc: Cole Bollig <cabollig@xxxxxxxx>
Subject: Re: [HTCondor-users] v24.0.4 condor_submit only works sometimes
Hi Maarten,
Just to confirm: you see this issue only on a v24.0.x release, and things work as expected on v24.x and v23.x releases? Do you see any tokens listed if you run condor_token_list as a regular user rather than as root? Would you be willing to try the failed submission with higher debugging levels: -debug D_ALWAYS,D_SECURITY:2
-Cole Bollig
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Maarten Litmaath via HTCondor-users <htcondor-users@xxxxxxxxxxx>
Sent: Tuesday, February 11, 2025 11:59 AM
To: geonmo@xxxxxxxxxxx <geonmo@xxxxxxxxxxx>; htcondor-users@xxxxxxxxxxx <htcondor-users@xxxxxxxxxxx>
Cc: Maarten Litmaath <Maarten.Litmaath@xxxxxxx>
Subject: Re: [HTCondor-users] v24.0.4 condor_submit only works sometimes
Hi again,
FYI, I tried to replicate with v24.x what has been working OK for v23.x.
The use of "/tmp" as account home directory has some advantages,
but I have also been able to reproduce the problem for an account
with a normal home directory under "/home":
======================================================================
[mytest@htc24s-ce ~]$ condor_ping -debug -verbose -type schedd WRITE
02/11/25 18:48:12 recognized WRITE as authorization level, using command 60021.
02/11/25 18:48:12 Can't open directory "/etc/condor/passwords.d" as PRIV_ROOT, errno: 13 (Permission denied)
02/11/25 18:48:12 Can't open directory "/etc/condor/passwords.d" as PRIV_ROOT, errno: 13 (Permission denied)
Destination: local schedd
Remote Version: $CondorVersion: 24.0.4 2025-02-02 BuildID: 784178 PackageID: 24.0.4-1 GitSHA: c93a1052 $
Local Version: $CondorVersion: 24.0.4 2025-02-02 BuildID: 784178 PackageID: 24.0.4-1 GitSHA: c93a1052 $
Session ID: htc24s-ce:26121:1739296092:393
Instruction: WRITE
Command: 60021
Encryption: AES
Integrity: AES
Authenticated using: FS
All authentication methods: FS,TOKEN,KERBEROS,SCITOKENS
Remote Mapping: mytest@xxxxxxx
Authorized: TRUE
======================================================================
[mytest@htc24s-ce ~]$ for i in `seq 100`; do condor_submit -debug D_SECURITY my-test.jdl || break; echo == $i; sleep 1; done
[...]
== 75
Submitting job(s)02/11/25 18:47:47 Can't open directory "/etc/condor/passwords.d" as PRIV_ROOT, errno: 13 (Permission denied)
02/11/25 18:47:47 Can't open directory "/etc/condor/passwords.d" as PRIV_ROOT, errno: 13 (Permission denied)
.
1 job(s) submitted to cluster 136.
== 76
Submitting job(s)02/11/25 18:47:48 Can't open directory "/etc/condor/passwords.d" as PRIV_ROOT, errno: 13 (Permission denied)
02/11/25 18:47:48 Can't open directory "/etc/condor/passwords.d" as PRIV_ROOT, errno: 13 (Permission denied)
.
ERROR: Failed to commit job submission into the queue.
ERROR: Failed to create new User record for condor@xxxxxxxx
======================================================================
This problem has the "hallmark" of a race condition...
From: geonmo@xxxxxxxxxxx <geonmo@xxxxxxxxxxx> on behalf of "류건모" <geonmo@xxxxxxxxxxx>
Sent: Tuesday, February 11, 2025 7:25 AM
To: htcondor-users@xxxxxxxxxxx <htcondor-users@xxxxxxxxxxx>
Cc: Maarten Litmaath <Maarten.Litmaath@xxxxxxx>
Subject: RE: Re: [HTCondor-users] v24.0.4 condor_submit only works sometimes
Hello, Maarten.
The core of the problem seems to be that FS authentication is not working properly and the user is authenticated as “condor@xxxxxxx”.
Could you please check the condor_ping information as user alicesgm?
----
condor_ping -debug -verbose -type schedd WRITE
....
Authenticated using: FS
All authentication methods: TOKEN,FS
....
-------
First, check the mount options and permissions of the /tmp directory; it may be that the alicesgm account is unable to write to /tmp, or that there is an SELinux issue.
If you suspect SELinux, compare with the output below to see whether anything is missing on your side.
[root@ui20 tmp]# semanage permissive -l
Builtin Permissive Types
condor_negotiator_t condor_master_t condor_collector_t condor_procd_t condor_startd_t condor_schedd_t
As far as I know, the absence of condor_schedd_t from this list can cause failures under SELinux, because actions by types that are not registered as permissive can be blocked.
Also, could you check whether the account has an IDTOKEN issued as "condor@xxxxxxx"?
You can check this in the same way, by running condor_token_list as the alicesgm account.
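The SELinux check can be scripted roughly as below. This is just a sketch with a made-up helper name; it only looks for the type name in whatever "semanage permissive -l" prints.

```shell
# Succeed if the SELinux type given as $1 appears as a whole word in the
# "semanage permissive -l" output read from stdin.
has_permissive_type() {
    tr -s ' \t' '\n\n' | grep -qx -- "$1"
}

# Typical use, as root:
#   if semanage permissive -l | has_permissive_type condor_schedd_t; then
#       echo "condor_schedd_t is permissive"
#   else
#       echo "condor_schedd_t is NOT permissive"
#   fi
```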
Regards,
-- Geonmo