
Re: [HTCondor-users] v24.0.4 condor_submit only works sometimes



Hi Maarten,

Just to confirm: do you see this issue only on a v24.0.x release, while things work as expected on v24.x and v23.x releases? Do you see any tokens listed if you run condor_token_list as a regular user rather than root? Would you be willing to retry the failed submission with higher debugging levels: -debug D_ALWAYS,D_SECURITY:2
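
One way to capture such a run is sketched below: a small POSIX shell helper that runs a single submission with the debug levels given above and files the output under a name that records the outcome. The helper name submit_with_debug, the SUBMIT_CMD override and the log file names are hypothetical and exist only for this illustration. It assumes condor_submit returns a non-zero exit status on failure, as the retry loops in this thread suggest.

```sh
#!/bin/sh
# Hypothetical helper, not part of HTCondor.
# SUBMIT_CMD is an assumption added so the command can be swapped out.
SUBMIT_CMD="${SUBMIT_CMD:-condor_submit}"

# submit_with_debug JDL [OUTDIR]
# Runs one submission with the debug levels from the message above and
# stores the combined output in OUTDIR/submit-ok.log or OUTDIR/submit-fail.log.
submit_with_debug() {
    jdl="$1"
    outdir="${2:-.}"
    tmp="$outdir/submit-debug.$$"
    if "$SUBMIT_CMD" -debug D_ALWAYS,D_SECURITY:2 "$jdl" >"$tmp" 2>&1; then
        mv "$tmp" "$outdir/submit-ok.log"
        echo ok
    else
        mv "$tmp" "$outdir/submit-fail.log"
        echo fail
    fi
}
```

Calling submit_with_debug my-test.jdl repeatedly until both log files exist would give one sample of each outcome.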

-Cole Bollig

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Maarten Litmaath via HTCondor-users <htcondor-users@xxxxxxxxxxx>
Sent: Tuesday, February 11, 2025 11:59 AM
To: geonmo@xxxxxxxxxxx <geonmo@xxxxxxxxxxx>; htcondor-users@xxxxxxxxxxx <htcondor-users@xxxxxxxxxxx>
Cc: Maarten Litmaath <Maarten.Litmaath@xxxxxxx>
Subject: Re: [HTCondor-users] v24.0.4 condor_submit only works sometimes
 
Hi again,
FYI, I tried to replicate with v24.x what has been working OK for v23.x.

The use of "/tmp" as account home directory has some advantages,
but I have also been able to reproduce the problem for an account
with a normal home directory under "/home":

======================================================================
[mytest@htc24s-ce ~]$ condor_ping -debug -verbose -type schedd WRITE
02/11/25 18:48:12 recognized WRITE as authorization level, using command 60021.
02/11/25 18:48:12 Can't open directory "/etc/condor/passwords.d" as PRIV_ROOT, errno: 13 (Permission denied)
02/11/25 18:48:12 Can't open directory "/etc/condor/passwords.d" as PRIV_ROOT, errno: 13 (Permission denied)
Destination:                 local schedd
Remote Version:              $CondorVersion: 24.0.4 2025-02-02 BuildID: 784178 PackageID: 24.0.4-1 GitSHA: c93a1052 $
Local  Version:              $CondorVersion: 24.0.4 2025-02-02 BuildID: 784178 PackageID: 24.0.4-1 GitSHA: c93a1052 $
Session ID:                  htc24s-ce:26121:1739296092:393
Instruction:                 WRITE
Command:                     60021
Encryption:                  AES
Integrity:                   AES
Authenticated using:         FS
All authentication methods:  FS,TOKEN,KERBEROS,SCITOKENS
Remote Mapping:              mytest@xxxxxxx
Authorized:                  TRUE
======================================================================
[mytest@htc24s-ce ~]$ for i in `seq 100`; do condor_submit -debug D_SECURITY my-test.jdl || break; echo == $i; sleep 1; done
[...]
== 75
Submitting job(s)02/11/25 18:47:47 Can't open directory "/etc/condor/passwords.d" as PRIV_ROOT, errno: 13 (Permission denied)
02/11/25 18:47:47 Can't open directory "/etc/condor/passwords.d" as PRIV_ROOT, errno: 13 (Permission denied)
.
1 job(s) submitted to cluster 136.
== 76
Submitting job(s)02/11/25 18:47:48 Can't open directory "/etc/condor/passwords.d" as PRIV_ROOT, errno: 13 (Permission denied)
02/11/25 18:47:48 Can't open directory "/etc/condor/passwords.d" as PRIV_ROOT, errno: 13 (Permission denied)
.
ERROR: Failed to commit job submission into the queue.
ERROR: Failed to create new User record for condor@xxxxxxxx
======================================================================

This problem has the "hallmark" of a race condition...
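
To put a number on how often the submission fails, the output of a loop like the one above can be tallied. A minimal sketch, assuming the success and failure messages are exactly the ones shown in this thread; the function name count_outcomes is hypothetical:

```sh
#!/bin/sh
# Hypothetical helper: count successful and failed submissions in
# condor_submit output read from standard input.
# Possible use (not executed here):
#   for i in `seq 100`; do condor_submit my-test.jdl; sleep 1; done 2>&1 | count_outcomes
count_outcomes() {
    awk '
        /job\(s\) submitted to cluster/          { ok++ }
        /ERROR: Failed to commit job submission/ { fail++ }
        END { printf "ok=%d fail=%d\n", ok, fail }
    '
}
```

A failure share that stays roughly constant over many attempts would fit the race-condition picture, although the tally alone cannot prove it.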



From: geonmo@xxxxxxxxxxx <geonmo@xxxxxxxxxxx> on behalf of "류건모" <geonmo@xxxxxxxxxxx>
Sent: Tuesday, February 11, 2025 7:25 AM
To: htcondor-users@xxxxxxxxxxx <htcondor-users@xxxxxxxxxxx>
Cc: Maarten Litmaath <Maarten.Litmaath@xxxxxxx>
Subject: RE: Re: [HTCondor-users] v24.0.4 condor_submit only works sometimes
 

Hello, Maarten.


The core of the problem seems to be that FS authentication is not working properly and the user is authenticated as “condor@xxxxxxx”. 



Could you please check the condor_ping information as user alicesgm?

----

condor_ping -debug -verbose -type schedd WRITE


....

Authenticated using:         FS

All authentication methods: TOKEN,FS

....

-------


First, check the mount options and permissions of the /tmp directory: it may be that the alicesgm account cannot write to /tmp, or that SELinux is interfering.


If you suspect SELinux, check the information below to see if you missed anything.


[root@ui20 tmp]# semanage permissive -l


Builtin Permissive Types 


condor_negotiator_t

condor_master_t

condor_collector_t

condor_procd_t

condor_startd_t

condor_schedd_t


As far as I know, the absence of condor_schedd_t from this list can cause failures, because SELinux may block actions of types that are not registered as permissive.
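
The check for a missing permissive type can be scripted. The sketch below reads the output of semanage permissive -l on standard input and reports whether a given type is listed. The function name has_permissive_type is hypothetical, and whether SELinux plays any role in this problem is only a hypothesis.

```sh
#!/bin/sh
# Hypothetical helper: succeed if TYPE appears as the first word of any
# input line.
# Possible use as root (not executed here):
#   semanage permissive -l | has_permissive_type condor_schedd_t
has_permissive_type() {
    awk -v t="$1" '$1 == t { found = 1 } END { exit !found }'
}
```

The exit status is 0 when the type is listed and 1 otherwise, so the function can be used directly in an if statement.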



Also, could you check whether the account has an idtoken issued as "condor@xxxxxxx"?


Similarly, you can check by doing a condor_token_list on the alicesgm account. 


Regards,


-- Geonmo

────── Original Message ──────

From: Maarten Litmaath via HTCondor-users <htcondor-users@xxxxxxxxxxx>

To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>

Cc: Maarten Litmaath <Maarten.Litmaath@xxxxxxx>

Date received: 2025-02-11 (Tue) 09:13:56

Subject: Re: [HTCondor-users] v24.0.4 condor_submit only works sometimes




Hi Cole & Geonmo,
here are my answers:
  • SCHEDD_HOST is neither defined on the v24.0.4 Submit Node (which fails),
    nor on the v24.4.0 Submit Node (which works).  Again, the configurations
    come straight out of the Admin Quick Start Guide, nothing more.

  • The owner of a successfully submitted job is the local account under which
    the job was submitted ("alicesgm").

  • With "-debug D_SECURITY" there does not seem to be much of a clue:
[alicesgm@htc24s-ce ~]$ condor_submit -debug D_SECURITY my-test.jdl 
Submitting job(s)02/11/25 00:41:50 Can't open directory "/etc/condor/passwords.d" as PRIV_ROOT, errno: 13 (Permission denied)
02/11/25 00:41:50 Can't open directory "/etc/condor/passwords.d" as PRIV_ROOT, errno: 13 (Permission denied)
.
ERROR: Failed to commit job submission into the queue.
ERROR: Failed to create new User record for condor@xxxxxxxx
[alicesgm@htc24s-ce ~]$ 

  • The various parameters:
[alicesgm@htc24s-ce ~]$ condor_config_val -verbose ALLOW_WRITE SCHEDD_NAME \
QUEUE_SUPER_USERS SEC_DEFAULT_AUTHENTICATION_METHODS \
SEC_DAEMON_AUTHENTICATION_METHODS SEC_CLIENT_AUTHENTICATION_METHODS \
ALLOW_DAEMON TRUST_DOMAIN

ALLOW_WRITE = *
 # at: /etc/condor/config.d/01-submit.config, line 3, use SECURITY:recommended_v24_0+12
 # raw: ALLOW_WRITE = *

Not defined: SCHEDD_NAME
 # at: <Default>
 # raw: SCHEDD_NAME = 

QUEUE_SUPER_USERS = root, condor
 # at: <Default>
 # raw: QUEUE_SUPER_USERS = root, condor

SEC_DEFAULT_AUTHENTICATION_METHODS = FS,IDTOKENS,KERBEROS,SCITOKENS,SSL
 # at: <Default>
 # raw: SEC_DEFAULT_AUTHENTICATION_METHODS = FS,IDTOKENS,KERBEROS,SCITOKENS,SSL

Not defined: SEC_DAEMON_AUTHENTICATION_METHODS

SEC_CLIENT_AUTHENTICATION_METHODS = FS,IDTOKENS,KERBEROS,SCITOKENS,SSL,ANONYMOUS
 # at: /etc/condor/config.d/01-submit.config, line 3, use SECURITY:get_htcondor_idtokens+9
 # raw: SEC_CLIENT_AUTHENTICATION_METHODS = $(SEC_DEFAULT_AUTHENTICATION_METHODS),ANONYMOUS

ALLOW_DAEMON = condor@*  condor@password
 # at: /etc/condor/config.d/01-submit.config, line 3, use SECURITY:recommended_v24_0+10
 # raw: ALLOW_DAEMON = condor@*  condor@password

TRUST_DOMAIN = htc24s-cm.cern.ch
 # at: /etc/condor/config.d/01-submit.config, line 3, use SECURITY:get_htcondor_idtokens+20
 # raw: TRUST_DOMAIN = $(CONDOR_HOST)

[alicesgm@htc24s-ce ~]$ 

  • And the tokens:
[root@htc24s-ce ~]# condor_token_list && condor_token_list -dir /etc/condor-ce/tokens.d
Header: {"alg":"HS256","kid":"POOL"} Payload: {"iat":1739038309,"iss":"htc24s-cm.cern.ch","jti":"b5175124e4a8e4c41d4141e25e0b0633","sub":"condor@xxxxxxxxxxxxxxxxx"} File: /etc/condor/tokens.d/condor@xxxxxxxxxxxxxxxxx
Header: {"alg":"HS256","kid":"POOL"} Payload: {"iat":1739038309,"iss":"htc24s-cm.cern.ch","jti":"b5175124e4a8e4c41d4141e25e0b0633","sub":"condor@xxxxxxxxxxxxxxxxx"} File: /etc/condor-ce/tokens.d/condor@xxxxxxxxxxxxxxxxx
[root@htc24s-ce ~]# 
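
The identity a token was issued for is the "sub" field of the payload printed by condor_token_list. A small sketch for pulling single fields out of lines in the format shown above; the function name token_field is hypothetical, and the sed pattern assumes string-valued fields without embedded double quotes:

```sh
#!/bin/sh
# Hypothetical helper: print the value of one string field (e.g. sub, iss)
# for each condor_token_list output line read from standard input.
# Possible use (not executed here):
#   condor_token_list | token_field sub
token_field() {
    sed -n "s/.*\"$1\":\"\([^\"]*\)\".*/\1/p"
}
```

Running it once as root and once as the submitting user would show whether the two see the same token identities.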



From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Cole Bollig via HTCondor-users <htcondor-users@xxxxxxxxxxx>
Sent: Monday, February 10, 2025 4:40 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: Cole Bollig <cabollig@xxxxxxxx>
Subject: Re: [HTCondor-users] v24.0.4 condor_submit only works sometimes
 
Hi Maarten,

In addition to the information Geonmo asked you to check, is the configuration value SCHEDD_HOST defined (condor_config_val -v SCHEDD_HOST), and when a job submission succeeds, who is the owner in the job's ClassAd?

Another thing that might be helpful/interesting is comparing the output of a successful and failed job submission when doing condor_submit -debug D_SECURITY <submit file>.
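
Such a comparison is easier once the timestamps are removed, since they differ on every run. A sketch under that assumption; strip_ts and compare_logs are hypothetical names, and the timestamp pattern is the MM/DD/YY HH:MM:SS form seen in the debug output in this thread. Other per-run values (session IDs, for example) may still show up as differences.

```sh
#!/bin/sh
# Hypothetical helpers for comparing two saved condor_submit debug outputs.

# strip_ts: remove "MM/DD/YY HH:MM:SS " timestamps anywhere in each line.
strip_ts() {
    sed -E 's#[0-9]{2}/[0-9]{2}/[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2} ##g'
}

# compare_logs FILE_A FILE_B: diff the two files with timestamps removed.
# Returns the exit status of diff (0 = identical, 1 = different).
compare_logs() {
    a=$(mktemp)
    b=$(mktemp)
    strip_ts <"$1" >"$a"
    strip_ts <"$2" >"$b"
    rc=0
    diff "$a" "$b" || rc=$?
    rm -f "$a" "$b"
    return "$rc"
}
```

The two input files would be the saved output of one successful and one failed run of condor_submit -debug D_SECURITY.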

-Cole Bollig

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of "류건모" <geonmo@xxxxxxxxxxx>
Sent: Sunday, February 9, 2025 8:00 PM
To: htcondor-users@xxxxxxxxxxx <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] v24.0.4 condor_submit only works sometimes
 

Hello, Maarten.


Could you share some variables using condor_config_val?


  • ALLOW_WRITE
  • SCHEDD_NAME
  • QUEUE_SUPER_USERS
  • SEC_DEFAULT_AUTHENTICATION_METHODS (+ SEC_DAEMON_AUTHENTICATION_METHODS, if it exists)
  • SEC_CLIENT_AUTHENTICATION_METHODS
  • ALLOW_DAEMON
  • TRUST_DOMAIN

In addition, could you share the idtoken information?

[In a root shell: condor_token_list && condor_token_list -dir /etc/condor-ce/tokens.d]


The error we experienced is a little different from the message you showed: the user identity in the IDTOKENS used by the HTCondor-CE daemon was not in HTCondor's ALLOW_WRITE list, so it was rejected.


I solved it by simply overwriting the IDTOKENS in /etc/condor/tokens.d/ with those from /etc/condor-ce/tokens.d/, but I don't know if that is the right solution.


However, this seems to be an issue that arises when submitting jobs to HTCondor via HTCondor-CE, not an explanation of why submission to HTCondor itself fails.


Regards,


-- Geonmo




────── Original Message ──────

From: Maarten Litmaath via HTCondor-users <htcondor-users@xxxxxxxxxxx>

To: "htcondor-users@xxxxxxxxxxx" <htcondor-users@xxxxxxxxxxx>

Cc: Maarten Litmaath <Maarten.Litmaath@xxxxxxx>

Date received: 2025-02-09 (Sun) 22:19:58

Subject: Re: [HTCondor-users] v24.0.4 condor_submit only works sometimes




Hi again,
with an HTCondor CE installed in addition on the Submit Node,
jobs are accepted by the CE, but refused by the latter's Schedd:

02/09/25 14:07:15 (pid:10651) SetEffectiveOwner security violation: 
attempting to set owner to dis-allowed value alicesgm@xxxxxxxxxxxxxxxxx

Further advice would be appreciated, thanks!



From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Maarten Litmaath via HTCondor-users <htcondor-users@xxxxxxxxxxx>
Sent: Sunday, February 9, 2025 1:35 PM
To: htcondor-users@xxxxxxxxxxx <htcondor-users@xxxxxxxxxxx>
Cc: Maarten Litmaath <Maarten.Litmaath@xxxxxxx>
Subject: Re: [HTCondor-users] v24.0.4 condor_submit only works sometimes
 
Hi again,
using the current version in the Feature Channel, v24.4.0, all works fine,
while the LTS Channel has the problem described below.

We do not want to advise our sites to switch to the Feature Channel,
because we usually prefer the stability of the LTS Channel...

The two Channels appear to have some unwanted difference,
for which I did not yet find a clue in the Feature Channel release notes...



From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Maarten Litmaath via HTCondor-users <htcondor-users@xxxxxxxxxxx>
Sent: Sunday, February 9, 2025 12:30 PM
To: htcondor-users@xxxxxxxxxxx <htcondor-users@xxxxxxxxxxx>
Cc: Maarten Litmaath <Maarten.Litmaath@xxxxxxx>
Subject: [HTCondor-users] v24.0.4 condor_submit only works sometimes
 
Dear HTCondor experts,
I have set up a v24.0.4 mini cluster on Alma 9 using the Admin Quick Start Guide:


As an unprivileged user on the Submit Node, condor_submit fails as shown:

======================================================================
[alicesgm@htc24s-ce ~]$ cat my-test.jdl 
cmd = my-test.sh
output = my-test.out.$(ClusterId)
error  = my-test.err.$(ClusterId)
log = my-test.log.$(ClusterId)
+MaxMemory = 50
queue 1
[alicesgm@htc24s-ce ~]$ condor_submit my-test.jdl 
Submitting job(s).
ERROR: Failed to commit job submission into the queue.
ERROR: Failed to create new User record for condor@xxxxxxxx
[alicesgm@htc24s-ce ~]$ 
======================================================================

If I keep trying, though, eventually it works:

======================================================================
[alicesgm@htc24s-ce ~]$ for i in `seq 30`; do condor_submit my-test.jdl &&
 break; sleep 61; done &>> log-$$.txt < /dev/null &
[1] 33484
[alicesgm@htc24s-ce ~]$ tail -f log-$$.txt
Submitting job(s).
ERROR: Failed to commit job submission into the queue.
ERROR: Failed to create new User record for condor@xxxxxxxx
Submitting job(s).
ERROR: Failed to commit job submission into the queue.
ERROR: Failed to create new User record for condor@xxxxxxxx
Submitting job(s).
ERROR: Failed to commit job submission into the queue.
ERROR: Failed to create new User record for condor@xxxxxxxx
Submitting job(s).
ERROR: Failed to commit job submission into the queue.
ERROR: Failed to create new User record for condor@xxxxxxxx
Submitting job(s).
1 job(s) submitted to cluster 19.
======================================================================

That job then runs fine, while the next job submission will fail again, etc.

There appear to be two problems here:

1) The Admin Quick Start Guide gives me a cluster that does not work.

2) Due to some bug, job submissions sometimes get through nonetheless.

Advice would be appreciated, thanks!