
Re: [HTCondor-users] condor_ssh_to_job & (remote) DAG



Hi again,

 

I have played a bit with Jaime's idea and changed the mapfile, from:

 

FS /(.*)/ \1@fsauth

 

To:

 

FS /(.*)/ \1@xxxxxxx

 

With just that change, and no change to the ALLOW rules, all the problems (DAG, superuser, CE) seem to disappear. We certainly need to test more, but it's interesting.
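To make the effect of the one-line change concrete, here is a minimal Python sketch of how such a mapfile line rewrites an authenticated name. It is only a simplified model, not HTCondor's actual mapfile parser: the function `map_identity` and the rule tuples are hypothetical, and `example.org` stands in for the redacted domain.

```python
import re

# Hypothetical, simplified model of a mapfile line of the form:
#   METHOD /regex/ canonical-name-template
# This is NOT HTCondor's real parser, just an illustration.
def map_identity(method, authenticated_name, rules):
    """Return the canonical identity from the first matching rule, or None."""
    for rule_method, pattern, template in rules:
        if rule_method != method:
            continue
        match = re.fullmatch(pattern, authenticated_name)
        if match:
            # Substitute \1 in the template with the captured user name.
            return match.expand(template)
    return None

# "example.org" is an assumption standing in for the redacted domain.
OLD_RULES = [("FS", r"(.*)", r"\1@fsauth")]
NEW_RULES = [("FS", r"(.*)", r"\1@example.org")]

print(map_identity("FS", "condor", OLD_RULES))
print(map_identity("FS", "condor", NEW_RULES))
```

With the old rule, an FS-authenticated "condor" ends up in the "fsauth" domain. With the new rule it ends up in the same domain as everyone else, which would explain why the identity comparisons start to succeed.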

 

Now, I was thinking that by changing only the mapfile, and not the ALLOW clauses, we are only changing what someone authenticated via FS can do, not what any Kerberos-authenticated user can or can't do. If only root on the machine uses FS auth, and this user has all powers anyway, then it seems this change would not really add any security risk. What do you think?

 

(I have also tried something like "FS /(.*)/ \1-fsauth@xxxxxxx" and then adding condor-fsauth@xxxxxxx to all the relevant ALLOW clauses. But I wasn't able to make it work, as if the string 'condor' were hard-coded in some parts of the code... or for some other reason.)

 

 

Cheers,

    Antonio

 

 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Antonio Delgado Peris via HTCondor-users
Sent: Thursday, 17 July 2025 12:47
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: Antonio Delgado Peris <Antonio.Delgado.Peris@xxxxxxx>
Subject: Re: [HTCondor-users] condor_ssh_to_job & (remote) DAG

 

Hi,

 

Adding some further information (hopefully not confusion) to the matter: there is another issue we saw when we upgraded to v24, which we hadn't gotten around to debugging yet. We originally thought it was unrelated, but we now think it probably has the same cause.

 

When we tried to upgrade an HTCondor-CE to v24, it could no longer submit to the local pool. The most relevant log lines on the schedd side were:

 

06/12/25 10:01:04 (pid:3914745) (bt:5990:13) SetEffectiveOwner: UserRec lookup for owner condor@fsauth found no match
Backtrace bt:5990:13 is
  condor_schedd(_Z22QmgmtSetEffectiveOwnerPKc+0x74b) [0x559d87adfe1b]
  condor_schedd(_Z12do_Q_requestR9QmgmtPeer+0x2431) [0x559d87b02a71]
  condor_schedd(_Z8handle_qiP6Stream+0x8c) [0x559d87ae478c]
  /lib64/libcondor_utils_24_0_7.so(_ZN10DaemonCore18CallCommandHandlerEiP6Streambbff+0x258) [0x7f4e763b11f8]
  /lib64/libcondor_utils_24_0_7.so(_ZN10DaemonCore21HandleReqPayloadReadyEP6Stream+0x11c) [0x7f4e763b154c]
  /lib64/libcondor_utils_24_0_7.so(_ZN10DaemonCore24CallSocketHandler_workerEibP6Stream+0x1e1) [0x7f4e763b5741]
  /lib64/libcondor_utils_24_0_7.so(_ZN10DaemonCore35CallSocketHandler_worker_demarshallEPv+0x21) [0x7f4e763b5951]
  /lib64/libcondor_utils_24_0_7.so(_ZN13CondorThreads8pool_addEPFvPvES0_PiPKc+0x3c) [0x7f4e761eca1c]
  /lib64/libcondor_utils_24_0_7.so(_ZN10DaemonCore6DriverEv+0x19b7) [0x7f4e763b4947]
  /lib64/libcondor_utils_24_0_7.so(_Z7dc_mainiPPc+0x134e) [0x7f4e763d631e]
  /lib64/libc.so.6(+0x295d0) [0x7f4e758295d0]
  /lib64/libc.so.6(__libc_start_main+0x80) [0x7f4e75829680]
  condor_schedd(_start+0x25) [0x559d87aa5955]
06/12/25 10:01:04 (pid:3914745) SetEffectiveOwner security violation: setting owner to delgadop@xxxxxxx when active owner is non-superuser "condor@fsauth"
06/12/25 10:01:04 (pid:3914745) condor_read(): Socket closed when trying to read 5 byte packet header from <188.185.101.26:16053>
06/12/25 10:01:04 (pid:3914745) IO: EOF reading packet header
06/12/25 10:01:04 (pid:3914745) QMGR Connection closed

 

It seems that, again, even though "condor" is listed as a superuser, Condor in fact compares against "condor@xxxxxxx", which is not "condor@fsauth". At the time, we were confused because "condor@fsauth" is included in the ALLOW_ADMINISTRATOR list, but it seems that doesn't really help here.

 

Cheers,

    Antonio

 

 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Jaime Frey via HTCondor-users
Sent: Wednesday, 16 July 2025 20:52
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: James Frey <jfrey@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] condor_ssh_to_job & (remote) DAG

 

 

 

On Jul 16, 2025, at 1:35 PM, Bockelman, Brian <BBockelman@xxxxxxxxxxxxx> wrote:

 



On Jul 16, 2025, at 12:49 PM, Ben Jones via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:

 

Hi Jaime,

 

I've looked through the config files for one of the bigbird machines and it looks like you could change the fsauth domain to cern.ch, both in the mapfile and in all of the ALLOW rules. It would not materially affect your security policy, since all of the mentions of fsauth include restricting the client to the local machine.

 

You mean in the mapfile changing:

 

FS /(.*)/ \1@fsauth

 

to I guess:

 

FS /(.*)/ \1@xxxxxxx@fsauth 

 

?

 

The ALLOW rules I'm a bit dubious about, since...

 

Do you see a problem with making that change?

 

... I'm not sure what the implication is of having root with an fsauth domain, for instance (root@xxxxxxx is not the same as a uid 0 local account). I also don't really get why we can't condor_ssh_to_job as root on the AP to a user's running job, or why a change in fsauth domain would solve that problem. I suppose I get why bejones@xxxxxxx might != bejones@fsauth (though, again, I thought UID_DOMAIN was for that), but I don't get what might be stopping root@fsauth from doing a condor_ssh_to_job to a job owned by bejones or bejones@xxxxxxx: either it's a queue super user or it isn't.

 

Hi Ben,

 

You are asking the right questions!

 

Thoughts:

- Unfortunately, no, that's not how UID_DOMAIN works to the best of my knowledge.

- Why on earth does condor_dagman inside the scheduler universe have to play authentication games with the schedd?  The schedd created the damn process, it knows exactly who it should be!

- "Other systems" I work with allow you to outright refuse using Unix accounts below UID 1000 (configurable), preventing Kerberos credential root@xxxxxxx (I think I might be on that mail list?) from being mistaken for UID 0.  That might be a good feature request here for FS auth.

- Remember how I said that HTCondor was trying to clean up confusion between the Unix account and the owner?  "QUEUE_SUPER_USERS" is another problem point.  It has historically referred to Unix usernames ("root") but "smells like" it should refer to identities in the schedd ("root@fsauth").  The symptoms you write smell like a bug to me.

 

There's no Kerberos credential available to the dagman process for authentication, right?  If not, then I'm drawing a blank for a workaround.

 

Today, we discussed having the schedd give the condor_dagman process a credential that identifies it as the job owner to the schedd when it connects. That's the real solution to this problem. But that's a new feature for 24.X.

 

Agreed, QUEUE_SUPER_USERS has not been fully rationalized to the new fully qualified user identities. I believe the problem here is that the schedd appends UID_DOMAIN, which is 'cern.ch', to the bare usernames in QUEUE_SUPER_USERS. Clients that authenticate via filesystem get domain 'fsauth', so the full name doesn't match.
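The suspected mismatch can be sketched in a few lines of Python. This is only an illustration of the behavior described above, under the assumption that bare names in QUEUE_SUPER_USERS are qualified with UID_DOMAIN before an exact string comparison. The helper names `qualify` and `is_queue_super_user` are hypothetical and are not schedd code.

```python
# Hypothetical illustration, not the schedd's actual implementation.
def qualify(name, uid_domain):
    """Append uid_domain to a bare user name; leave qualified names alone."""
    return name if "@" in name else f"{name}@{uid_domain}"

def is_queue_super_user(identity, queue_super_users, uid_domain):
    """Exact match of a fully qualified identity against the qualified list."""
    qualified = {qualify(name, uid_domain) for name in queue_super_users}
    return identity in qualified

SUPER_USERS = ["root", "condor"]   # bare names, as in QUEUE_SUPER_USERS
UID_DOMAIN = "cern.ch"

# FS-authenticated client mapped into the 'fsauth' domain: no match.
print(is_queue_super_user("condor@fsauth", SUPER_USERS, UID_DOMAIN))
# Same user mapped into UID_DOMAIN instead: match.
print(is_queue_super_user("condor@cern.ch", SUPER_USERS, UID_DOMAIN))
```

Under this model, mapping FS-authenticated clients into the UID_DOMAIN would make the comparison succeed without touching QUEUE_SUPER_USERS.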

 

Scheduler universe jobs get an AFS token but don't get the user's Kerberos credential. I don't see a reason for them not to have a copy of the Kerberos credential; it's just some implementation work.

 

 - Jaime