
Re: [HTCondor-users] Authentication Issue between HTCondorCE Schedd and Batch Schedd



Hi Chris,

Thanks for the additional details.

Looking at your update, before we dive deeper into the mutual authentication/FS auth issue, we should probably check the OS-level crypto policies on EL10 first.

As you noted, the TLS handshake is failing early and yielding an "unauthenticated" status. In EL10, the default security profiles are extremely strict and will silently reject older cryptographic algorithms, specifically SHA-1 signatures or shorter key lengths (which are still surprisingly common in some WLCG CA certificates, CRLs, or VOMS server responses).

If any certificate in the trust chain relies on SHA-1, EL10's OpenSSL will block the handshake entirely before HTCondor can even attempt to map the DN.

To verify if this is the culprit, I highly recommend checking and temporarily relaxing the system-wide crypto policy to see if VOMS authentication suddenly starts working.

You can check and change the policy using these commands on your EL10 machine:

# 1. Check the current active policy

update-crypto-policies --show


# 2. Set the policy to LEGACY to allow older algorithms like SHA-1

sudo update-crypto-policies --set LEGACY


If switching to LEGACY resolves the "unauthenticated" error, it confirms that the issue is due to the modern OS security standards blocking your VOMS/Grid certificates.
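
Before or alongside the policy change, it may help to look for SHA-1 signatures directly. The following is a minimal sketch, not a definitive check: the function name `list_sha1_certs` is made up for illustration, and the directory `/etc/grid-security/certificates` and the `.pem` suffix are assumptions about where the CA certificates live on your host.

```shell
# Hypothetical helper: print every certificate in a directory whose
# signature algorithm mentions SHA-1.
list_sha1_certs() {
    dir=$1
    for cert in "$dir"/*.pem; do
        # Skip the unexpanded glob when the directory is empty or missing.
        [ -f "$cert" ] || continue
        alg=$(openssl x509 -in "$cert" -noout -text 2>/dev/null \
              | grep -m1 'Signature Algorithm')
        case $alg in
            *sha1*|*SHA1*) echo "$cert" ;;
        esac
    done
}

# Assumed location of the grid CA certificates; adjust as needed.
list_sha1_certs /etc/grid-security/certificates
```

An empty result only says that none of the scanned files is SHA-1 signed; CRLs and VOMS server certificates would need a separate look. Once the test is done, `sudo update-crypto-policies --set DEFAULT` puts the policy back. Crypto policies are applied when an application starts, so the daemons need a restart (or the host a reboot) after each change.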

Please let us know how this test goes!


Best regards,





------ Original Message ------

From: Chris Brew - STFC UKRI <chris.brew@xxxxxxxxxx>

To: "Geonmo" <geonmo@xxxxxxxxxxx>, "htcondor-users@xxxxxxxxxxx" <htcondor-users@xxxxxxxxxxx>

Date: 2026-05-08 (Fri) 18:43:06

Subject: Re: [HTCondor-users] Authentication Issue between HTCondorCE Schedd and Batch Schedd


Thanks for this.

On the Job Router config, we've got the first two set and have tried setting JOB_ROUTER_SCHEDD2_POOL to both $(FULL_HOSTNAME):9618 and our batch system's Central Manager, as suggested by the deployment documentation; neither changes this behaviour.

And we've got QUEUE_SUPER_USER_MAY_IMPERSONATE = .* in the config of our batch system Schedd. Indeed, when we first started we got errors about a non-super user "condor" in the SchedLog and added "condor" to QUEUE_SUPER_USERS in the Schedd config. That got rid of the non-super user error in the SchedLog but left the other errors.

On the VOMS mapping, it appears to be more subtle than that. As far as I can tell it isn't passing the <DN>,<VOMS ATTRIBUTES> to the mapping comparison at all, just the string "unauthenticated", which suggests something farther up the chain isn't accepting the certificate. However, if I just match ".*" to an account, the resulting job has the correct X509* ClassAds set, which suggests that at some level I've got the /etc/grid-security VOMS information set correctly (it's set by puppet and identical to all our other grid systems).

It may be we don't need VOMS authentication, but a reasonable fraction of the jobs to our existing ArcCEs use VOMS authentication, and since it should "just work" I wasn't going to make them transition.

Is anyone else running this on EL10? Am I being too ambitious trying to go there?

We've tried adding D_SECURITY to SCHEDD_DEBUG on the condor-ce but that didn't seem to log anything about the certificate decoding.

On the batch system Schedd it shows the Job_Router authenticating but no sign of it attempting to impersonate, just the failure to create the user record for the illegal condor user. The Job_Router is using FS to authenticate with the batch Schedd but I don't think that should make any difference.

While I don't think it's the cause of the failures, the

Failed to open /var/lib/condor/spool/job_queue.log: errno=13

error does seem to point to some sort of mismatch between the condor-ce and condor batch setup: the batch setup has that file as root:root with mode 600, and as far as I can tell condor-ce is trying to read it as "condor". Setting the mode to 644 gets rid of the error but changes nothing else. SELinux is disabled.
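
On Linux, errno=13 is EACCES (permission denied), so the ownership and mode of the spool directory and the queue log are the first things to compare. A minimal sketch of that check follows; `show_perms` is a hypothetical helper name, and `stat -c` assumes GNU coreutils.

```shell
# Hypothetical helper: print owner:group, octal mode and path for each argument.
show_perms() {
    for p in "$@"; do
        if [ -e "$p" ]; then
            stat -c '%U:%G %a %n' "$p"
        else
            echo "missing or not visible: $p"
        fi
    done
}

# Paths taken from the error message above.
show_perms /var/lib/condor/spool /var/lib/condor/spool/job_queue.log
```

Running it on both the CE and the batch Schedd host (if they differ) shows whether the account the Job Router runs as can traverse the directory and read the file.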

So I'm a bit stuck at the moment.

Thanks,
Chris.

On 08/05/2026, 01:49, "Geonmo" <geonmo@xxxxxxxxxxx> wrote:

Hi Chris,

To address the identity mismatch and Job Router errors you're seeing, I would suggest verifying the following configuration points.

1. Job Router and Permissions

First, please ensure that the Job Router on your HTCondor-CE is correctly pointing to the local Batch Schedd and has the necessary permissions to hand off jobs. Check if these are explicitly set in your condor-ce/conf.d:

      
# HTCondor-CE side
JOB_ROUTER_SCHEDD2_SPOOL = /var/lib/condor/spool
JOB_ROUTER_SCHEDD2_NAME = $(FULL_HOSTNAME)
JOB_ROUTER_SCHEDD2_POOL = $(FULL_HOSTNAME):9618

And on the Local Batch (HTCondor) side, ensure the Schedd allows the CE's routing process to impersonate the job owner:


# Local Batch side (required for mapping condor@<domain> to a local user such as sgmcms55; this account must exist on the CE, the LRMS and the WNs)
QUEUE_SUPER_USER_MAY_IMPERSONATE = .*
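
To confirm which values the daemons actually see, the settings above can be queried with `condor_ce_config_val` on the CE side and `condor_config_val` on the batch side. This is only a sketch: `show_knobs` is a hypothetical wrapper, and it assumes both tools are installed on the same host.

```shell
# Hypothetical wrapper: print each knob's effective value and where it is defined.
show_knobs() {
    tool=$1; shift
    if ! command -v "$tool" >/dev/null 2>&1; then
        echo "skip: $tool not found on this host"
        return 0
    fi
    for knob in "$@"; do
        "$tool" -verbose "$knob" || echo "$knob is not defined"
    done
}

# HTCondor-CE side
show_knobs condor_ce_config_val JOB_ROUTER_SCHEDD2_SPOOL JOB_ROUTER_SCHEDD2_NAME JOB_ROUTER_SCHEDD2_POOL
# Local batch side
show_knobs condor_config_val QUEUE_SUPER_USERS QUEUE_SUPER_USER_MAY_IMPERSONATE
```

The `-verbose` output names the file each value comes from, which helps spot a later config file overriding the setting you expected.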

2. SSL/VOMS Mapping (HTCondor v23 vs. v24)

Regarding the VOMS authentication issues, it is important to note that the way HTCondor handles SSL/Certificate mapping changed significantly between v23 and v24+.


In newer versions, the default behavior for mapping X.509 certificates has become stricter. It often requires comparing not just the DN but also additional attributes such as VOMS roles, which can mean adding commas or specific formatting to your mapfiles that was not necessary before.


You can find the detailed requirements and the new mapping logic in this EGI documentation: HTCondor and SSL authentication


Please check if your certificate DNs and roles are being mapped correctly under the new version's rules.

3. Authentication Method

Lastly, out of curiosity, is there a specific reason your site is prioritizing VOMS/SSL over SCITOKENS for this setup? 


Since many grid infrastructures are migrating toward tokens, knowing your requirements might help us suggest a more streamlined authentication path.


Hope this helps!


Best regards,

-- Geonmo


------ Original Message ------

From: Chris Brew - STFC UKRI via HTCondor-users <htcondor-users@xxxxxxxxxxx>

To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>

Cc: Chris Brew - STFC UKRI <chris.brew@xxxxxxxxxx>

Date: 2026-05-07 (Thu) 23:07:29

Subject: [HTCondor-users] Authentication Issue between HTCondorCE Schedd and Batch Schedd


Hi,

I've still not got anywhere with the VOMS authentication (I'll post some more info soon), but token auth seems to be working, in that jobs get into the condor-ce Schedd and are visible with condor_ce_q. However, they don't make it as far as the Schedd for the batch system.

I just copied the config for that Schedd from the Schedd on our existing ArcCEs, so it's possible it's missing some necessary config for accepting jobs from the Job_Router.

I've got three recurring errors. One in the /var/log/condor-ce/JobRouterLog:

05/07/26 14:44:29 Failed to commit job submission :
05/07/26 14:44:29 JobRouter failure (src="" failed to submit job

Which is matched with this one in /var/log/condor/SchedLog:

05/07/26 14:44:29 (pid:923597) (bt:ccbf:13) SetEffectiveOwner: UserRec lookup for owner condor@xxxxxxxxxxx found no match
05/07/26 14:44:29 (pid:923597) Owner condor@xxxxxxxxxxx has no JobQueueUserRec
05/07/26 14:44:29 (pid:923597) Creating pending JobQueueUserRec for owner condor@xxxxxxxxxxx
05/07/26 14:44:29 (pid:923597) Error: MakeUserRec with illegal identifiers: user=condor@xxxxxxxxxxx, os_user=condor
05/07/26 14:44:29 (pid:923597) NewCluster(): failed to create new User record for condor@xxxxxxxxxxx

And then another more frequent one every ten seconds in /var/log/condor-ce/JobRouterLog:

05/07/26 14:47:09 Failed to open /var/lib/condor/spool/job_queue.log: errno=13

Which looks to me like the JobRouter is trying to put jobs into the queue as (the illegal) user condor rather than the accounts the tokens are mapped to in the condor-ce Schedd (they show up there as the correctly mapped local user).

Does anyone have any idea where I should be looking?

Thanks,
Chris.