
Re: [HTCondor-users] Authentication Issue between HTCondorCE Schedd and Batch Schedd



Thanks for this.

On the Job Router config, we've got the first two set and have tried setting JOB_ROUTER_SCHEDD2_POOL to both $(FULL_HOSTNAME):9618 and our batch system's Central Manager, as suggested by the deployment documentation. Neither changes this behaviour.

And we've got QUEUE_SUPER_USER_MAY_IMPERSONATE = .* in the config of our batch system Schedd. Indeed, when we first started we got errors about a non-super user condor in the SchedLog and added 'condor' to QUEUE_SUPER_USERS in the Schedd config. That got rid of the non-super user error in the SchedLog but left the other errors.

On the VOMS mapping, it appears to be more subtle than that. As far as I can tell it isn't passing the <DN>,<VOMS ATTRIBUTES> to the mapping comparison at all, just the string "unauthenticated", which suggests something farther up the chain isn't accepting the certificate. However, if I just match ".*" to an account, the resulting job has the correct X509* ClassAds set, which suggests that at some level I've got the /etc/grid-security VOMS information set correctly (it's set by puppet and identical to all our other grid systems).
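To make the symptom concrete, here is a minimal Python sketch of why a catch-all pattern still maps the job while a DN-plus-VOMS pattern cannot. It only illustrates regular-expression behaviour; it is not HTCondor's mapfile syntax or matching code, and the DN, VO name and pattern names are hypothetical placeholders.

```python
import re

# Hypothetical identity string of the form <DN>,<VOMS ATTRIBUTES>.
dn_and_fqan = "/DC=org/DC=example/CN=Some User,/examplevo/Role=NULL/Capability=NULL"

# Hypothetical patterns: one expects a DN followed by a comma and a VOMS
# attribute, the other is the catch-all ".*" mentioned above.
patterns = {
    "vo_specific": r"^/DC=org/DC=example/[^,]*,/examplevo/.*$",
    "catch_all": r".*",
}


def matching(identity):
    """Return the names of all patterns that match the identity string."""
    return [name for name, pat in patterns.items() if re.search(pat, identity)]


print(matching(dn_and_fqan))        # ['vo_specific', 'catch_all']
print(matching("unauthenticated"))  # ['catch_all']
```

If the mapping step only ever sees "unauthenticated", no DN-based rule can fire regardless of how the commas are formatted, which would be consistent with the problem sitting before the mapping stage.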

It may be that we don't need VOMS authentication, but a reasonable fraction of the jobs to our existing ArcCEs use VOMS authentication, and since it should "just work" I wasn't going to make them transition.

Is anyone else running this on EL10? Am I being too ambitious trying to go there?

We've tried adding D_SECURITY to SCHEDD_DEBUG on the condor-ce, but that didn't seem to log anything about the certificate decoding.

On the batch system Schedd, the log shows the Job_Router authenticating but no sign of it attempting to impersonate, just the failure to create the user record for the illegal condor user. The Job_Router is using FS to authenticate with the batch Schedd, but I don't think that should make any difference.

While I don't think it's the cause of the failures, the:

Failed to open /var/lib/condor/spool/job_queue.log: errno=13

error does seem to point to some sort of mismatch between the condor-ce and condor batch setup: the batch setup has that file as root:root with mode 600, and as far as I can tell condor-ce is trying to read it as 'condor'. Setting the mode to 644 gets rid of the error but changes nothing else. SELinux is disabled.
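errno 13 is EACCES (permission denied). As a rough sketch of the check involved, the Python below applies the classic owner/group/other mode bits to a root:root file. The uid/gid value 64 for the reading account is a hypothetical placeholder, and ACLs and directory traversal permissions are deliberately ignored.

```python
import os
import stat


def can_read(mode, file_uid, file_gid, uid, gids):
    """Classic Unix read check using only owner/group/other mode bits."""
    if uid == 0:  # root bypasses the mode bits
        return True
    if uid == file_uid:
        return bool(mode & stat.S_IRUSR)
    if file_gid in gids:
        return bool(mode & stat.S_IRGRP)
    return bool(mode & stat.S_IROTH)


def describe(path):
    """Return (uid, gid, octal mode string) for an existing file."""
    st = os.stat(path)
    return st.st_uid, st.st_gid, oct(stat.S_IMODE(st.st_mode))


# File owned by root:root (0:0); the reader is a hypothetical uid/gid 64.
print(can_read(0o600, 0, 0, uid=64, gids={64}))  # False
print(can_read(0o644, 0, 0, uid=64, gids={64}))  # True
```

Running describe() on the real job_queue.log on both the CE and the batch host would show whether the two setups disagree on ownership or mode.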

So I'm a bit stuck at the moment.

Thanks,
Chris.

On 08/05/2026, 01:49, "Geonmo" <geonmo@xxxxxxxxxxx> wrote:

Hi Chris,

To address the identity mismatch and Job Router errors you're seeing, I would suggest verifying the following configuration points.

1. Job Router and Permissions

First, please ensure that the Job Router on your HTCondor-CE is correctly pointing to the local Batch Schedd and has the necessary permissions to hand off jobs. Check if these are explicitly set in your condor-ce/conf.d:

# HTCondor-CE side
JOB_ROUTER_SCHEDD2_SPOOL = /var/lib/condor/spool
JOB_ROUTER_SCHEDD2_NAME = $(FULL_HOSTNAME)
JOB_ROUTER_SCHEDD2_POOL = $(FULL_HOSTNAME):9618

And on the Local Batch (HTCondor) side, ensure the Schedd allows the CE's routing process to impersonate the job owner:


# Local Batch side (required for mapping condor@<domain> -> local user, e.g. sgmcms55; this account must exist on the CE, LRMS and WNs)

QUEUE_SUPER_USER_MAY_IMPERSONATE = .*

2. SSL/VOMS Mapping (HTCondor v23 vs. v24)

Regarding the VOMS authentication issues, it is important to note that the way HTCondor handles SSL/certificate mapping changed significantly between v23 and v24+. In newer versions, the default behavior for mapping X.509 certificates has become stricter. It often requires comparing not just the DN but also additional attributes such as VOMS roles, which in turn often requires adding commas or specific formatting in your mapfiles that weren't necessary before.


You can find the detailed requirements and the new mapping logic in this EGI documentation: HTCondor and SSL authentication


Please check if your certificate DNs and roles are being mapped correctly under the new version's rules.

3. Authentication Method

Lastly, out of curiosity, is there a specific reason your site is prioritizing VOMS/SSL over SCITOKENS for this setup? 


Since many grid infrastructures are migrating toward tokens, knowing your requirements might help us suggest a more streamlined authentication path.


Hope this helps!


Best regards,

-- Geonmo


------ Original Message ------

From: Chris Brew - STFC UKRI via HTCondor-users <htcondor-users@xxxxxxxxxxx>

To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>

Cc: Chris Brew - STFC UKRI <chris.brew@xxxxxxxxxx>

Date: 2026-05-07 (Thu) 23:07:29

Subject: [HTCondor-users] Authentication Issue between HTCondorCE Schedd and Batch Schedd


Hi,

I've still not got anywhere with the VOMS authentication (I'll post some more info soon), but token auth seems to be working, in that jobs get into the condor-ce Schedd and are visible with condor_ce_q; however, they don't make it as far as the Schedd for the batch system.

I just copied the config of that from the config of the Schedd on our existing ArcCEs, so it's possible it's missing some necessary config for accepting jobs from the Job_Router.

I've got three recurring errors. One in the /var/log/condor-ce/JobRouterLog:

05/07/26 14:44:29 Failed to commit job submission :
05/07/26 14:44:29 JobRouter failure (src="" failed to submit job

Which is matched with this one in /var/log/condor/SchedLog:

05/07/26 14:44:29 (pid:923597) (bt:ccbf:13) SetEffectiveOwner: UserRec lookup for owner condor@xxxxxxxxxxx found no match
05/07/26 14:44:29 (pid:923597) Owner condor@xxxxxxxxxxx has no JobQueueUserRec
05/07/26 14:44:29 (pid:923597) Creating pending JobQueueUserRec for owner condor@xxxxxxxxxxx
05/07/26 14:44:29 (pid:923597) Error: MakeUserRec with illegal identifiers: user=condor@xxxxxxxxxxx, os_user=condor
05/07/26 14:44:29 (pid:923597) NewCluster(): failed to create new User record for condor@xxxxxxxxxxx

And then another more frequent one every ten seconds in /var/log/condor-ce/JobRouterLog:

05/07/26 14:47:09 Failed to open /var/lib/condor/spool/job_queue.log: errno=13

Which looks to me like the JobRouter is trying to put jobs into the queue as (the illegal) user condor rather than the accounts the tokens are mapped to in the condor-ce Schedd (they show up there as the correctly mapped local user).
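As a small aid for checking which identity the batch Schedd is rejecting, the Python sketch below pulls the user and os_user fields out of the "MakeUserRec with illegal identifiers" lines. The pattern is inferred only from the single SchedLog sample quoted above, not from HTCondor documentation, so other versions may format the line differently.

```python
import re

# Pattern inferred from the one SchedLog sample in this thread.
ILLEGAL_REC = re.compile(
    r"MakeUserRec with illegal identifiers: "
    r"user=(?P<user>[^,\s]+), os_user=(?P<os_user>\S+)"
)


def illegal_user_records(lines):
    """Yield (user, os_user) pairs from log lines that report an illegal record."""
    for line in lines:
        m = ILLEGAL_REC.search(line)
        if m:
            yield m.group("user"), m.group("os_user")


sample = [
    "05/07/26 14:44:29 (pid:923597) Owner condor@xxxxxxxxxxx has no JobQueueUserRec",
    "05/07/26 14:44:29 (pid:923597) Error: MakeUserRec with illegal identifiers: "
    "user=condor@xxxxxxxxxxx, os_user=condor",
]
print(list(illegal_user_records(sample)))
# [('condor@xxxxxxxxxxx', 'condor')]
```

Feeding it the full SchedLog (for example via open("/var/log/condor/SchedLog")) would show whether any submission ever arrives under a mapped account rather than condor.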

Does anyone have any idea where I should be looking?

Thanks,
Chris.