Hi Miguel,
IDTokens and authorization should not matter for this. The LocalCredd ClassAd attribute is set in the startd ad when the startd daemon starts and again whenever the system is reconfigured, while on the submit side LocalCredd is set at each submission. If the configuration changes between startup of the HTCondor system and job submission, the two values will differ, because the startd side does not update until a reconfig is issued. This could happen because a script file is used to generate part of the configuration, or simply because someone manually changed the config value. You can check whether a script file is setting up a portion of your config by running

condor_config_val -config

This shows all config file sources; any program or script file that contributes will end with a pipe '|'.
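If you want to automate that check across several hosts, here is a minimal sketch. It assumes only what is described above: that piped sources appear in the `condor_config_val -config` listing as entries ending with '|'. The function names are my own, not part of HTCondor, and the script needs `condor_config_val` on the PATH when run directly.

```python
import subprocess
from typing import List


def piped_config_sources(config_listing: str) -> List[str]:
    """Return the entries of a `condor_config_val -config` listing that end with '|'."""
    sources = []
    for line in config_listing.splitlines():
        entry = line.strip()
        # A trailing pipe marks a program/script whose output is read as config.
        if entry.endswith("|"):
            sources.append(entry)
    return sources


def list_piped_sources() -> List[str]:
    """Run `condor_config_val -config` locally and return its piped sources."""
    result = subprocess.run(
        ["condor_config_val", "-config"],
        capture_output=True,
        text=True,
        check=True,
    )
    return piped_config_sources(result.stdout)


if __name__ == "__main__":
    for source in list_piped_sources():
        print(source)
```

An empty result suggests no script contributes to the config on that host, in which case a manual change to the config value is the more likely explanation.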
Cheers,
Cole Bollig
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Miguel Garrido <miguel@xxxxxxxxx>
Sent: Tuesday, January 17, 2023 2:00 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] Jobs being refused by startd nodes because LocalCredd not in machine or slot ad, but is in the condor_status -long output

I recently set up a Condor 9.0.16 pool using IDTOKENS authentication for DAEMON communication, there is no shared pool password configured on any nodes. The pool consists of all Windows nodes with:

1. two hosts running the master, collector, negotiator, had, replication, and credd daemons
   DAEMON_LIST = MASTER COLLECTOR NEGOTIATOR HAD REPLICATION CREDD
2. one host running the master, and schedd daemons
   DAEMON_LIST = MASTER SCHEDD
3. multiple hosts running the master, and startd daemons
   DAEMON_LIST = MASTER STARTD

I've noticed that almost always if I restart all services in the pool, regardless of whether I restart the CM -> schedd -> startd services in order, or all at the same time, the startd nodes refuse to run jobs matched to them because the slot ad doesn't match the job requirements.

Analyzing the StartLog I noticed the slot ads are missing the LocalCredd attribute required by the job, however, the attribute does exist if you look at the output of condor_status -long for that machine and all its slots. Likewise, the matchmaker sees the same thing: condor_q -better-analyze shows a match on the attribute for the slots in the machine leading to a match, which is eventually rejected by the startd.

I don't know what is causing the startd daemon to omit its own LocalCredd in the slot ad it computes for accepting job candidates, when it is clear to the collector that the attribute exists for that node as evidenced by the condor_status -long output.

I did find an old thread from 17 Dec 2008 ("Windows Condor problems with credd and executing jobs as submitting user") where it was suggested to try issuing a condor_reconfig -all from a central manager for a similar issue. This fixed my issue as well: after all the daemons reevaluated their configuration, jobs started being accepted by the startd nodes pool wide.

I've not seen this behavior before in my existing 8.8 pools which use a pool password to communicate with each other, and I wonder if it might have anything to do with the pool authentication being IDTOKENS based instead of PASSWORD, or something else introduced with v9?

Thank you
~ MG