
Re: [HTCondor-users] Jobs being refused by startd nodes because LocalCredd not in machine or slot ad, but is in the condor_status -long output



Thank you, Cole.

The attribute is there according to condor_status. I also queried the daemon directly (via -direct) and it has the attribute as well. The config files have not changed since the daemon started, and per condor_config_val -config we don't have any scripts contributing to the configuration either.
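
For anyone repeating this check, here is a minimal sketch of how the two views could be compared offline. It assumes the output of condor_status -long (collector view) and of the -direct query has been saved as text, and that attributes appear one per line as `Attr = value`. The helper name `classad_attr` and the sample host names are hypothetical and only illustrate the idea.

```python
import re

def classad_attr(long_output, attr):
    """Return the value of attr from 'Attr = value' lines, or None if absent."""
    # ClassAd attribute names are case-insensitive; (?!=) skips '==' comparisons.
    pattern = re.compile(
        r"^\s*" + re.escape(attr) + r"\s*=(?!=)\s*(.*?)\s*$", re.IGNORECASE
    )
    for line in long_output.splitlines():
        match = pattern.match(line)
        if match:
            # String values are quoted in the long format; drop the quotes.
            return match.group(1).strip('"')
    return None

# Hypothetical dumps; real ones would come from saved command output.
collector_view = 'Name = "slot1@node01"\nLocalCredd = "credd.example.org"\n'
direct_view = 'Name = "slot1@node01"\nLocalCredd = "credd.example.org"\n'

same = classad_attr(collector_view, "LocalCredd") == classad_attr(direct_view, "LocalCredd")
print("LocalCredd agrees between the two views:", same)
```

This only compares what the daemons advertise. It says nothing about the ad the startd evaluates internally when it accepts or refuses a claim, which is the part in question here.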

Is there a specific log level I can enable on the startd to get a printout of its classads at startup? The StartLog output on job refusal shows the attribute is not known to the startd at the time it accepts jobs, which makes this confusing to troubleshoot on my end.

On Thu, Jan 19, 2023 at 17:32 Cole Bollig via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:
Hi Miguel,

IDTokens and authorization should not matter for this. The LocalCredd classad attribute is set into the startd ad when the startd daemon starts and again upon a reconfig of the system, while on the submit side LocalCredd is set for each submit. If the configuration changes between startup of the condor system and job submission, the values would differ, since the startd side doesn't update unless a reconfig is called. This could happen due to a script file used for configuration or just someone manually changing the config value. You can check whether a script file sets up a portion of your config by running condor_config_val -config. This will show all sources of config files, and all programs/script files that contribute will end with a pipe '|'.
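
The pipe check described above can be sketched as a small filter over saved condor_config_val -config output. This is only an illustration: the function name `piped_config_sources` is hypothetical, and the sample listing is made up rather than copied from a real pool.

```python
def piped_config_sources(config_listing):
    """Return the listed config sources that end with a pipe '|'."""
    piped = []
    for line in config_listing.splitlines():
        entry = line.strip()
        # A trailing '|' marks a program or script contributing to the config.
        if entry.endswith("|"):
            piped.append(entry)
    return piped

# Made-up listing for illustration; paths and layout are assumptions.
sample_listing = "\n".join([
    "Configuration source:",
    r"    C:\condor\condor_config",
    "Local configuration sources:",
    r"    C:\condor\make_config.bat |",
])

for source in piped_config_sources(sample_listing):
    print("script-generated config source:", source)
```

An empty result suggests that no script contributes to the configuration, so the values could only have changed through a manual edit of the files.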

Cheers,
Cole Bollig

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Miguel Garrido <miguel@xxxxxxxxx>
Sent: Tuesday, January 17, 2023 2:00 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] Jobs being refused by startd nodes because LocalCredd not in machine or slot ad, but is in the condor_status -long output

I recently set up a Condor 9.0.16 pool using IDTOKENS authentication
for DAEMON communication; there is no shared pool password configured
on any nodes.

The pool consists of all Windows nodes with:

1. two hosts running the master, collector, negotiator, had,
replication, and credd daemons
 DAEMON_LIST = MASTER COLLECTOR NEGOTIATOR HAD REPLICATION CREDD
2. one host running the master and schedd daemons
 DAEMON_LIST = MASTER SCHEDD
3. multiple hosts running the master and startd daemons
 DAEMON_LIST = MASTER STARTD

I've noticed that almost always if I restart all services in the pool,
regardless of whether I restart the CM -> schedd -> startd services in
order, or all at the same time, the startd nodes refuse to run jobs
matched to them because the slot ad doesn't match the job
requirements. Analyzing the StartLog I noticed the slot ads are
missing the LocalCredd attribute required by the job, however, the
attribute does exist if you look at the output of condor_status -long
for that machine and all its slots. Likewise, the matchmaker sees the
same thing: condor_q -better-analyze shows a match on the attribute
for the slots in the machine leading to a match, which is eventually
rejected by the startd.

I don't know what is causing the startd daemon to omit its own
LocalCredd in the slot ad it computes for accepting job candidates,
when it is clear to the collector that the attribute exists for that
node as evidenced by the condor_status -long output. I did find an old
thread from 17 Dec 2008 ("Windows Condor problems with credd and
executing jobs as submitting user") where it was suggested to try
issuing a condor_reconfig -all from a central manager for a similar
issue. This fixed my issue as well: after all the daemons reevaluated
their configuration, jobs started being accepted by the startd nodes
pool wide.

I've not seen this behavior before in my existing 8.8 pools which use
a pool password to communicate with each other, and I wonder if it
might have anything to do with the pool authentication being IDTOKENS
based instead of PASSWORD, or something else introduced with v9?

Thank you

~
MG
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/
--
MG