
[HTCondor-users] Jobs being refused by startd nodes because LocalCredd not in machine or slot ad, but is in the condor_status -long output



I recently set up a Condor 9.0.16 pool using IDTOKENS authentication
for DAEMON communication; there is no shared pool password configured
on any of the nodes.
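
Paraphrasing the security knobs involved (the exact lines on my nodes
may differ slightly, but the gist is IDTOKENS only, with no PASSWORD
method and no pool password file):

  SEC_DEFAULT_AUTHENTICATION         = REQUIRED
  SEC_DEFAULT_AUTHENTICATION_METHODS = IDTOKENS
  SEC_DAEMON_AUTHENTICATION_METHODS  = IDTOKENS
  # no SEC_PASSWORD_FILE / pool password set anywhere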

The pool consists entirely of Windows nodes:

1. two hosts running the master, collector, negotiator, had,
replication, and credd daemons
  DAEMON_LIST = MASTER COLLECTOR NEGOTIATOR HAD REPLICATION CREDD
2. one host running the master and schedd daemons
  DAEMON_LIST = MASTER SCHEDD
3. multiple hosts running the master and startd daemons
  DAEMON_LIST = MASTER STARTD
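
For completeness, the credd-related settings follow the usual Windows
run_as_owner recipe; roughly (paraphrased from memory, so treat the
exact lines as approximate):

  CREDD_HOST                = <one of the two CM/credd hosts>
  CREDD_CACHE_LOCALLY       = True
  STARTER_ALLOW_RUNAS_OWNER = True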

I've noticed that almost every time I restart all services in the
pool, regardless of whether I restart the CM -> schedd -> startd
services in order or all at the same time, the startd nodes refuse to
run jobs matched to them because the slot ad doesn't satisfy the job
requirements. Looking at the StartLog, I noticed the slot ads are
missing the LocalCredd attribute required by the job; however, the
attribute does exist in the output of condor_status -long for that
machine and all of its slots. The matchmaker sees the same thing:
condor_q -better-analyze shows the attribute matching for the slots on
that machine, leading to a match, which is then rejected by the
startd.
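
For example, checking one of the execute hosts from the CM (the
hostname and job id below are placeholders):

  condor_status -long <execute-host> | findstr /i LocalCredd
  condor_q -better-analyze <job-id>

The first command returns a LocalCredd line for the machine ad and for
every slot ad, and the second reports the slots on that host as
matching the job, yet the startd still rejects the match.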

I don't know what is causing the startd daemon to omit its own
LocalCredd attribute from the slot ad it evaluates when deciding
whether to accept job candidates, when the collector clearly has the
attribute for that node, as evidenced by the condor_status -long
output. I did find an old thread from 17 Dec 2008 ("Windows Condor
problems with credd and executing jobs as submitting user") where it
was suggested to issue a condor_reconfig -all from a central manager
for a similar issue. This fixed my issue as well: after all the
daemons re-evaluated their configuration, jobs started being accepted
by the startd nodes pool-wide.
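
In other words, running this once from one of the central managers was
enough to clear it up:

  condor_reconfig -all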

I've not seen this behavior in my existing 8.8 pools, which use a pool
password for daemon-to-daemon communication, and I wonder whether it
might have anything to do with the pool authentication being
IDTOKENS-based instead of PASSWORD, or with something else introduced
in v9?

Thank you

~
MG