Thanks for helping me get this working yesterday, Greg.
For others who may have a similar problem, I rebased my git repo for clarity and have a basic functioning example tagged "v1" at https://gitlab.com/manning-ncsa/htcondor-helm-chart/-/tree/v1.
There were several configuration options I had to change from defaults in order to get the central manager, access point, and worker communicating:
```
USE_POOL_PASSWORD=yes
ALLOW_WRITE=*
ALLOW_NEGOTIATOR=*
SEC_DEFAULT_AUTHENTICATION_METHODS = FS, IDTOKENS, PASSWORD
```
where the pool password is generated randomly in this Helm chart version upon deployment and installed to the default location (`/etc/condor/passwords.d/POOL`) by the [container initialization scripts](https://github.com/htcondor/htcondor/blob/main/build/docker/services/base/update-secrets).
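As a quick sanity check, something like the sketch below can confirm that all four settings are actually present in the mounted config text. The function name `missing_settings` is my own invention for illustration. It only scans the text it is given, so it says nothing about values set by other files under `/etc/condor/config.d`; running `condor_config_val` inside the pod is the better way to see what a daemon will actually use.

```shell
# Hypothetical helper: read HTCondor config text on stdin and print the name
# of each expected setting that is never assigned in that text.
missing_settings() {
  local conf knob
  conf="$(cat)"
  for knob in USE_POOL_PASSWORD ALLOW_WRITE ALLOW_NEGOTIATOR \
              SEC_DEFAULT_AUTHENTICATION_METHODS; do
    # HTCondor config names are case-insensitive, hence grep -i.
    printf '%s\n' "$conf" |
      grep -Eiq "^[[:space:]]*${knob}[[:space:]]*=" || echo "$knob"
  done
}

# Example use against the central manager pod (prints nothing if all are set):
#   kubectl exec -n htcondor htcondor-cm-0 -- \
#     cat /etc/condor/condor_config.local | missing_settings
```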
Additionally, as noted in the README, you need to submit jobs as user `submituser` (UID 1000) like so:
```
kubectl exec -it -n htcondor htcondor-submit-0 -- bash -c ' \
runuser submituser bash -c " \
cd /tmp/sleep_test && condor_submit sleep.sub \
"'
```
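If the nested quoting gets unwieldy, a small wrapper may help. This is only a sketch: `submit_as` is a hypothetical name, it uses the `runuser -u USER -- COMMAND` form rather than the form shown above, and I have not tested it against this chart. The namespace and pod name are the ones used in this thread.

```shell
# Hypothetical wrapper: run condor_submit inside the access point pod as an
# unprivileged user. Directory and submit file are passed as positional
# arguments to the inner shell, so no nested quoting is needed.
submit_as() {
  local user="$1" dir="$2" subfile="$3"
  kubectl exec -n htcondor htcondor-submit-0 -- \
    runuser -u "$user" -- \
    bash -c 'cd "$1" && condor_submit "$2"' _ "$dir" "$subfile"
}

# Example:
#   submit_as submituser /tmp/sleep_test sleep.sub
```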
On 2/6/24 11:10, Daues, Gregory Edward wrote:
It can be challenging to debug a fresh setup, but I often start like so:
if a job sits in Idle after submission, I check the NegotiatorLog
to see if there is a match. If there is a match, then I expect the two sides,
the Schedd on the schedd pod (writing SchedLog) and the Startd on the worker
(writing StartLog), to be logging information about their side of the match,
perhaps describing communication or authentication issues that may be
occurring.
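For anyone following that workflow, a filter along these lines can cut each log down to the lines that point at authentication, authorization, or socket trouble. It is a rough sketch: `auth_problems` is a made-up name, the patterns are simply the message prefixes that appear in the CollectorLog quoted later in this thread, and the assumption that NegotiatorLog, SchedLog, and StartLog sit next to CollectorLog in `/var/log/condor` should be checked on your own pods.

```shell
# Hypothetical filter: keep only log lines that mention authentication,
# authorization, or socket read/write failures. Reads a daemon log on stdin.
auth_problems() {
  grep -E 'PERMISSION DENIED|AUTHENTICATE|SECMAN|condor_(read|write)\(\)'
}

# Example (log path is an assumption, see above):
#   kubectl exec -n htcondor htcondor-cm-0 -- \
#     cat /var/log/condor/NegotiatorLog | auth_problems
```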
A general question I ask about Kubernetes setups: which user is
submitting the job and expected to run it on the worker? That is,
does a suitable user exist in both the schedd and worker pods?
It looks like the 'condor' user submitted the job below; I am not sure
whether that is a suitable user for running a job. Others would have to comment on that.
Greg
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of T. Andrew Manning <manninga@xxxxxxxxxxxx>
Sent: Monday, February 5, 2024 4:31 PM
To: htcondor-users@xxxxxxxxxxx <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] HTCondor Helm chart
I am attempting to deploy HTCondor on Kubernetes using the official
Docker images by [constructing a Helm
chart](https://gitlab.com/manning-ncsa/htcondor-helm-chart/). Currently
this Helm chart is a work-in-progress draft that I published only to
solicit feedback. (Don't worry about the password in
`templates/secrets.yaml`; there is no ingress to the cluster and I will
replace the password when it matters.)
The three pods are able to start, and I can submit a job via the access
point (i.e. the "submit" pod), but the job remains in an Idle state.
```
$ kubectl exec -it -n htcondor htcondor-submit-0 -- bash -c 'cd /tmp/htcondor && condor_submit sleep.sub'
Submitting job(s).
1 job(s) submitted to cluster 1.
$ kubectl exec -it -n htcondor htcondor-submit-0 -- bash -c 'condor_status'
Name                                    OpSys  Arch    State      Activity  LoadAv  Mem   ActvtyTime
slot1@htcondor-worker-64d86c7497-f5wsp  LINUX  X86_64  Unclaimed  Idle      0.000   1024  0+00:00:00
              Total  Owner  Claimed  Unclaimed  Matched  Preempting  Drain  Backfill  BkIdle
X86_64/LINUX      1      0        0          1        0           0      0         0       0
       Total      1      0        0          1        0           0      0         0       0
$ kubectl exec -it -n htcondor htcondor-submit-0 -- bash -c 'condor_q'
-- Schedd: htcondor-submit-0.htcondor-submit.htcondor.svc.cluster.local : <10.42.96.15:38553?... @ 02/05/24 21:55:37
OWNER   BATCH_NAME  SUBMITTED   DONE  RUN  IDLE  TOTAL  JOB_IDS
condor  ID: 1       2/5 21:55   _     _    1     1      1.0
Total for query: 1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended
Total for all users: 1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended
```
When I look at the CollectorLog on the Central Manager (cm):
```
$ kubectl exec -it -n htcondor htcondor-cm-0 -- bash -c 'less /var/log/condor/CollectorLog'
```
I see a mix of PERMISSION_DENIED messages and messages about
communication timeouts and failed writes that look like network failures:
```
$ kubectl exec -it -n htcondor htcondor-cm-0 -- bash -c 'cat /var/log/condor/CollectorLog'
02/05/24 21:54:52 Setting maximum file descriptors to 10240.
02/05/24 21:54:52 ******************************************************
02/05/24 21:54:52 ** condor_collector (CONDOR_COLLECTOR) STARTING UP
02/05/24 21:54:52 ** /usr/sbin/condor_collector
02/05/24 21:54:52 ** SubsystemInfo: name=COLLECTOR type=COLLECTOR(2) class=DAEMON(1)
02/05/24 21:54:52 ** Configuration: subsystem:COLLECTOR local:<NONE> class:DAEMON
02/05/24 21:54:52 ** $CondorVersion: 23.0.3 2024-01-04 BuildID: 700474 PackageID: 23.0.3-1 $
02/05/24 21:54:52 ** $CondorPlatform: x86_64_AlmaLinux8 $
02/05/24 21:54:52 ** PID = 32
02/05/24 21:54:52 ** Log last touched time unavailable (No such file or directory)
02/05/24 21:54:52 ******************************************************
02/05/24 21:54:52 Using config source: /etc/condor/condor_config
02/05/24 21:54:52 Using local config sources:
02/05/24 21:54:52 /etc/condor/config.d/00-htcondor-9.0.config
02/05/24 21:54:52 /etc/condor/config.d/01-env.conf
02/05/24 21:54:52 /etc/condor/config.d/01-misc.conf
02/05/24 21:54:52 /etc/condor/config.d/01-role.conf
02/05/24 21:54:52 /etc/condor/config.d/01-security.conf
02/05/24 21:54:52 /etc/condor/config.d/10-stash-plugin.conf
02/05/24 21:54:52 /etc/condor/condor_config.local
02/05/24 21:54:52 config Macros = 83, Sorted = 83, StringBytes = 2826, TablesBytes = 3084
02/05/24 21:54:52 CLASSAD_CACHING is ENABLED
02/05/24 21:54:52 Daemon Log is logging: D_ALWAYS D_ERROR D_STATUS
02/05/24 21:54:52 SharedPortEndpoint: waiting for connections to named socket collector
02/05/24 21:54:52 DaemonCore: non-shared command socket at <10.42.96.16:46443?alias=htcondor-cm-0.htcondor-cm.htcondor.svc.cluster.local>
02/05/24 21:54:52 Daemoncore: Listening at <0.0.0.0:46443> on TCP (ReliSock) and UDP (SafeSock).
02/05/24 21:54:52 DaemonCore: command socket at <10.42.96.16:9618?addrs=10.42.96.16-9618&alias=htcondor-cm-0.htcondor-cm.htcondor.svc.cluster.local&noUDP&sock=collector>
02/05/24 21:54:52 DaemonCore: private command socket at <10.42.96.16:9618?addrs=10.42.96.16-9618&alias=htcondor-cm-0.htcondor-cm.htcondor.svc.cluster.local&noUDP&sock=collector>
02/05/24 21:54:52 In ViewServer::Init()
02/05/24 21:54:52 In CollectorDaemon::Init()
02/05/24 21:54:52 In ViewServer::Config()
02/05/24 21:54:52 In CollectorDaemon::Config()
02/05/24 21:54:52 COLLECTOR_GETAD_OPTIONS set to fast lazy-parse (0x30)
02/05/24 21:54:52 ABSENT_REQUIREMENTS = None
02/05/24 21:54:52 OfflineCollectorPlugin::configure: no persistent store was defined for off-line ads.
02/05/24 21:54:52 enable: Creating stats hash table
02/05/24 21:54:52 Enabling CCB Server.
02/05/24 21:54:52 Will generate a bootstrap file.
02/05/24 21:54:52 Will generate a new certificate file.
02/05/24 21:54:53 CollectorAd : Inserting ** "< My Pool - 10.43.222.99@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx >"
02/05/24 21:55:13 condor_read(): timeout reading 5 bytes from collector 10.43.222.99.
02/05/24 21:55:13 IO: Failed to read packet header
02/05/24 21:55:13 SECMAN: no classad from server, failing
02/05/24 21:55:13 ERROR: SECMAN:2007:Failed to end classad message.
02/05/24 21:55:13 Failed to send update to collector 10.43.222.99.
02/05/24 21:55:13 Unable to send UPDATE_COLLECTOR_AD to all configured collectors
02/05/24 21:55:13 condor_write(): Socket closed when trying to write 565 bytes to <10.42.240.0:21038>, fd is 14
02/05/24 21:55:13 Buf::write(): condor_write() failed
02/05/24 21:55:13 SECMAN: Error sending response classad to <10.42.240.0:21038>!
AuthMethods = "FS,TOKEN,PASSWORD"
Authentication = "REQUIRED"
Command = 19
ConnectSinful = "<10.43.222.99:9618>"
CryptoMethods = "AES,BLOWFISH,3DES"
ECDHPublicKey = "BILzoNAGQHRbtX38YiEBurMAY3L9pVl1DUVwY61Buf5GMX9An4MI9lzfxJgH2LUuuSfQfKJj6sXnKEay3zurtc8="
Enact = "NO"
Encryption = "REQUIRED"
Integrity = "REQUIRED"
IssuerKeys = "POOL"
NegotiatedSession = true
NewSession = "YES"
OutgoingNegotiation = "REQUIRED"
ParentUniqueID = "htcondor-cm-0:29:1707170091"
RemoteVersion = "$CondorVersion: 23.0.3 2024-01-04 BuildID: 700474 PackageID: 23.0.3-1 $"
ServerCommandSock = "<10.42.96.16:9618?addrs=10.42.96.16-9618&alias=htcondor-cm-0.htcondor-cm.htcondor.svc.cluster.local&noUDP&sock=collector>"
ServerPid = 32
SessionDuration = "86400"
SessionLease = 3600
Subsystem = "COLLECTOR"
TrustDomain = "htcondor-cm-0.htcondor-cm.htcondor.svc.cluster.local"
02/05/24 21:55:13 condor_write(): Socket closed when trying to write 13 bytes to <10.42.96.15:33973>, fd is 13
02/05/24 21:55:13 Buf::write(): condor_write() failed
02/05/24 21:55:13 AUTHENTICATE: handshake failed!
02/05/24 21:55:13 DC_AUTHENTICATE: required authentication of 10.42.96.15 failed: AUTHENTICATE:1002:Failure performing handshake
02/05/24 21:55:13 WARNING: forward resolution of _gateway doesn't match 10.42.240.0!
02/05/24 21:55:13 PERMISSION DENIED to condor@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx from host 10.42.240.0 for command 10 (QUERY_STARTD_PVT_ADS), access level NEGOTIATOR: reason: NEGOTIATOR authorization policy contains no matching ALLOW entry for this request; identifiers used for this host: 10.42.240.0, hostname size = 0, original ip address = 10.42.240.0
02/05/24 21:55:13 DC_AUTHENTICATE: Command not authorized, done!
02/05/24 21:55:13 WARNING: forward resolution of _gateway doesn't match 10.42.240.0!
02/05/24 21:55:13 MasterAd : Inserting ** "< htcondor-cm-0.htcondor-cm.htcondor.svc.cluster.local >"
02/05/24 21:55:13 PERMISSION DENIED to condor@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx from host 10.42.240.0 for command 49 (UPDATE_NEGOTIATOR_AD), access level NEGOTIATOR: reason: cached result for NEGOTIATOR; see first case for the full reason
02/05/24 21:55:13 DC_AUTHENTICATE: Command not authorized, done!
02/05/24 21:55:13 MasterAd : Inserting ** "< htcondor-submit-0.htcondor-submit.htcondor.svc.cluster.local >"
02/05/24 21:55:13 MasterAd : Inserting ** "< htcondor-worker-64d86c7497-f5wsp >"
02/05/24 21:55:13 StartdAd : Inserting ** "< slot1@htcondor-worker-64d86c7497-f5wsp >"
02/05/24 21:55:13 StartdPvtAd : Inserting ** "< slot1@htcondor-worker-64d86c7497-f5wsp >"
02/05/24 21:55:23 PERMISSION DENIED to condor@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx from host 10.42.240.0 for command 10 (QUERY_STARTD_PVT_ADS), access level NEGOTIATOR: reason: cached result for NEGOTIATOR; see first case for the full reason
02/05/24 21:55:23 DC_AUTHENTICATE: Command not authorized, done!
...
```
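A log like this is easier to read once the denials are tallied. The sketch below counts `PERMISSION DENIED` entries by access level and command name; `denied_summary` is a hypothetical name, and it expects the log as written on disk (one entry per line), not a copy that a mail client has re-wrapped. In the excerpt above every denial is at the NEGOTIATOR access level, which seems consistent with the missing ALLOW entry named in the first denial's reason text.

```shell
# Hypothetical helper: summarize PERMISSION DENIED entries from a daemon log
# on stdin as "<count> <access level> <command name>".
denied_summary() {
  sed -nE 's/.*PERMISSION DENIED.* for command [0-9]+ \(([A-Z_]+)\), access level ([A-Z_]+).*/\2 \1/p' |
    sort | uniq -c | awk '{print $1, $2, $3}'
}

# Example:
#   kubectl exec -n htcondor htcondor-cm-0 -- \
#     cat /var/log/condor/CollectorLog | denied_summary
```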
My custom configuration should be fully captured in
`templates/configmap.yaml` which is mounted to
`/etc/condor/condor_config.local` in each container.
Any assistance in how to proceed with debugging this would be appreciated.
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/