A fresh setup can be challenging to debug, but this is where I usually start.
If a job sits in Idle after submission, I check the NegotiatorLog to see whether there was a match. If there was, I expect both sides of the match, the Schedd on the schedd pod (writing SchedLog) and the Startd on the worker (writing StartLog), to log information about it, including any communication or authentication problems that may be occurring.
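A rough sketch of that first pass, not a definitive recipe. The namespace and the cm/submit pod names are taken from the commands in the quoted message; the log directory is assumed to be the same `/var/log/condor` that holds the CollectorLog there, and the helper names `condor_log_tail` and `triage_idle_job` are made up for illustration.

```shell
# Assumptions: namespace and pod names come from the quoted message;
# /var/log/condor is assumed to hold all daemon logs, as it does CollectorLog.
# Set KUBECTL="echo kubectl" to print the commands instead of running them.
KUBECTL=${KUBECTL:-kubectl}
NS=${NS:-htcondor}

condor_log_tail() {
  # $1 = pod name, $2 = log file name; show the last 50 lines of that log
  $KUBECTL exec -n "$NS" "$1" -- tail -n 50 "/var/log/condor/$2"
}

triage_idle_job() {
  # $1 = worker pod name (it changes whenever the worker pod is recreated)
  condor_log_tail htcondor-cm-0 NegotiatorLog      # was a match made at all?
  condor_log_tail htcondor-submit-0 SchedLog       # schedd side of the match
  condor_log_tail "$1" StartLog                    # startd side of the match
}
```

For example, `triage_idle_job htcondor-worker-64d86c7497-f5wsp` would walk the three logs in the order described above.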
A general point I have observed about Kubernetes setups: which user is submitting the job, and which user is expected to run it on the worker? That is, does a suitable user exist in both the schedd and worker pods? It looks like the 'condor' user submitted the job below. I am not sure that is a suitable user for running a job; others would have to comment on that.
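One way to check this, as a sketch only. The namespace and submit pod name are taken from the quoted message; the helper names `user_in_pod` and `job_owners` are hypothetical, and whether the account actually has to exist on the worker depends on how the pool is configured.

```shell
# Assumptions: namespace and submit pod name come from the quoted message.
# Set KUBECTL="echo kubectl" to print the commands instead of running them.
KUBECTL=${KUBECTL:-kubectl}
NS=${NS:-htcondor}

user_in_pod() {
  # $1 = pod name, $2 = user name; `id` exits non-zero if the user is unknown
  $KUBECTL exec -n "$NS" "$1" -- id "$2"
}

job_owners() {
  # Print the Owner attribute of each queued job, for all users
  $KUBECTL exec -n "$NS" htcondor-submit-0 -- condor_q -allusers -af Owner
}
```

Running `job_owners` shows who the schedd thinks owns each job, and `user_in_pod <pod> <owner>` can then be run against both the schedd pod and the worker pod to see whether that account exists in each.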
Greg
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of T. Andrew Manning <manninga@xxxxxxxxxxxx>
Sent: Monday, February 5, 2024 4:31 PM
To: htcondor-users@xxxxxxxxxxx <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] HTCondor Helm chart

I am attempting to deploy HTCondor on Kubernetes using the official
Docker images by [constructing a Helm chart](https://gitlab.com/manning-ncsa/htcondor-helm-chart/). Currently this Helm chart is a draft, work-in-progress that I only published to solicit feedback. (Don't worry about the password in `templates/secrets.yaml`; there is no ingress to the cluster and I will replace the password when it matters). The three pods are able to start, and I can submit a job via the access point (i.e. the "submit" pod), but the job remains in an Idle state.

```
$ kubectl exec -it -n htcondor htcondor-submit-0 -- bash -c 'cd /tmp/htcondor && condor_submit sleep.sub'
Submitting job(s).
1 job(s) submitted to cluster 1.

$ kubectl exec -it -n htcondor htcondor-submit-0 -- bash -c 'condor_status'
Name                                   OpSys Arch   State     Activity LoadAv Mem  ActvtyTime
slot1@htcondor-worker-64d86c7497-f5wsp LINUX X86_64 Unclaimed Idle     0.000  1024 0+00:00:00

             Total Owner Claimed Unclaimed Matched Preempting Drain Backfill BkIdle
X86_64/LINUX     1     0       0         1       0          0     0        0      0
       Total     1     0       0         1       0          0     0        0      0

$ kubectl exec -it -n htcondor htcondor-submit-0 -- bash -c 'condor_q'
-- Schedd: htcondor-submit-0.htcondor-submit.htcondor.svc.cluster.local : <10.42.96.15:38553?... @ 02/05/24 21:55:37
OWNER  BATCH_NAME SUBMITTED  DONE RUN IDLE TOTAL JOB_IDS
condor ID: 1      2/5 21:55     _   _    1     1 1.0

Total for query: 1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended
Total for all users: 1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended
```

When I look at the CollectorLog on the Central Manager (cm):

```
$ kubectl exec -it -n htcondor htcondor-cm-0 -- bash -c 'less /var/log/condor/CollectorLog'
```

I see a mix of PERMISSION_DENIED messages and messages about communication timeouts and failed writes that look like network failures:

```
$ kubectl exec -it -n htcondor htcondor-cm-0 -- bash -c 'cat /var/log/condor/CollectorLog'
02/05/24 21:54:52 Setting maximum file descriptors to 10240.
02/05/24 21:54:52 ******************************************************
02/05/24 21:54:52 ** condor_collector (CONDOR_COLLECTOR) STARTING UP
02/05/24 21:54:52 ** /usr/sbin/condor_collector
02/05/24 21:54:52 ** SubsystemInfo: name=COLLECTOR type=COLLECTOR(2) class=DAEMON(1)
02/05/24 21:54:52 ** Configuration: subsystem:COLLECTOR local:<NONE> class:DAEMON
02/05/24 21:54:52 ** $CondorVersion: 23.0.3 2024-01-04 BuildID: 700474 PackageID: 23.0.3-1 $
02/05/24 21:54:52 ** $CondorPlatform: x86_64_AlmaLinux8 $
02/05/24 21:54:52 ** PID = 32
02/05/24 21:54:52 ** Log last touched time unavailable (No such file or directory)
02/05/24 21:54:52 ******************************************************
02/05/24 21:54:52 Using config source: /etc/condor/condor_config
02/05/24 21:54:52 Using local config sources:
02/05/24 21:54:52 /etc/condor/config.d/00-htcondor-9.0.config
02/05/24 21:54:52 /etc/condor/config.d/01-env.conf
02/05/24 21:54:52 /etc/condor/config.d/01-misc.conf
02/05/24 21:54:52 /etc/condor/config.d/01-role.conf
02/05/24 21:54:52 /etc/condor/config.d/01-security.conf
02/05/24 21:54:52 /etc/condor/config.d/10-stash-plugin.conf
02/05/24 21:54:52 /etc/condor/condor_config.local
02/05/24 21:54:52 config Macros = 83, Sorted = 83, StringBytes = 2826, TablesBytes = 3084
02/05/24 21:54:52 CLASSAD_CACHING is ENABLED
02/05/24 21:54:52 Daemon Log is logging: D_ALWAYS D_ERROR D_STATUS
02/05/24 21:54:52 SharedPortEndpoint: waiting for connections to named socket collector
02/05/24 21:54:52 DaemonCore: non-shared command socket at <10.42.96.16:46443?alias=htcondor-cm-0.htcondor-cm.htcondor.svc.cluster.local>
02/05/24 21:54:52 Daemoncore: Listening at <0.0.0.0:46443> on TCP (ReliSock) and UDP (SafeSock).
02/05/24 21:54:52 DaemonCore: command socket at <10.42.96.16:9618?addrs=10.42.96.16-9618&alias=htcondor-cm-0.htcondor-cm.htcondor.svc.cluster.local&noUDP&sock=collector>
02/05/24 21:54:52 DaemonCore: private command socket at <10.42.96.16:9618?addrs=10.42.96.16-9618&alias=htcondor-cm-0.htcondor-cm.htcondor.svc.cluster.local&noUDP&sock=collector>
02/05/24 21:54:52 In ViewServer::Init()
02/05/24 21:54:52 In CollectorDaemon::Init()
02/05/24 21:54:52 In ViewServer::Config()
02/05/24 21:54:52 In CollectorDaemon::Config()
02/05/24 21:54:52 COLLECTOR_GETAD_OPTIONS set to fast lazy-parse (0x30)
02/05/24 21:54:52 ABSENT_REQUIREMENTS = None
02/05/24 21:54:52 OfflineCollectorPlugin::configure: no persistent store was defined for off-line ads.
02/05/24 21:54:52 enable: Creating stats hash table
02/05/24 21:54:52 Enabling CCB Server.
02/05/24 21:54:52 Will generate a bootstrap file.
02/05/24 21:54:52 Will generate a new certificate file.
02/05/24 21:54:53 CollectorAd : Inserting ** "< My Pool - 10.43.222.99@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx >"
02/05/24 21:55:13 condor_read(): timeout reading 5 bytes from collector 10.43.222.99.
02/05/24 21:55:13 IO: Failed to read packet header
02/05/24 21:55:13 SECMAN: no classad from server, failing
02/05/24 21:55:13 ERROR: SECMAN:2007:Failed to end classad message.
02/05/24 21:55:13 Failed to send update to collector 10.43.222.99.
02/05/24 21:55:13 Unable to send UPDATE_COLLECTOR_AD to all configured collectors
02/05/24 21:55:13 condor_write(): Socket closed when trying to write 565 bytes to <10.42.240.0:21038>, fd is 14
02/05/24 21:55:13 Buf::write(): condor_write() failed
02/05/24 21:55:13 SECMAN: Error sending response classad to <10.42.240.0:21038>!
AuthMethods = "FS,TOKEN,PASSWORD"
Authentication = "REQUIRED"
Command = 19
ConnectSinful = "<10.43.222.99:9618>"
CryptoMethods = "AES,BLOWFISH,3DES"
ECDHPublicKey = "BILzoNAGQHRbtX38YiEBurMAY3L9pVl1DUVwY61Buf5GMX9An4MI9lzfxJgH2LUuuSfQfKJj6sXnKEay3zurtc8="
Enact = "NO"
Encryption = "REQUIRED"
Integrity = "REQUIRED"
IssuerKeys = "POOL"
NegotiatedSession = true
NewSession = "YES"
OutgoingNegotiation = "REQUIRED"
ParentUniqueID = "htcondor-cm-0:29:1707170091"
RemoteVersion = "$CondorVersion: 23.0.3 2024-01-04 BuildID: 700474 PackageID: 23.0.3-1 $"
ServerCommandSock = "<10.42.96.16:9618?addrs=10.42.96.16-9618&alias=htcondor-cm-0.htcondor-cm.htcondor.svc.cluster.local&noUDP&sock=collector>"
ServerPid = 32
SessionDuration = "86400"
SessionLease = 3600
Subsystem = "COLLECTOR"
TrustDomain = "htcondor-cm-0.htcondor-cm.htcondor.svc.cluster.local"
02/05/24 21:55:13 condor_write(): Socket closed when trying to write 13 bytes to <10.42.96.15:33973>, fd is 13
02/05/24 21:55:13 Buf::write(): condor_write() failed
02/05/24 21:55:13 AUTHENTICATE: handshake failed!
02/05/24 21:55:13 DC_AUTHENTICATE: required authentication of 10.42.96.15 failed: AUTHENTICATE:1002:Failure performing handshake
02/05/24 21:55:13 WARNING: forward resolution of _gateway doesn't match 10.42.240.0!
02/05/24 21:55:13 PERMISSION DENIED to condor@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx from host 10.42.240.0 for command 10 (QUERY_STARTD_PVT_ADS), access level NEGOTIATOR: reason: NEGOTIATOR authorization policy contains no matching ALLOW entry for this request; identifiers used for this host: 10.42.240.0, hostname size = 0, original ip address = 10.42.240.0
02/05/24 21:55:13 DC_AUTHENTICATE: Command not authorized, done!
02/05/24 21:55:13 WARNING: forward resolution of _gateway doesn't match 10.42.240.0!
02/05/24 21:55:13 MasterAd : Inserting ** "< htcondor-cm-0.htcondor-cm.htcondor.svc.cluster.local >"
02/05/24 21:55:13 PERMISSION DENIED to condor@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx from host 10.42.240.0 for command 49 (UPDATE_NEGOTIATOR_AD), access level NEGOTIATOR: reason: cached result for NEGOTIATOR; see first case for the full reason
02/05/24 21:55:13 DC_AUTHENTICATE: Command not authorized, done!
02/05/24 21:55:13 MasterAd : Inserting ** "< htcondor-submit-0.htcondor-submit.htcondor.svc.cluster.local >"
02/05/24 21:55:13 MasterAd : Inserting ** "< htcondor-worker-64d86c7497-f5wsp >"
02/05/24 21:55:13 StartdAd : Inserting ** "< slot1@htcondor-worker-64d86c7497-f5wsp >"
02/05/24 21:55:13 StartdPvtAd : Inserting ** "< slot1@htcondor-worker-64d86c7497-f5wsp >"
02/05/24 21:55:23 PERMISSION DENIED to condor@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx from host 10.42.240.0 for command 10 (QUERY_STARTD_PVT_ADS), access level NEGOTIATOR: reason: cached result for NEGOTIATOR; see first case for the full reason
02/05/24 21:55:23 DC_AUTHENTICATE: Command not authorized, done!
...
```

My custom configuration should be fully captured in `templates/configmap.yaml` which is mounted to `/etc/condor/condor_config.local` in each container. Any assistance in how to proceed with debugging this would be appreciated.