[HTCondor-users] HTCondor Helm chart
- Date: Mon, 5 Feb 2024 16:31:19 -0600
- From: "T. Andrew Manning" <manninga@xxxxxxxxxxxx>
- Subject: [HTCondor-users] HTCondor Helm chart
I am attempting to deploy HTCondor on Kubernetes using the official
Docker images by [constructing a Helm
chart](https://gitlab.com/manning-ncsa/htcondor-helm-chart/). The chart is
currently a work-in-progress draft that I published only to
solicit feedback. (Don't worry about the password in
`templates/secrets.yaml`; there is no ingress to the cluster and I will
replace the password when it matters.)
The three pods are able to start, and I can submit a job via the access
point (i.e. the "submit" pod), but the job remains in an Idle state.
```
    $ kubectl exec -it -n htcondor htcondor-submit-0 -- bash -c 'cd /tmp/htcondor && condor_submit sleep.sub'
    Submitting job(s).
    1 job(s) submitted to cluster 1.
    $ kubectl exec -it -n htcondor htcondor-submit-0 -- bash -c 'condor_status'
    Name                                   OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime
    slot1@htcondor-worker-64d86c7497-f5wsp LINUX      X86_64 Unclaimed Idle      0.000 1024  0+00:00:00
                 Total Owner Claimed Unclaimed Matched Preempting Drain Backfill BkIdle
    X86_64/LINUX     1     0       0         1       0          0     0        0      0
           Total     1     0       0         1       0          0     0        0      0
    $ kubectl exec -it -n htcondor htcondor-submit-0 -- bash -c 'condor_q'
    -- Schedd: htcondor-submit-0.htcondor-submit.htcondor.svc.cluster.local : <10.42.96.15:38553?... @ 02/05/24 21:55:37
    OWNER  BATCH_NAME    SUBMITTED   DONE   RUN    IDLE  TOTAL JOB_IDS
    condor ID: 1        2/5  21:55      _      _      1      1 1.0
    Total for query: 1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended
    Total for all users: 1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended
```
When I look at the CollectorLog on the Central Manager (cm):
```
    $ kubectl exec -it -n htcondor htcondor-cm-0 -- bash -c 'less /var/log/condor/CollectorLog'
```
I see a mix of `PERMISSION DENIED` messages and messages about
read timeouts and failed socket writes that look like network failures:
```
    $ kubectl exec -it -n htcondor htcondor-cm-0 -- bash -c 'cat /var/log/condor/CollectorLog'
    02/05/24 21:54:52 Setting maximum file descriptors to 10240.
    02/05/24 21:54:52 ******************************************************
    02/05/24 21:54:52 ** condor_collector (CONDOR_COLLECTOR) STARTING UP
    02/05/24 21:54:52 ** /usr/sbin/condor_collector
    02/05/24 21:54:52 ** SubsystemInfo: name=COLLECTOR type=COLLECTOR(2) class=DAEMON(1)
    02/05/24 21:54:52 ** Configuration: subsystem:COLLECTOR local:<NONE> class:DAEMON
    02/05/24 21:54:52 ** $CondorVersion: 23.0.3 2024-01-04 BuildID: 700474 PackageID: 23.0.3-1 $
    02/05/24 21:54:52 ** $CondorPlatform: x86_64_AlmaLinux8 $
    02/05/24 21:54:52 ** PID = 32
    02/05/24 21:54:52 ** Log last touched time unavailable (No such file or directory)
    02/05/24 21:54:52 ******************************************************
    02/05/24 21:54:52 Using config source: /etc/condor/condor_config
    02/05/24 21:54:52 Using local config sources:
    02/05/24 21:54:52    /etc/condor/config.d/00-htcondor-9.0.config
    02/05/24 21:54:52    /etc/condor/config.d/01-env.conf
    02/05/24 21:54:52    /etc/condor/config.d/01-misc.conf
    02/05/24 21:54:52    /etc/condor/config.d/01-role.conf
    02/05/24 21:54:52    /etc/condor/config.d/01-security.conf
    02/05/24 21:54:52    /etc/condor/config.d/10-stash-plugin.conf
    02/05/24 21:54:52    /etc/condor/condor_config.local
    02/05/24 21:54:52 config Macros = 83, Sorted = 83, StringBytes = 2826, TablesBytes = 3084
    02/05/24 21:54:52 CLASSAD_CACHING is ENABLED
    02/05/24 21:54:52 Daemon Log is logging: D_ALWAYS D_ERROR D_STATUS
    02/05/24 21:54:52 SharedPortEndpoint: waiting for connections to named socket collector
    02/05/24 21:54:52 DaemonCore: non-shared command socket at <10.42.96.16:46443?alias=htcondor-cm-0.htcondor-cm.htcondor.svc.cluster.local>
    02/05/24 21:54:52 Daemoncore: Listening at <0.0.0.0:46443> on TCP (ReliSock) and UDP (SafeSock).
    02/05/24 21:54:52 DaemonCore: command socket at <10.42.96.16:9618?addrs=10.42.96.16-9618&alias=htcondor-cm-0.htcondor-cm.htcondor.svc.cluster.local&noUDP&sock=collector>
    02/05/24 21:54:52 DaemonCore: private command socket at <10.42.96.16:9618?addrs=10.42.96.16-9618&alias=htcondor-cm-0.htcondor-cm.htcondor.svc.cluster.local&noUDP&sock=collector>
    02/05/24 21:54:52 In ViewServer::Init()
    02/05/24 21:54:52 In CollectorDaemon::Init()
    02/05/24 21:54:52 In ViewServer::Config()
    02/05/24 21:54:52 In CollectorDaemon::Config()
    02/05/24 21:54:52 COLLECTOR_GETAD_OPTIONS set to fast lazy-parse (0x30)
    02/05/24 21:54:52 ABSENT_REQUIREMENTS = None
    02/05/24 21:54:52 OfflineCollectorPlugin::configure: no persistent store was defined for off-line ads.
    02/05/24 21:54:52 enable: Creating stats hash table
    02/05/24 21:54:52 Enabling CCB Server.
    02/05/24 21:54:52 Will generate a bootstrap file.
    02/05/24 21:54:52 Will generate a new certificate file.
    02/05/24 21:54:53 CollectorAd  : Inserting ** "< My Pool - 10.43.222.99@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx >"
    02/05/24 21:55:13 condor_read(): timeout reading 5 bytes from collector 10.43.222.99.
    02/05/24 21:55:13 IO: Failed to read packet header
    02/05/24 21:55:13 SECMAN: no classad from server, failing
    02/05/24 21:55:13 ERROR: SECMAN:2007:Failed to end classad message.
    02/05/24 21:55:13 Failed to send update to collector 10.43.222.99.
    02/05/24 21:55:13 Unable to send UPDATE_COLLECTOR_AD to all configured collectors
    02/05/24 21:55:13 condor_write(): Socket closed when trying to write 565 bytes to <10.42.240.0:21038>, fd is 14
    02/05/24 21:55:13 Buf::write(): condor_write() failed
    02/05/24 21:55:13 SECMAN: Error sending response classad to <10.42.240.0:21038>!
    AuthMethods = "FS,TOKEN,PASSWORD"
    Authentication = "REQUIRED"
    Command = 19
    ConnectSinful = "<10.43.222.99:9618>"
    CryptoMethods = "AES,BLOWFISH,3DES"
    ECDHPublicKey = "BILzoNAGQHRbtX38YiEBurMAY3L9pVl1DUVwY61Buf5GMX9An4MI9lzfxJgH2LUuuSfQfKJj6sXnKEay3zurtc8="
    Enact = "NO"
    Encryption = "REQUIRED"
    Integrity = "REQUIRED"
    IssuerKeys = "POOL"
    NegotiatedSession = true
    NewSession = "YES"
    OutgoingNegotiation = "REQUIRED"
    ParentUniqueID = "htcondor-cm-0:29:1707170091"
    RemoteVersion = "$CondorVersion: 23.0.3 2024-01-04 BuildID: 700474 PackageID: 23.0.3-1 $"
    ServerCommandSock = "<10.42.96.16:9618?addrs=10.42.96.16-9618&alias=htcondor-cm-0.htcondor-cm.htcondor.svc.cluster.local&noUDP&sock=collector>"
    ServerPid = 32
    SessionDuration = "86400"
    SessionLease = 3600
    Subsystem = "COLLECTOR"
    TrustDomain = "htcondor-cm-0.htcondor-cm.htcondor.svc.cluster.local"
    02/05/24 21:55:13 condor_write(): Socket closed when trying to write 13 bytes to <10.42.96.15:33973>, fd is 13
    02/05/24 21:55:13 Buf::write(): condor_write() failed
    02/05/24 21:55:13 AUTHENTICATE: handshake failed!
    02/05/24 21:55:13 DC_AUTHENTICATE: required authentication of 10.42.96.15 failed: AUTHENTICATE:1002:Failure performing handshake
    02/05/24 21:55:13 WARNING: forward resolution of _gateway doesn't match 10.42.240.0!
    02/05/24 21:55:13 PERMISSION DENIED to condor@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx from host 10.42.240.0 for command 10 (QUERY_STARTD_PVT_ADS), access level NEGOTIATOR: reason: NEGOTIATOR authorization policy contains no matching ALLOW entry for this request; identifiers used for this host: 10.42.240.0, hostname size = 0, original ip address = 10.42.240.0
    02/05/24 21:55:13 DC_AUTHENTICATE: Command not authorized, done!
    02/05/24 21:55:13 WARNING: forward resolution of _gateway doesn't match 10.42.240.0!
    02/05/24 21:55:13 MasterAd     : Inserting ** "< htcondor-cm-0.htcondor-cm.htcondor.svc.cluster.local >"
    02/05/24 21:55:13 PERMISSION DENIED to condor@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx from host 10.42.240.0 for command 49 (UPDATE_NEGOTIATOR_AD), access level NEGOTIATOR: reason: cached result for NEGOTIATOR; see first case for the full reason
    02/05/24 21:55:13 DC_AUTHENTICATE: Command not authorized, done!
    02/05/24 21:55:13 MasterAd     : Inserting ** "< htcondor-submit-0.htcondor-submit.htcondor.svc.cluster.local >"
    02/05/24 21:55:13 MasterAd     : Inserting ** "< htcondor-worker-64d86c7497-f5wsp >"
    02/05/24 21:55:13 StartdAd     : Inserting ** "< slot1@htcondor-worker-64d86c7497-f5wsp >"
    02/05/24 21:55:13 StartdPvtAd  : Inserting ** "< slot1@htcondor-worker-64d86c7497-f5wsp >"
    02/05/24 21:55:23 PERMISSION DENIED to condor@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx from host 10.42.240.0 for command 10 (QUERY_STARTD_PVT_ADS), access level NEGOTIATOR: reason: cached result for NEGOTIATOR; see first case for the full reason
    02/05/24 21:55:23 DC_AUTHENTICATE: Command not authorized, done!
    ...
```
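To make the pattern in the denials above easier to see, here is a minimal sketch that tallies the `PERMISSION DENIED` entries by command name and access level. The function name `summarize_denials` and the local file name `CollectorLog.txt` are my own placeholders, and the sketch assumes each log message sits on a single line in the real file (the wrapping above comes from the mail client).

```shell
# Hypothetical helper: count PERMISSION DENIED entries per command and access level.
# $1 is the path to a local copy of the CollectorLog.
summarize_denials() {
  grep 'PERMISSION DENIED' "$1" |
    # Keep only "<COMMAND_NAME> <ACCESS_LEVEL>" from each denial line.
    sed -E 's/.*for command [0-9]+ \(([A-Z_]+)\), access level ([A-Z_]+).*/\1 \2/' |
    sort | uniq -c |
    # Normalize the padding that uniq -c adds in front of the count.
    awk '{print $1, $2, $3}'
}
```

With the log copied out of the pod (for example with `kubectl exec -n htcondor htcondor-cm-0 -- cat /var/log/condor/CollectorLog > CollectorLog.txt`), running `summarize_denials CollectorLog.txt` prints one line per distinct command and access level, prefixed by how often it was denied.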
My custom configuration should be fully captured in
`templates/configmap.yaml`, which is mounted at
`/etc/condor/condor_config.local` in each container.
Any help with how to proceed with debugging this would be appreciated.