Hello!
With HTCondor v24.0.6, I am seeing unexpected behavior when attempting to deactivate a claim for a statically provisioned Windows Server 2019 execution point with ENABLE_STARTD_DAEMON_AD set to False. I have many platform-agnostic executables that specify kill_sig=SIGINT as part of their submission. After migrating to newer versions, removal of claims for Windows execution points stopped working, even though the docs state that Windows does not consider kill_sig. Here is some example logging I see:
==> StarterLog.slot1 <==
(pid:1084) Got SIGTERM. Performing graceful shutdown.
(pid:1084) ShutdownGraceful all jobs.
(pid:1084) Send_Signal: ERROR Attempt to send signal 2 to pid 6064, but pid 6064 has no command socket # This is the job's PID
(pid:1084) Send (softkill) signal failed, retrying...
==> StartLog <==
slot1: State change: received VACATE_CLAIM command
slot1: Changing activity: Busy -> Retiring
slot1: State change: claim retirement ended/expired
slot1: Changing state and activity: Claimed/Retiring -> Preempting/Vacating
==> StarterLog.slot1 <==
(pid:1084) Send_Signal: ERROR Attempt to send signal 2 to pid 6064, but pid 6064 has no command socket
(pid:1084) Send (softkill) signal failed twice, hardkill will fire after timeout
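For context, a minimal sketch of the kind of submit description involved. Only the kill_sig line reflects the setup described above; the executable, arguments, and file names are hypothetical placeholders rather than the real jobs:

```
# Hypothetical submit description; every name here is a placeholder.
executable = my_tool
arguments  = --input data.txt

# The one setting taken from the description above: request SIGINT as the
# soft-kill signal. The same file is used for Linux and Windows execution points.
kill_sig   = SIGINT

output     = job.out
error      = job.err
log        = job.log
queue
```

Signal 2 in the Send_Signal errors above corresponds to SIGINT, which suggests the starter is trying to honor kill_sig on the Windows execution point.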
I believe this could be related to PR #665, but I am not sure whether it is a misconfiguration. Any help would be greatly appreciated!
Let me know if any other logging would be helpful with diagnosing this.
Thanks,
T. Rock