
Re: [HTCondor-users] Windows Claim Deactivation



Tim and Jamie,

> Are these logs from Windows?
The logs I provided were from the Windows execution point.

> Does the job keep running?
Yes, the job continues to run until the max vacate time is hit and it is forcibly removed.

> What is the prior working version?
I was migrating this system from v8.4.11 to v24.0.6. Apologies that this is quite a large amount of changes over a large time window.

> How do you expect the Windows jobs to be killed?
I would hope all kill_sig options would be coalesced into the behavior SIGTERM appears to have, where the Windows softkill program is invoked and WM_CLOSE is sent to the external job process.
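
To make the expectation above concrete, here is a minimal sketch of the coalescing I have in mind. It is purely illustrative and is not HTCondor code: the function name `effective_soft_kill` and the returned label strings are hypothetical, and the only assumption carried over from HTCondor is that the machine's OpSys value reads "WINDOWS" on Windows execution points.

```python
# Hypothetical illustration only; this is NOT how the HTCondor starter is written.
def effective_soft_kill(opsys: str, kill_sig: str = "SIGTERM") -> str:
    """Return the soft-kill action one might expect for a job.

    opsys    -- operating system of the execution point (assumed "WINDOWS" or e.g. "LINUX")
    kill_sig -- the kill_sig value given at submission
    """
    if opsys.upper() == "WINDOWS":
        # Unix signals have no meaning here, so any kill_sig collapses
        # into the softkill path that delivers WM_CLOSE to the job.
        return "WM_CLOSE via softkill"
    # On other platforms the requested signal is used unchanged.
    return kill_sig


for opsys in ("LINUX", "WINDOWS"):
    print(opsys, "->", effective_soft_kill(opsys, "SIGINT"))
```

With this behavior, the same submission with kill_sig=SIGINT would be usable on both platforms without a per-platform submit file.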

Thank you for the prompt response! Let me know if any other information would be useful.

Best,

T. Rock

On Tue, May 6, 2025 at 1:21 PM Jaime Frey via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:
It looks like, despite what the manual claims, the kill_sig job attribute is used on both Unix and Windows for all job universes, even though it is nonsensical on Windows. What version are you upgrading from?

- Jaime

On May 6, 2025, at 12:38 AM, Thomas Rock <drthomasrock@xxxxxxxxx> wrote:

Hello!

With HTCondor v24.0.6, I am seeing unexpected behavior when attempting to deactivate a claim for a statically provisioned Windows Server 2019 execution point with ENABLE_STARTD_DAEMON_AD set to False. I have many platform-agnostic executables that specify kill_sig=SIGINT as part of their submission. After migrating to the newer version, removal of claims for Windows execution points stopped working, despite the docs stating that Windows does not consider kill_sig. Here is some example logging I see:

==> StarterLog.slot1 <==
(pid:1084) Got SIGTERM. Performing graceful shutdown.
(pid:1084) ShutdownGraceful all jobs.
(pid:1084) Send_Signal: ERROR Attempt to send signal 2 to pid 6064, but pid 6064 has no command socket # This is the job's PID
(pid:1084) Send (softkill) signal failed, retrying...

==> StartLog <==
slot1: State change: received VACATE_CLAIM command
slot1: Changing activity: Busy -> Retiring
slot1: State change: claim retirement ended/expired
slot1: Changing state and activity: Claimed/Retiring -> Preempting/Vacating

==> StarterLog.slot1 <==
(pid:1084) Send_Signal: ERROR Attempt to send signal 2 to pid 6064, but pid 6064 has no command socket
(pid:1084) Send (softkill) signal failed twice, hardkill will fire after timeout

I believe this could be related to PR #665, but I am not sure if it is a misconfiguration. Any help would be greatly appreciated!

Let me know if any other logging would be helpful with diagnosing this.

Thanks,

T. Rock
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe

Join us in June at Throughput Computing 25: https://osg-htc.org/htc25

The archives can be found at: https://www-auth.cs.wisc.edu/lists/htcondor-users/
