
Re: [HTCondor-users] Send SIGTERM vs SIGKILL on schedd.act(htcondor2.JobAction.Remove)



Hi Gavin,

By default (without setting want_graceful_removal), a vanilla universe job should receive a SIGTERM when it is removed from the AP, regardless of whether you use the Python API. That said, visibility into which signal is sent at removal time is poor: Condor marks everything as removed, and the job's standard output and error are not transferred back. To convince myself that the starter was in fact sending a SIGTERM, I had to write debugging information to a file in a well-known location. Here is the simple script and job template I used to test on a mini-condor running on macOS:

=== Test Script ===
#!/usr/bin/env python3

import os
import sys
import signal
from time import sleep

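def handler(sig, frame):
    # stdout is not transferred back when the job is removed, so this print
    # is only visible when running the script by hand.
    print(f"Received signal {sig} ({signal.Signals(sig).name})")
    # Record the signal in a well-known location that survives job removal.
    with open("/Users/colebollig/Desktop/tests/signal/sig.caught", "w") as f:
        f.write(f"Signal {sig}: {signal.Signals(sig).name}\n")
    sys.exit(0)

# Install the handler before sleeping so a SIGTERM is caught, not fatal.
signal.signal(signal.SIGTERM, handler)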

pid = os.getpid()
print(f"PID: {pid}")

delay = 60
if len(sys.argv) == 2:
    delay = int(sys.argv[1])

sleep(delay)

sys.exit(1)


=== Job Template ===
executable = sigtest.py
arguments  = "300"

output     = job.debug
error      = job.debug
log        = job.log

queue
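
Before submitting, it can help to confirm locally that the handler itself behaves as expected, which is the same check as running the script by hand and using `kill` on it. The following is only a minimal sketch of such a check, not part of the test above: the child script is a hypothetical stand-in for sigtest.py that takes the marker file path as an argument instead of a hard-coded one, and `run_signal_check` is an illustrative helper name. It assumes a POSIX system.

```python
import os
import signal
import subprocess
import sys
import tempfile
import textwrap

# Hypothetical stand-in for sigtest.py: same idea, but the marker path
# comes from argv instead of being hard-coded.
CHILD_SOURCE = textwrap.dedent("""
    import signal, sys, time

    marker = sys.argv[1]

    def handler(sig, frame):
        with open(marker, "w") as f:
            f.write(f"Signal {sig}: {signal.Signals(sig).name}\\n")
        sys.exit(0)

    signal.signal(signal.SIGTERM, handler)
    print("ready", flush=True)   # tells the parent the handler is installed
    time.sleep(60)
    sys.exit(1)
""")


def run_signal_check(sig=signal.SIGTERM, timeout=10):
    """Start the child, send it `sig`, and return (exit_code, marker_text)."""
    with tempfile.TemporaryDirectory() as tmp:
        script = os.path.join(tmp, "child.py")
        marker = os.path.join(tmp, "sig.caught")
        with open(script, "w") as f:
            f.write(CHILD_SOURCE)

        proc = subprocess.Popen(
            [sys.executable, script, marker],
            stdout=subprocess.PIPE,
            text=True,
        )
        proc.stdout.readline()        # block until the child prints "ready"
        proc.send_signal(sig)
        exit_code = proc.wait(timeout=timeout)
        proc.stdout.close()

        # The marker file exists only if the handler actually ran.
        marker_text = None
        if os.path.exists(marker):
            with open(marker) as f:
                marker_text = f.read().strip()
        return exit_code, marker_text


if __name__ == "__main__":
    print("SIGTERM:", run_signal_check(signal.SIGTERM))
    print("SIGKILL:", run_signal_check(signal.SIGKILL))
```

With SIGTERM the marker file should be written and the child should exit 0. With SIGKILL no marker should appear, since SIGKILL cannot be caught, and Python reports the exit code as the negative signal number. A missing marker after a Condor removal would therefore be consistent with a SIGKILL, though it could also mean the signal never reached this particular process.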

-----

Also, I am not sure what setup you do before starting the Docker container, but I would recommend using the docker universe (if possible) and letting Condor manage the Docker containers on your behalf. We try to advise people against wrapper scripts, as they can hide things from Condor's view and potentially lead to weird issues.

Cheers,
Cole Bollig


From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Gavin Price <gaprice@xxxxxxx>
Sent: Saturday, January 3, 2026 7:18 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] Send SIGTERM vs SIGKILL on schedd.act(htcondor2.JobAction.Remove)
 
Hi,

I have a job that starts a docker container after some setup. In order to stop and remove the container if the job is cancelled, I've added Python signal handlers to intercept SIGTERM and shut down the container. When I run the script from the command line and `kill` it, I see it log the signal number (15) and stop the container.

However, when I run the same script as a condor job and use schedd.act() to Remove it, there are no logs from the signal handler, which I assume means condor is sending a SIGKILL. I have graceful shutdown [1] enabled in the job ClassAd:

       WantGracefulRemoval = true; 

I thought this meant that the process should get a SIGTERM, but that doesn't appear to be happening. The max vacate time is 10m, which should be plenty:

[root@95754e4e2337 condor_workdir]# condor_config_val MachineMaxVacateTime
10 * 60

Based on the above, I'm not sure why my script isn't getting a SIGTERM or how to debug further. Could anyone provide any hints?

Thanks in advance, 

-g