
[HTCondor-users] Send SIGTERM vs SIGKILL on schedd.act(htcondor2.JobAction.Remove)



Hi,

I have a job that starts a docker container after some setup. In order to stop and remove the container if the job is cancelled, I've added Python signal handlers to intercept SIGTERM and shut down the container. When I run the script from the command line and `kill` it, I see it log the signal number (15) and stop the container.
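
For reference, a minimal sketch of this kind of handler, not the actual script. The container name `my-job-container` and the helper names `stop_container`, `make_handler` and `install_handlers` are hypothetical, and it assumes the `docker` CLI is on the PATH:

```python
import signal
import subprocess
import sys

CONTAINER_NAME = "my-job-container"  # hypothetical name, for illustration only


def stop_container(name):
    """Stop and remove the container using the docker CLI (assumed to be on PATH)."""
    subprocess.run(["docker", "stop", name], check=False)
    subprocess.run(["docker", "rm", name], check=False)


def make_handler(cleanup):
    """Build a signal handler that logs the signal number, cleans up, then exits."""
    def handler(signum, frame):
        # Log to stderr and flush so the line appears in the job's error file.
        print(f"received signal {signum}", file=sys.stderr, flush=True)
        cleanup()
        # Conventional exit status for "terminated by signal N".
        sys.exit(128 + signum)
    return handler


def install_handlers(cleanup, signums=(signal.SIGTERM, signal.SIGINT)):
    """Register the same handler for each signal; must run in the main thread."""
    handler = make_handler(cleanup)
    for signum in signums:
        signal.signal(signum, handler)
```

The job script would call `install_handlers(lambda: stop_container(CONTAINER_NAME))` before starting the container. SIGKILL cannot be caught, so a handler like this never runs if that is the signal delivered.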

However, when I run the same script as a condor job and use schedd.act() to Remove it, there are no logs from the signal handler, which I assume means condor is sending a SIGKILL. I have graceful shutdown [1] enabled in the job ClassAd:

   WantGracefulRemoval = true;

I thought this meant that the process should get a SIGTERM, but that doesn't appear to be happening. The max vacate time is 10m, which should be plenty:

[root@95754e4e2337 condor_workdir]# condor_config_val MachineMaxVacateTime
10 * 60

Based on the above, I'm not sure why my script isn't getting a SIGTERM, or how to debug further. Could anyone provide any hints?

Thanks in advance,

-g

[1] https://htcondor.readthedocs.io/en/v8_8/man-pages/condor_submit.html#want-graceful-removal