Hi,
I have a job that starts a Docker container after some setup. To stop and remove
the container if the job is cancelled, I've added Python signal handlers that
intercept SIGTERM and shut down the container. When I run the script from the
command line and `kill` it, I see it log the signal number (15) and stop the
container.
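A minimal sketch of this kind of cleanup handler, for illustration only. It assumes the container is started under a known name (`my-job-container` is hypothetical) and is managed through the `docker` CLI via `subprocess`. The helper names `stop_and_remove_container`, `make_handler` and `install_handlers` are also hypothetical:

```python
import logging
import signal
import subprocess
import sys

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("job")

# Hypothetical container name; use whatever the setup step passes to `docker run --name`.
CONTAINER_NAME = "my-job-container"


def stop_and_remove_container(name: str) -> None:
    """Stop, then remove, the container.

    check=False: a non-zero exit (e.g. the container is already gone) is not
    treated as an error.
    """
    subprocess.run(["docker", "stop", name], check=False)
    subprocess.run(["docker", "rm", name], check=False)


def make_handler(cleanup):
    """Return a signal handler that logs the signal, runs cleanup, then exits."""

    def handler(signum, frame):
        log.info("received signal %d, cleaning up", signum)
        try:
            cleanup()
        finally:
            # Conventional exit status for "terminated by signal N" is 128 + N.
            sys.exit(128 + signum)

    return handler


def install_handlers(cleanup) -> None:
    """Route SIGTERM (and SIGINT, for Ctrl-C) to the cleanup handler."""
    handler = make_handler(cleanup)
    for sig in (signal.SIGTERM, signal.SIGINT):
        signal.signal(sig, handler)
```

Calling `install_handlers(lambda: stop_and_remove_container(CONTAINER_NAME))` before starting the container wires it up. Passing the cleanup in as a callable keeps the handler itself free of Docker specifics, so it can be exercised without a running container.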
However, when I run the same script as an HTCondor job and use schedd.act() to
remove it, there are no logs from the signal handler, which I assume means
HTCondor is sending a SIGKILL. I have graceful shutdown [1] enabled in the job
ClassAd:
   WantGracefulRemoval = true;
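A minimal sketch of that removal with the HTCondor Python bindings. It also prints `WantGracefulRemoval` from the queued job ad, to confirm the attribute really is set on the job being removed. `job_constraint` and `remove_job` are hypothetical helper names, and the job is assumed to be identified by its cluster and proc IDs:

```python
def job_constraint(cluster_id: int, proc_id: int = 0) -> str:
    """Build a ClassAd constraint that matches exactly one job."""
    return f"ClusterId == {cluster_id} && ProcId == {proc_id}"


def remove_job(cluster_id: int, proc_id: int = 0):
    """Print WantGracefulRemoval for the job, then ask the schedd to remove it."""
    # Imported here so job_constraint() works even without the bindings installed.
    import htcondor

    schedd = htcondor.Schedd()
    constraint = job_constraint(cluster_id, proc_id)
    # Check the attribute on the job ad the schedd actually holds.
    for ad in schedd.query(constraint=constraint, projection=["WantGracefulRemoval"]):
        print("WantGracefulRemoval =", ad.get("WantGracefulRemoval"))
    return schedd.act(htcondor.JobAction.Remove, constraint)
```

For example, `remove_job(1234)` would target job 1234.0 (a made-up ID).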
I thought this meant the process should receive a SIGTERM, but that doesn't appear to be happening. The maximum vacate time is 10 minutes, which should be plenty:
[root@95754e4e2337 condor_workdir]# condor_config_val MachineMaxVacateTime
10 * 60
Based on the above, I'm not sure why my script isn't getting a SIGTERM, or how to debug this further. Could anyone provide some hints?
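One way to narrow this down is to have the job log each signal it receives, along with its pid and parent pid. Below is a minimal debugging sketch. The `WATCHED` list and the `trace_signals` name are illustrative, not part of any HTCondor API. The handler only logs and does not exit, so it is for diagnosis only. If the process disappears without logging anything, it was most likely killed with SIGKILL, which no handler can intercept, or the signal went to a different process in the job:

```python
import os
import signal
import sys
import time

# Common termination/notification signals. SIGKILL and SIGSTOP can never be
# caught, so they cannot be traced from inside Python.
WATCHED = ("SIGTERM", "SIGINT", "SIGHUP", "SIGQUIT", "SIGUSR1", "SIGUSR2", "SIGCONT")


def trace_signals(stream=None):
    """Install a log-only handler for each watched signal; return those installed.

    The handler does NOT exit the process: this is a debugging aid only.
    """
    out = stream if stream is not None else sys.stderr

    def handler(signum, frame):
        # pid/ppid show which process in the job's process tree got the signal.
        out.write(
            f"{time.strftime('%H:%M:%S')} pid={os.getpid()} ppid={os.getppid()} "
            f"got {signal.Signals(signum).name} ({signum})\n"
        )
        out.flush()

    installed = []
    for name in WATCHED:
        sig = getattr(signal, name, None)  # some of these do not exist on Windows
        if sig is not None:
            signal.signal(sig, handler)
            installed.append(sig)
    return installed
```

`stream` defaults to stderr but can be any file-like object.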
Thanks in advance,
-g