Daniel,

Thanks for the reply.
Your analysis here is wrong. The STARTER gets SIGTERM. It then kills the job.
Note that the SIGQUIT comes after the job has exited. This is part of the normal termination of the STARTER by the STARTD after the job has finished. The STARTER doesn't know why the job exited, only that it did.
I see. So regardless of the job's exit status, the starter only knows that the job has exited, and the startd then terminates the starter?
Why are administrators killing Condor jobs? Note that I don't say "sending them SIGQUIT", because that isn't what is happening; they are killing the jobs outside of Condor. Why aren't they using condor_vacate or condor_vacate_job for this purpose? Otherwise there is no way for Condor to know why the job exited.
The problem is that the majority of our machine owners are also local administrators on those machines, and the pool is too big and varied to instruct everyone on condor_vacate and suspension-policy settings. So what sometimes happens is that a machine owner logs in and kills a suspended condor_exec process to reclaim resources.
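What we would really like an owner to run in that situation is something along these lines (the hostname and job ID below are invented for illustration):

    # Gracefully evict every Condor job on the machine in question:
    condor_vacate execnode01.example.com

    # Or evict just the offending job by its cluster.proc ID:
    condor_vacate_job 1234.0

Either of those would at least let Condor see a normal eviction instead of an unexplained kill, but getting that message out across the whole pool hasn't been practical.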
We could default to a want_suspend = false policy, eliminating the need for local administrators to reclaim resources, but since most jobs do not checkpoint, we'd prefer to keep suspension where possible.
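For concreteness, the change we're weighing is roughly this in the startd configuration (a sketch only; our real policy expressions are more involved):

    # Sketch: disable suspension entirely, so an interactive owner causes
    # the job to be evicted rather than left suspended on the machine.
    WANT_SUSPEND = FALSE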
If we can't "catch" jobs that are being killed outside Condor, I suppose the only option is to re-queue them after reviewing the logs for non-zero return values?
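Alternatively, rather than grepping the logs by hand, perhaps something like this in the submit description would re-queue them automatically (a sketch, assuming our jobs can safely be rerun from the beginning):

    # Sketch: leave the job in the queue (so it is rerun) unless it
    # exited on its own with status 0; a job killed out from under
    # Condor should come back with a signal or non-zero exit status.
    on_exit_remove = (ExitBySignal == False) && (ExitCode == 0)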
Thanks,
Rob de Graaf