Re: [HTCondor-users] MaxVacateTime and KILLING_TIMEOUT seemingly not honored
- Date: Tue, 1 Sep 2020 11:26:17 -0500
- From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] MaxVacateTime and KILLING_TIMEOUT seemingly not honored
Hi Alec,
I am getting a little lost reading the below; please help me
clarify....
My understanding is as follows:
You are using startd rank (i.e. RANK = XXX in your condor config)
to prefer running specific jobs submitted by a specific user.
When these jobs run and you then issue a condor_rm, the job is
sent a SIGTERM but is not sent a SIGKILL after 10 minutes (your
MaxVacateTime). However, when you submit a test script as the
same specific user, your test script indeed does receive the
SIGKILL 10 minutes after SIGTERM.
Did I get it right?
If so, it may help to focus on any differences between how you
submitted your test job and how your real jobs are submitted. For
example, perhaps your test job is universe=vanilla and your real
jobs are universe=docker? Also what version of HTCondor are you
using?
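(For comparison, the universe is chosen in the submit description. A minimal sketch, with the executable and image names purely as placeholders, not taken from your setup:

```
# vanilla universe
universe   = vanilla
executable = test_signal.sh
queue

# a docker-universe job additionally names an image, and signals
# are delivered through the container runtime:
# universe     = docker
# docker_image = some/image:tag
```

In the docker case the signal path goes through docker, which is one place where SIGTERM/SIGKILL delivery can behave differently than for a plain vanilla job.)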
Thanks
Todd
On 9/1/2020 9:58 AM, Alec Sheperd wrote:
Hello,
I've been having a long-standing issue with our Condor cluster
that I have not been able to crack, primarily pertaining to jobs
not being issued SIGKILL after having been allocated the time
specified in MaxVacateTime.
Some background info: there are certain jobs that need to run in
response to specific events. To satisfy this, we have rank
preemption set up for these jobs, which get submitted under a
specific user so that they start ASAP. I'm not 100% knowledgeable
about the code being run, but the general idea is that these jobs
run until removed by other means (i.e. they will never exit of
their own accord). Normally this has been done by issuing a
condor_rm once the work they are doing has been deemed complete.
More recently, whether due to changes in the host machines, the
condor configuration, or the code itself, the jobs never get
removed via condor_rm, and have to be killed locally on the
execute host by manually sending SIGKILL to both the starter and
the child process.
The child process does not properly handle SIGTERM, and for
reasons beyond my control, I cannot do much to change this on the
code side. However, it seems strange to me that a SIGKILL is not
sent after reaching MaxVacateTime, which is set to
MaxVacateTime = 10 * $(MINUTE). Not only that, but the startd's
KILLING_TIMEOUT, which is at the default 30 seconds, does not
seem to be honored either. Watching with strace seems to confirm
that the SIGKILLs are never issued in these cases.
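For reference, the behavior I expect comes down to these two knobs on the execute node (the MaxVacateTime value is ours; the KILLING_TIMEOUT line just makes the default explicit):

```
# condor_config on the execute host
MaxVacateTime   = 10 * $(MINUTE)   # graceful window between SIGTERM and the hard kill
KILLING_TIMEOUT = 30               # seconds the startd waits for the starter to kill the job
```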
I've tested it with scripts like
#!/bin/bash
# Swallow SIGTERM so that only SIGKILL can stop this script.
trap "echo 'do nothing'" SIGTERM
# Busy-loop forever; never exit on its own.
while :; do :; done
That script does receive the SIGKILL as expected, however, so I'm
not sure what the difference is. I've wondered whether rank
expressions prevent this from happening, but running the above
script as the user with rank preemption still ultimately does the
correct thing.
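For anyone following along, the SIGTERM-vs-SIGKILL behavior I expect from the startd can be reproduced outside HTCondor entirely; a small sketch (timings and the sleep duration are arbitrary):

```shell
#!/bin/bash
# Start a process that ignores SIGTERM, mimicking a job that traps it.
bash -c "trap '' TERM; sleep 5" &
pid=$!
sleep 0.5   # give the child time to install the trap

kill -TERM "$pid"            # SIGTERM alone: ignored by the trap
sleep 0.5
if kill -0 "$pid" 2>/dev/null; then alive_after_term=yes; else alive_after_term=no; fi

kill -KILL "$pid"            # SIGKILL cannot be trapped or ignored
sleep 0.5
if kill -0 "$pid" 2>/dev/null; then alive_after_kill=yes; else alive_after_kill=no; fi
wait "$pid" 2>/dev/null

echo "after SIGTERM: alive=$alive_after_term; after SIGKILL: alive=$alive_after_kill"
```

The process survives the SIGTERM but not the SIGKILL, which is exactly the escalation MaxVacateTime is supposed to enforce.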
Any thoughts or ideas to test would be greatly appreciated!
Alec
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to
htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/