Re: [HTCondor-users] limit defrag to particular partitionable slots?
- Date: Thu, 18 Jul 2024 16:19:06 -0500
- From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] limit defrag to particular partitionable slots?
On 7/18/2024 1:07 PM, R Tapia wrote:
> Thanks for the reply. I saw this, but the documentation seems to
> indicate that this is a per-machine (condor_startd) decision. It
> sounds like I could use this to exclude an entire machine from
> defragmentation, but not to drain jobs in one slot while leaving the
> other slot alone.
>
> Is that right, or am I misreading the documentation?
Hi Ron,
I think you understood it correctly.
Draining is currently an operation performed on an entire EP
(startd), not on a particular slot.
However, you can configure things in such a way as to (mostly) get
what you want. When you issue a condor_drain command (or when the
defrag daemon does so), essentially two things happen:
1. Any job that has run longer than the MaxJobRetirementTime on that
   slot will be sent a SIGTERM (assuming the default 'graceful' style
   of drain).

2. By default, the Requirements expression on every slot will switch
   to False (so no new jobs will match the slot).
Imagine you want a startd with two partitionable slots (pslots), one
named "DrainMe" that allows draining and another named "NoDrain" that
does not. The trick is to configure the startd so that the NoDrain
slots have a very large MaxJobRetirementTime to negate item #1 above.
To negate item #2, when a drain command is issued, you can provide a
custom Requirements clause that allows matches to continue to happen
on pslot NoDrain (see the defrag-side sketch just below).
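An aside, since your original question was about the defrag daemon
rather than a manual condor_drain: if I remember right, recent
versions have a defrag knob, DEFRAG_DRAINING_START_EXPR, that
supplies exactly this custom start expression whenever condor_defrag
initiates a drain. A minimal untested sketch, assuming slot type 1 is
the NoDrain pslot as in the demo below (check the manual for your
version for the exact knob name and quoting):

   # condor_defrag config sketch (untested): while defrag drains a
   # startd, keep slots with SlotId 1 (the NoDrain pslot and its
   # dynamic children) matchable instead of flipping them to False.
   DEFRAG_DRAINING_START_EXPR = SlotId == 1

The demo below uses a hand-issued condor_drain instead, so you can
see each step.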
To demonstrate, I fired up the minicondor container, configured two
pslots as described above, submitted a bunch of jobs, and then issued
a drain. I saw that the jobs in the NoDrain slot were still running,
removed a running job, and observed that the NoDrain slot happily
picked up another job to run:
% docker run -it --rm --hostname test1.toddt.org htcondor/mini bash
[root@test1 /]# cat - > /etc/condor/config.d/50-todd-test
# Slot type 1 ("NoDrain"): huge MaxJobRetirementTime, so a graceful
# drain never forces its jobs off.
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = 50%
SLOT_TYPE_1_Name_Prefix = NoDrain
SLOT_TYPE_1_MaxJobRetirementTime = 999999999
SLOT_TYPE_1_PARTITIONABLE = True
# Slot type 2 ("DrainMe"): zero retirement time, so its jobs get a
# SIGTERM as soon as a drain is issued.
NUM_SLOTS_TYPE_2 = 1
SLOT_TYPE_2 = 50%
SLOT_TYPE_2_Name_Prefix = DrainMe
SLOT_TYPE_2_MaxJobRetirementTime = 0
SLOT_TYPE_2_PARTITIONABLE = True
<ctrl-d>
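As a quick sanity check before starting anything, condor_config_val
reads the config the same way the daemons will; the two knobs queried
here are just ones from the file above, and the output shown is what
one would expect:

   [root@test1 /]# condor_config_val SLOT_TYPE_1_Name_Prefix SLOT_TYPE_1_MaxJobRetirementTime
   NoDrain
   999999999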
[root@test1 /]# condor_master
[root@test1 /]# condor_status
Name                      OpSys  Arch    State      Activity  LoadAv  Mem   ActvtyTime

DrainMe2@xxxxxxxxxxxxxxx  LINUX  X86_64  Unclaimed  Idle      0.000   5941  0+00:00:05
NoDrain1@xxxxxxxxxxxxxxx  LINUX  X86_64  Unclaimed  Idle      0.000   5941  0+00:00:05

              Total  Owner  Claimed  Unclaimed  Matched  Preempting  Drain  Backfill  BkIdle

 X86_64/LINUX     2      0        0          2        0           0      0         0       0
        Total     2      0        0          2        0           0      0         0       0
[root@test1 /]# condor_status -af:h Name SlotId MaxJobRetirementTime
Name                      SlotId  MaxJobRetirementTime
DrainMe2@xxxxxxxxxxxxxxx  2       0
NoDrain1@xxxxxxxxxxxxxxx  1       999999999
[root@test1 /]# condor_submit executable=/bin/sleep arguments=1000000 -queue 500
Submitting job(s)............................................................
500 job(s) submitted to cluster 1.
[root@test1 /]# condor_status
Name                        OpSys  Arch    State      Activity  LoadAv  Mem   ActvtyTime

DrainMe2@xxxxxxxxxxxxxxx    LINUX  X86_64  Unclaimed  Idle      0.000   5429  0+00:03:45
DrainMe2_1@xxxxxxxxxxxxxxx  LINUX  X86_64  Claimed    Busy      0.000   128   0+00:01:05
DrainMe2_2@xxxxxxxxxxxxxxx  LINUX  X86_64  Claimed    Busy      0.000   128   0+00:01:05
DrainMe2_3@xxxxxxxxxxxxxxx  LINUX  X86_64  Claimed    Busy      0.000   128   0+00:01:04
DrainMe2_4@xxxxxxxxxxxxxxx  LINUX  X86_64  Claimed    Busy      0.000   128   0+00:01:04
NoDrain1@xxxxxxxxxxxxxxx    LINUX  X86_64  Unclaimed  Idle      0.000   5429  0+00:03:45
NoDrain1_1@xxxxxxxxxxxxxxx  LINUX  X86_64  Claimed    Busy      0.000   128   0+00:01:05
NoDrain1_2@xxxxxxxxxxxxxxx  LINUX  X86_64  Claimed    Busy      0.000   128   0+00:01:05
NoDrain1_3@xxxxxxxxxxxxxxx  LINUX  X86_64  Claimed    Busy      0.000   128   0+00:01:05
NoDrain1_4@xxxxxxxxxxxxxxx  LINUX  X86_64  Claimed    Busy      0.000   128   0+00:01:04

              Total  Owner  Claimed  Unclaimed  Matched  Preempting  Drain  Backfill  BkIdle

 X86_64/LINUX    10      0        8          2        0           0      0         0       0
        Total    10      0        8          2        0           0      0         0       0
[root@test1 /]# condor_drain -graceful -start 'SlotId == 1 ? True : False' test1.toddt.org
Sent request to drain the startd
<127.0.0.1:9618?addrs=127.0.0.1-9618&alias=test1.toddt.org&noUDP&sock=startd_27_698d>
with test1.toddt.org. This only affects the single startd; any other
startds running on the same host will not be drained.
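An aside on that -start expression: 'SlotId == 1 ? True : False' can
be written more simply as 'SlotId == 1', and since the expression is
evaluated against each slot's ad, matching on the slot Name should
work as well. An untested sketch, with the regexp pattern purely
illustrative:

   # Hypothetical alternative (untested): keep any slot whose name
   # begins with "NoDrain" matchable during the drain.
   [root@test1 /]# condor_drain -graceful -start 'regexp("^NoDrain", Name)' test1.toddt.org

Anyway, back to the demo: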
[root@test1 /]# condor_status
Name                        OpSys  Arch    State    Activity  LoadAv  Mem   ActvtyTime

DrainMe2@xxxxxxxxxxxxxxx    LINUX  X86_64  Drained  Retiring  0.000   5941  0+00:00:05
NoDrain1@xxxxxxxxxxxxxxx    LINUX  X86_64  Drained  Retiring  0.000   5429  0+00:00:05
NoDrain1_1@xxxxxxxxxxxxxxx  LINUX  X86_64  Claimed  Retiring  0.000   128   0+00:00:05
NoDrain1_2@xxxxxxxxxxxxxxx  LINUX  X86_64  Claimed  Retiring  0.000   128   0+00:00:05
NoDrain1_3@xxxxxxxxxxxxxxx  LINUX  X86_64  Claimed  Retiring  0.000   128   0+00:00:05
NoDrain1_4@xxxxxxxxxxxxxxx  LINUX  X86_64  Claimed  Retiring  0.000   128   0+00:00:05

              Total  Owner  Claimed  Unclaimed  Matched  Preempting  Drain  Backfill  BkIdle

 X86_64/LINUX     6      0        4          0        0           0      2         0       0
        Total     6      0        4          0        0           0      2         0       0
[root@test1 /]# condor_q -run

-- Schedd: test1.toddt.org : <127.0.0.1:33257?... @ 07/18/24 21:12:34
 ID    OWNER   SUBMITTED   RUN_TIME    HOST(S)
 1.0   condor  7/18 21:07  0+00:04:38  NoDrain1_1@xxxxxxxxxxxxxxx
 1.2   condor  7/18 21:07  0+00:04:38  NoDrain1_2@xxxxxxxxxxxxxxx
 1.4   condor  7/18 21:07  0+00:04:38  NoDrain1_3@xxxxxxxxxxxxxxx
 1.6   condor  7/18 21:07  0+00:04:38  NoDrain1_4@xxxxxxxxxxxxxxx
[root@test1 /]# condor_rm 1.6    # remove the job running on slot NoDrain1_4
Job 1.6 marked for removal
[root@test1 /]# condor_status    # confirm that slot NoDrain1_4 picks up a new job
Name                        OpSys  Arch    State    Activity  LoadAv  Mem   ActvtyTime

DrainMe2@xxxxxxxxxxxxxxx    LINUX  X86_64  Drained  Retiring  0.000   5941  0+00:03:00
NoDrain1@xxxxxxxxxxxxxxx    LINUX  X86_64  Drained  Retiring  0.000   5429  0+00:03:00
NoDrain1_1@xxxxxxxxxxxxxxx  LINUX  X86_64  Claimed  Retiring  0.000   128   0+00:03:00
NoDrain1_2@xxxxxxxxxxxxxxx  LINUX  X86_64  Claimed  Retiring  0.000   128   0+00:03:00
NoDrain1_3@xxxxxxxxxxxxxxx  LINUX  X86_64  Claimed  Retiring  0.000   128   0+00:03:00
NoDrain1_4@xxxxxxxxxxxxxxx  LINUX  X86_64  Claimed  Busy      0.000   128   0+00:00:03

              Total  Owner  Claimed  Unclaimed  Matched  Preempting  Drain  Backfill  BkIdle

 X86_64/LINUX     6      0        4          0        0           0      2         0       0
        Total     6      0        4          0        0           0      2         0       0
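One last note: when you want the DrainMe slots back in service (for
example, after the defragmentation has done its work, or if you
change your mind), cancelling the drain should return the startd to
normal matchmaking:

   [root@test1 /]# condor_drain -cancel test1.toddt.org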
Hope this helps; feel free to ask any additional questions.
regards,
Todd
--
Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing Department of Computer Sciences
Calendar: https://tinyurl.com/yd55mtgd 1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132 Madison, WI 53706-1685