Re: [HTCondor-users] limit defrag to particular partitionable slots?
- Date: Thu, 18 Jul 2024 16:19:06 -0500
- From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] limit defrag to particular partitionable slots?
On 7/18/2024 1:07 PM, R Tapia wrote:
> Thanks for the reply. I saw this, but the documentation seems to
> indicate that this is a per-machine (condor_startd) decision. It
> sounds like I could use this to exclude an entire machine from
> defragmentation, but not to drain jobs in one slot while leaving the
> other slot alone.
>
> Is that right, or am I misreading the documentation?
Hi Ron,
I think you understood it correctly.
Draining is currently an operation performed on an entire EP
(startd), not on a particular slot.
However, you can configure things in such a way as to (mostly) get
what you want. When you issue a condor_drain command (or when the
defrag daemon does so), essentially two things happen:
1. Any job that has run longer than the MaxJobRetirementTime on that
   slot will be sent a SIGTERM (assuming the default 'graceful' style
   of drain).

2. By default, the Requirements expression on every slot will switch
   to False (so no new jobs will match the slot).
Imagine you want a startd with two partitionable slots (pslots), one
named "DrainMe" that allows draining and another named "NoDrain" that
does not. The trick is to configure the startd so that the NoDrain
slots have a very large MaxJobRetirementTime to negate item #1 above.
To negate item #2, when a drain command is issued, you can provide a
custom Requirements clause that allows matches to continue to happen
on pslot NoDrain (see the defrag-side sketch just below).
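An aside, since your original question was about the defrag daemon
rather than a manual condor_drain: if I remember right, recent
versions have a defrag knob, DEFRAG_DRAINING_START_EXPR, that
supplies exactly this custom start expression whenever condor_defrag
initiates a drain. A minimal untested sketch, assuming slot type 1 is
the NoDrain pslot as in the demo below (check the manual for your
version for the exact knob name and quoting):

   # condor_defrag config sketch (untested): while defrag drains a
   # startd, keep slots with SlotId 1 (the NoDrain pslot and its
   # dynamic children) matchable instead of flipping them to False.
   DEFRAG_DRAINING_START_EXPR = SlotId == 1

The demo below uses a hand-issued condor_drain instead, so you can
see each step.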
To demonstrate, I fired up the minicondor container, configured two
pslots as described above, submitted a bunch of jobs, and then issued
a drain. I saw that the jobs in the NoDrain slot were still running,
removed a running job, and observed that the NoDrain slot happily
picked up another job to run:
% docker run -it --rm --hostname test1.toddt.org htcondor/mini bash
[root@test1 /]# cat - > /etc/condor/config.d/50-todd-test
# Slot type 1 ("NoDrain"): huge MaxJobRetirementTime, so a graceful
# drain never forces its jobs off.
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = 50%
SLOT_TYPE_1_Name_Prefix = NoDrain
SLOT_TYPE_1_MaxJobRetirementTime = 999999999
SLOT_TYPE_1_PARTITIONABLE = True
# Slot type 2 ("DrainMe"): zero retirement time, so its jobs get a
# SIGTERM as soon as a drain is issued.
NUM_SLOTS_TYPE_2 = 1
SLOT_TYPE_2 = 50%
SLOT_TYPE_2_Name_Prefix = DrainMe
SLOT_TYPE_2_MaxJobRetirementTime = 0
SLOT_TYPE_2_PARTITIONABLE = True
<ctrl-d>
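As a quick sanity check before starting anything, condor_config_val
reads the config the same way the daemons will; the two knobs queried
here are just ones from the file above, and the output shown is what
one would expect:

   [root@test1 /]# condor_config_val SLOT_TYPE_1_Name_Prefix SLOT_TYPE_1_MaxJobRetirementTime
   NoDrain
   999999999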
[root@test1 /]# condor_master
[root@test1 /]# condor_status
Name                      OpSys  Arch    State      Activity  LoadAv  Mem   ActvtyTime

DrainMe2@xxxxxxxxxxxxxxx  LINUX  X86_64  Unclaimed  Idle      0.000   5941  0+00:00:05
NoDrain1@xxxxxxxxxxxxxxx  LINUX  X86_64  Unclaimed  Idle      0.000   5941  0+00:00:05

              Total  Owner  Claimed  Unclaimed  Matched  Preempting  Drain  Backfill  BkIdle

 X86_64/LINUX     2      0        0          2        0           0      0         0       0
        Total     2      0        0          2        0           0      0         0       0
[root@test1 /]# condor_status -af:h Name SlotId MaxJobRetirementTime
Name                      SlotId  MaxJobRetirementTime
DrainMe2@xxxxxxxxxxxxxxx  2       0
NoDrain1@xxxxxxxxxxxxxxx  1       999999999
[root@test1 /]# condor_submit executable=/bin/sleep arguments=1000000 -queue 500
Submitting job(s)............................................................
500 job(s) submitted to cluster 1.
[root@test1 /]# condor_status
Name                        OpSys  Arch    State      Activity  LoadAv  Mem   ActvtyTime

DrainMe2@xxxxxxxxxxxxxxx    LINUX  X86_64  Unclaimed  Idle      0.000   5429  0+00:03:45
DrainMe2_1@xxxxxxxxxxxxxxx  LINUX  X86_64  Claimed    Busy      0.000   128   0+00:01:05
DrainMe2_2@xxxxxxxxxxxxxxx  LINUX  X86_64  Claimed    Busy      0.000   128   0+00:01:05
DrainMe2_3@xxxxxxxxxxxxxxx  LINUX  X86_64  Claimed    Busy      0.000   128   0+00:01:04
DrainMe2_4@xxxxxxxxxxxxxxx  LINUX  X86_64  Claimed    Busy      0.000   128   0+00:01:04
NoDrain1@xxxxxxxxxxxxxxx    LINUX  X86_64  Unclaimed  Idle      0.000   5429  0+00:03:45
NoDrain1_1@xxxxxxxxxxxxxxx  LINUX  X86_64  Claimed    Busy      0.000   128   0+00:01:05
NoDrain1_2@xxxxxxxxxxxxxxx  LINUX  X86_64  Claimed    Busy      0.000   128   0+00:01:05
NoDrain1_3@xxxxxxxxxxxxxxx  LINUX  X86_64  Claimed    Busy      0.000   128   0+00:01:05
NoDrain1_4@xxxxxxxxxxxxxxx  LINUX  X86_64  Claimed    Busy      0.000   128   0+00:01:04

              Total  Owner  Claimed  Unclaimed  Matched  Preempting  Drain  Backfill  BkIdle

 X86_64/LINUX    10      0        8          2        0           0      0         0       0
        Total    10      0        8          2        0           0      0         0       0
[root@test1 /]# condor_drain -graceful -start 'SlotId == 1 ? True : False' test1.toddt.org
Sent request to drain the startd
<127.0.0.1:9618?addrs=127.0.0.1-9618&alias=test1.toddt.org&noUDP&sock=startd_27_698d>
with test1.toddt.org. This only affects the single startd; any other
startds running on the same host will not be drained.
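An aside on that -start expression: 'SlotId == 1 ? True : False' can
be written more simply as 'SlotId == 1', and since the expression is
evaluated against each slot's ad, matching on the slot Name should
work as well. An untested sketch, with the regexp pattern purely
illustrative:

   # Hypothetical alternative (untested): keep any slot whose name
   # begins with "NoDrain" matchable during the drain.
   [root@test1 /]# condor_drain -graceful -start 'regexp("^NoDrain", Name)' test1.toddt.org

Anyway, back to the demo: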
[root@test1 /]# condor_status
Name                        OpSys  Arch    State    Activity  LoadAv  Mem   ActvtyTime

DrainMe2@xxxxxxxxxxxxxxx    LINUX  X86_64  Drained  Retiring  0.000   5941  0+00:00:05
NoDrain1@xxxxxxxxxxxxxxx    LINUX  X86_64  Drained  Retiring  0.000   5429  0+00:00:05
NoDrain1_1@xxxxxxxxxxxxxxx  LINUX  X86_64  Claimed  Retiring  0.000   128   0+00:00:05
NoDrain1_2@xxxxxxxxxxxxxxx  LINUX  X86_64  Claimed  Retiring  0.000   128   0+00:00:05
NoDrain1_3@xxxxxxxxxxxxxxx  LINUX  X86_64  Claimed  Retiring  0.000   128   0+00:00:05
NoDrain1_4@xxxxxxxxxxxxxxx  LINUX  X86_64  Claimed  Retiring  0.000   128   0+00:00:05

              Total  Owner  Claimed  Unclaimed  Matched  Preempting  Drain  Backfill  BkIdle

 X86_64/LINUX     6      0        4          0        0           0      2         0       0
        Total     6      0        4          0        0           0      2         0       0
[root@test1 /]# condor_q -run

-- Schedd: test1.toddt.org : <127.0.0.1:33257?... @ 07/18/24 21:12:34
 ID    OWNER   SUBMITTED   RUN_TIME    HOST(S)
 1.0   condor  7/18 21:07  0+00:04:38  NoDrain1_1@xxxxxxxxxxxxxxx
 1.2   condor  7/18 21:07  0+00:04:38  NoDrain1_2@xxxxxxxxxxxxxxx
 1.4   condor  7/18 21:07  0+00:04:38  NoDrain1_3@xxxxxxxxxxxxxxx
 1.6   condor  7/18 21:07  0+00:04:38  NoDrain1_4@xxxxxxxxxxxxxxx
[root@test1 /]# condor_rm 1.6    # remove the job running on slot NoDrain1_4
Job 1.6 marked for removal
[root@test1 /]# condor_status    # confirm that slot NoDrain1_4 picks up a new job
Name                        OpSys  Arch    State    Activity  LoadAv  Mem   ActvtyTime

DrainMe2@xxxxxxxxxxxxxxx    LINUX  X86_64  Drained  Retiring  0.000   5941  0+00:03:00
NoDrain1@xxxxxxxxxxxxxxx    LINUX  X86_64  Drained  Retiring  0.000   5429  0+00:03:00
NoDrain1_1@xxxxxxxxxxxxxxx  LINUX  X86_64  Claimed  Retiring  0.000   128   0+00:03:00
NoDrain1_2@xxxxxxxxxxxxxxx  LINUX  X86_64  Claimed  Retiring  0.000   128   0+00:03:00
NoDrain1_3@xxxxxxxxxxxxxxx  LINUX  X86_64  Claimed  Retiring  0.000   128   0+00:03:00
NoDrain1_4@xxxxxxxxxxxxxxx  LINUX  X86_64  Claimed  Busy      0.000   128   0+00:00:03

              Total  Owner  Claimed  Unclaimed  Matched  Preempting  Drain  Backfill  BkIdle

 X86_64/LINUX     6      0        4          0        0           0      2         0       0
        Total     6      0        4          0        0           0      2         0       0
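One last note: when you want the DrainMe slots back in service (for
example, after the defragmentation has done its work, or if you
change your mind), cancelling the drain should return the startd to
normal matchmaking:

   [root@test1 /]# condor_drain -cancel test1.toddt.org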
Hope this helps; feel free to ask any additional questions.
regards,
Todd
--
Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing Department of Computer Sciences
Calendar: https://tinyurl.com/yd55mtgd 1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132 Madison, WI 53706-1685