
Re: [HTCondor-users] limit defrag to particular partitionable slots?



On 7/18/2024 1:07 PM, R Tapia wrote:

Thanks for the reply. I saw this, but the documentation seems to indicate that this is a per-machine (condor_startd) decision. It sounds like I could use this to exclude an entire machine from defragmentation, but not to drain jobs in one slot while leaving the other slot alone.

Is that right or am I misreading the documentation?

Hi Ron,

I think you understood it correctly.

Draining is currently an operation performed on an entire EP (startd), not on a particular slot. 

However, you can configure things in such a way as to (mostly) get what you want.  When you issue a condor_drain command (or when the defrag daemon does so), essentially two things happen:

  1. Any job that has run longer than the MaxJobRetirementTime on that slot will be sent a SIGTERM (assuming the default 'graceful' style of a drain)
  2. By default, the Requirements expression on every slot will switch to False (so no new jobs will match the slot)

Imagine you want a startd with two partitionable slots (pslots), one named "DrainMe" that allows draining and another named "NoDrain" that does not.  The trick is to configure the startd so that the NoDrain slot has a very large MaxJobRetirementTime to negate item #1 above.  To negate item #2, when a drain command is issued you can provide a custom start expression (the -start option of condor_drain) that allows matches to continue to happen on pslot NoDrain.
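
As an aside, instead of selecting slots by SlotId as I do in the session below, the custom start expression could presumably also pick out the NoDrain slots by name, since their Name starts with the Name_Prefix configured for that slot type.  Untested sketch on my part, but it would look roughly like:

  % condor_drain -graceful -start 'regexp("^NoDrain", Name)' test1.toddt.org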

To demonstrate, I fired up the minicondor container, configured two pslots as described above, submitted a bunch of jobs, then issued a drain, saw that the jobs on the NoDrain slot were still running, removed a running job, and observed that the NoDrain slot happily picked up another job to run:

% docker run -it --rm --hostname test1.toddt.org htcondor/mini bash

[root@test1 /]# cat - > /etc/condor/config.d/50-todd-test

NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = 50%
SLOT_TYPE_1_Name_Prefix = NoDrain
SLOT_TYPE_1_MaxJobRetirementTime = 999999999
SLOT_TYPE_1_PARTITIONABLE = True

NUM_SLOTS_TYPE_2 = 1
SLOT_TYPE_2 = 50%
SLOT_TYPE_2_Name_Prefix = DrainMe
SLOT_TYPE_2_MaxJobRetirementTime = 0
SLOT_TYPE_2_PARTITIONABLE = True
<ctrl-d>

[root@test1 /]# condor_master


[root@test1 /]# condor_status

Name                     OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime

DrainMe2@xxxxxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000 5941  0+00:00:05
NoDrain1@xxxxxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000 5941  0+00:00:05

               Total Owner Claimed Unclaimed Matched Preempting  Drain Backfill BkIdle

  X86_64/LINUX     2     0       0         2       0          0      0        0      0

         Total     2     0       0         2       0          0      0        0      0

[root@test1 /]# condor_status -af:h Name SlotId MaxJobRetirementTime

Name                     SlotId MaxJobRetirementTime
DrainMe2@xxxxxxxxxxxxxxx 2      0
NoDrain1@xxxxxxxxxxxxxxx 1      999999999

[root@test1 /]# condor_submit executable=/bin/sleep arguments=1000000 -queue 500
Submitting job(s)....................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
500 job(s) submitted to cluster 1.

[root@test1 /]# condor_status
Name                       OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime

DrainMe2@xxxxxxxxxxxxxxx   LINUX      X86_64 Unclaimed Idle      0.000 5429  0+00:03:45
DrainMe2_1@xxxxxxxxxxxxxxx LINUX      X86_64 Claimed   Busy      0.000  128  0+00:01:05
DrainMe2_2@xxxxxxxxxxxxxxx LINUX      X86_64 Claimed   Busy      0.000  128  0+00:01:05
DrainMe2_3@xxxxxxxxxxxxxxx LINUX      X86_64 Claimed   Busy      0.000  128  0+00:01:04
DrainMe2_4@xxxxxxxxxxxxxxx LINUX      X86_64 Claimed   Busy      0.000  128  0+00:01:04
NoDrain1@xxxxxxxxxxxxxxx   LINUX      X86_64 Unclaimed Idle      0.000 5429  0+00:03:45
NoDrain1_1@xxxxxxxxxxxxxxx LINUX      X86_64 Claimed   Busy      0.000  128  0+00:01:05
NoDrain1_2@xxxxxxxxxxxxxxx LINUX      X86_64 Claimed   Busy      0.000  128  0+00:01:05
NoDrain1_3@xxxxxxxxxxxxxxx LINUX      X86_64 Claimed   Busy      0.000  128  0+00:01:05
NoDrain1_4@xxxxxxxxxxxxxxx LINUX      X86_64 Claimed   Busy      0.000  128  0+00:01:04

               Total Owner Claimed Unclaimed Matched Preempting  Drain Backfill BkIdle

  X86_64/LINUX    10     0       8         2       0          0      0        0      0

         Total    10     0       8         2       0          0      0        0      0

[root@test1 /]# condor_drain -graceful -start 'SlotId == 1 ? True : False' test1.toddt.org
Sent request to drain the startd <127.0.0.1:9618?addrs=127.0.0.1-9618&alias=test1.toddt.org&noUDP&sock=startd_27_698d> with test1.toddt.org. This only affects the single startd; any other startds running on the same host will not be drained.

[root@test1 /]# condor_status
Name                       OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime

DrainMe2@xxxxxxxxxxxxxxx   LINUX      X86_64 Drained   Retiring  0.000 5941  0+00:00:05
NoDrain1@xxxxxxxxxxxxxxx   LINUX      X86_64 Drained   Retiring  0.000 5429  0+00:00:05
NoDrain1_1@xxxxxxxxxxxxxxx LINUX      X86_64 Claimed   Retiring  0.000  128  0+00:00:05
NoDrain1_2@xxxxxxxxxxxxxxx LINUX      X86_64 Claimed   Retiring  0.000  128  0+00:00:05
NoDrain1_3@xxxxxxxxxxxxxxx LINUX      X86_64 Claimed   Retiring  0.000  128  0+00:00:05
NoDrain1_4@xxxxxxxxxxxxxxx LINUX      X86_64 Claimed   Retiring  0.000  128  0+00:00:05

               Total Owner Claimed Unclaimed Matched Preempting  Drain Backfill BkIdle

  X86_64/LINUX     6     0       4         0       0          0      2        0      0

         Total     6     0       4         0       0          0      2        0      0

[root@test1 /]# condor_q -run

-- Schedd: test1.toddt.org : <127.0.0.1:33257?... @ 07/18/24 21:12:34
 ID      OWNER            SUBMITTED     RUN_TIME HOST(S)
   1.0   condor          7/18 21:07   0+00:04:38 NoDrain1_1@xxxxxxxxxxxxxxx
   1.2   condor          7/18 21:07   0+00:04:38 NoDrain1_2@xxxxxxxxxxxxxxx
   1.4   condor          7/18 21:07   0+00:04:38 NoDrain1_3@xxxxxxxxxxxxxxx
   1.6   condor          7/18 21:07   0+00:04:38 NoDrain1_4@xxxxxxxxxxxxxxx

[root@test1 /]# condor_rm 1.6   # remove the job running on slot NoDrain1_4
Job 1.6 marked for removal

[root@test1 /]# condor_status   # confirm that slot NoDrain1_4 picks up a new job
Name                       OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime

DrainMe2@xxxxxxxxxxxxxxx   LINUX      X86_64 Drained   Retiring  0.000 5941  0+00:03:00
NoDrain1@xxxxxxxxxxxxxxx   LINUX      X86_64 Drained   Retiring  0.000 5429  0+00:03:00
NoDrain1_1@xxxxxxxxxxxxxxx LINUX      X86_64 Claimed   Retiring  0.000  128  0+00:03:00
NoDrain1_2@xxxxxxxxxxxxxxx LINUX      X86_64 Claimed   Retiring  0.000  128  0+00:03:00
NoDrain1_3@xxxxxxxxxxxxxxx LINUX      X86_64 Claimed   Retiring  0.000  128  0+00:03:00
NoDrain1_4@xxxxxxxxxxxxxxx LINUX      X86_64 Claimed   Busy      0.000  128  0+00:00:03

               Total Owner Claimed Unclaimed Matched Preempting  Drain Backfill BkIdle

  X86_64/LINUX     6     0       4         0       0          0      2        0      0

         Total     6     0       4         0       0          0      2        0      0
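
Finally, coming back to your original defrag question: when it is the condor_defrag daemon issuing the drains rather than a manual condor_drain, I believe recent HTCondor versions have a configuration knob that lets the defrag daemon pass a custom start expression along with its drain requests, so the NoDrain pslot stays matchable.  Off the top of my head (untested; please double-check the knob name and any quoting requirements in the Admin Manual for your version), the config on the machine running condor_defrag would look something like:

  # Untested sketch: keep pslot 1 (NoDrain) matchable while condor_defrag drains
  DEFRAG_DRAINING_START_EXPR = SlotId == 1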



Hope this helps; feel free to ask any additional questions.
regards,
Todd

-- 
Todd Tannenbaum <tannenba@xxxxxxxxxxx>  University of Wisconsin-Madison
Center for High Throughput Computing    Department of Computer Sciences
Calendar: https://tinyurl.com/yd55mtgd  1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                   Madison, WI 53706-1685