Re: [HTCondor-users] Configuring shared resources
- Date: Mon, 7 Aug 2023 17:16:39 -0500 (CDT)
- From: Todd L Miller <tlmiller@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] Configuring shared resources
> For example "Team Bioinfo" and "Team Data" get priority access to one
> GPU server each but can use the other team's GPU server if it's not in
> use. If Team Data want to use their GPU then any of Team Bioinfo's jobs
> are evicted & requeued on their own GPU. I've read something about this
> in the past few weeks but can't see it now. Are there any good examples
> or docs on how to do this?
We call this the "condo" model. The basic idea is that you configure
the GPU servers to prefer their owners' jobs, preempting other jobs in
order to run them if necessary. Unfortunately, the configuration is
totally different depending on whether you're using static or dynamic slots.
For static slots, do something like the following on the EPs:
# Prefer jobs from team bioinfo.
RANK = (AcctGroup == "bioinfo") * 1000
# Or uncomment to prefer jobs from team data.
# RANK = (AcctGroup == "data") * 1000
# When it's time to go, it's time to go.
MAXJOBRETIREMENTTIME = 0
The default value of NEGOTIATOR_PRE_JOB_RANK includes the machine's RANK
for the job, so jobs which match either GPU machine will run on the one
that won't preempt them, if given the chance.
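As a rough illustration of how that RANK expression steers jobs, here is a toy Python model. It is not HTCondor's matchmaker: it ignores availability and everything else that goes into NEGOTIATOR_PRE_JOB_RANK, and the machine names are made up for the example.

```python
# Toy model only: mimics RANK = (AcctGroup == "<owner>") * 1000 on each machine.
# The machine names below are hypothetical.

def machine_rank(job_acct_group, owner_group):
    """Return 1000 if the job belongs to the machine's owning group, else 0."""
    return int(job_acct_group == owner_group) * 1000


def preferred_machine(job_acct_group, machine_owners):
    """Pick the machine that ranks this job highest (first entry wins ties)."""
    return max(
        machine_owners,
        key=lambda name: machine_rank(job_acct_group, machine_owners[name]),
    )


owners = {"gpu-a": "bioinfo", "gpu-b": "data"}

for group in ("bioinfo", "data"):
    print(group, "->", preferred_machine(group, owners))
```

A job from either team is ranked highest by its own team's machine, which is also the machine where no higher-ranked job can come along and preempt it.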
For dynamic slots, the above will work, but only if all of your jobs are
the same "size": dynamic slots will be preempted (kicked off if there's a
job from the owners' group) only if the new job fits. If group A's jobs
need more cores or RAM than group B's, for example, they won't kick group
B's jobs off because they can't fit in the slot. Otherwise, you can do
something like the following:
# Slot type 1 is partitionable.
use FEATURE : PartitionableSlot(1)
# Slot type 2 is partitionable.
use FEATURE : PartitionableSlot(2)
# Let this slot know if it's using resources assigned to a type-1 slot.
SLOT_TYPE_2_BACKFILL = TRUE
# Kick jobs off type-2 slots if they use any resource in use by a type-1 slot.
SLOT_TYPE_2_PREEMPT = size(ResourceConflict?:"") > 0
# Don't start a "bioinfo" job on a slot that another "bioinfo" job will preempt.
SLOT_TYPE_2_START = (AcctGroup != "bioinfo")
# When it's time to go, it's time to go.
MAXJOBRETIREMENTTIME = 0
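To see why the plain RANK approach needs same-sized jobs on dynamic slots, here is a toy Python model of the "only if the new job fits" rule described earlier. The Size type and the numbers are hypothetical, and real slots track more resources than cores and memory.

```python
from dataclasses import dataclass


@dataclass
class Size:
    """Resources requested by a job or held by a dynamic slot (illustrative)."""
    cpus: int
    memory_mb: int


def fits(job: Size, slot: Size) -> bool:
    """True if the job's request fits inside the dynamic slot as carved out."""
    return job.cpus <= slot.cpus and job.memory_mb <= slot.memory_mb


# A dynamic slot sized for a small job from the other team.
running_slot = Size(cpus=2, memory_mb=4096)

print(fits(Size(cpus=2, memory_mb=4096), running_slot))  # same size: can preempt
print(fits(Size(cpus=8, memory_mb=4096), running_slot))  # larger: cannot preempt
```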
> If I could define the team memberships via LDAP (lookups in FreeIPA?) or
> similar it would be even better!
HTCondor doesn't directly support this at the moment, but you can
use the AssignAccountingGroup feature to make sure that every job defines
the "AcctGroup" attribute:
use feature:AssignAccountingGroup(map_file_name)
where map_file_name is a file that looks something like the following:
* user_name1 bioinfo
* user_name2 data
* ".*" no_group
Pulling information out of LDAP and converting it into this form is left
as an exercise for the reader. ;)
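For the conversion half of that exercise, a minimal Python sketch might look like the following. It assumes you have already pulled a user-to-group mapping out of LDAP by some other means (the dictionary below is a stand-in), and it simply copies the line format from the example above; user names are written as-is, with no escaping.

```python
# Sketch only: render a user -> group mapping in the map-file form shown above.
# The input dict stands in for whatever your LDAP lookup returns (hypothetical).

def render_map_file(user_groups, default_group="no_group"):
    """Return map-file text: one line per user, then a catch-all line."""
    lines = [f"* {user} {group}" for user, group in sorted(user_groups.items())]
    # Catch-all entry last, in the same form as the example above.
    lines.append(f'* ".*" {default_group}')
    return "\n".join(lines) + "\n"


membership = {"user_name1": "bioinfo", "user_name2": "data"}
print(render_map_file(membership), end="")
```

Writing the result to the file named in AssignAccountingGroup, and refreshing it when memberships change, is up to you.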
-- ToddM