Re: [HTCondor-users] peaceful DEFRAG_SCHEDULE?
- Date: Thu, 02 Sep 2021 09:55:35 -0500
- From: Greg Thain <gthain@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] peaceful DEFRAG_SCHEDULE?
On 9/2/21 2:07 AM, Carsten Aulbert wrote:
which basically means, I could achieve a "peaceful" defrag if I simply
set MaxJobRetirementTime to near infinite, right?
That's correct.
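For instance, a minimal sketch of that (untested; the value is just an arbitrarily large number of seconds, not a recommendation):

MaxJobRetirementTime = 10 * 365 * 24 * 60 * 60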
In the end, I am still torn between having users with extremely long
running jobs (many cores for 10-20 days) and users wanting condor to
finally match their few-hour 80+ core jobs, for which I presumably
need condor_defrag to free up slots that large.
Then just four quick questions (as these are deviating more and more
from the original question asked, I can write up additional emails for
the list/archives if wanted):
(1) The only middle ground with condor_defrag I currently see is that
we take a number of large core-count machines, configure
MaxJobRetirementTime to something we consider reasonable (along with
MaxVacateTime for the few jobs which would react to that), let
condor_defrag act only on these machines, and add a START expression
so that they only accept jobs which set a certain flag/setting in the
submit file - just to prevent very long user jobs from matching there.
Note that MaxJobRetirementTime is an expression that can look at
attributes of the running job on the machine. Here in the CHTC at
Wisconsin, we allow users to declare (with a +LongJob=true custom
attribute in their job submit file) that their job may need
longer-than-usual walltime, and MaxJobRetirementTime honors that. As a
tradeoff, there are a lot of machines these jobs won't match with.
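A rough sketch of what such a policy could look like (the retirement
times here are made up for illustration, and LongJob is the custom
attribute the job declares with +LongJob = true in its submit file):

# Allow declared long jobs ~30 days of retirement, everything else one day
MaxJobRetirementTime = ifThenElse(TARGET.LongJob =?= true, \
                                  30 * 24 * 60 * 60, \
                                  24 * 60 * 60)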
(2) Is there a way for condor_defrag to discover if a parallel
universe job is running on a partitionable slot and then NOT consider
it suitable for defrag? As DEFRAG_REQUIREMENTS "only" matches against
the startd, I don't think it can look into the slots partitioned off -
or can it?
I haven't tested this, but out of the box, the JobUniverse of a running
job is advertised in the dynamic slot, but not the partitionable slot
today. Some attributes of the dynamic slots are "rolled up" into the
partitionable slot as a classad array named childXXX.
There is a startd knob, STARTD_PARTITIONABLE_SLOT_ATTRS, which adds
attributes to this set. I think you could add
STARTD_PARTITIONABLE_SLOT_ATTRS = JobUniverse
and then the partitionable slot would get a classad array named
childJobUniverse, containing the JobUniverse value of each of the
dynamic slots. You could then use this attribute in the defrag
requirements expression.
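For example, together with the STARTD_PARTITIONABLE_SLOT_ATTRS setting
above, something like this (untested - a sketch only, and keep
whatever clauses your existing DEFRAG_REQUIREMENTS already has;
universe number 11 is the parallel universe):

# Only defrag partitionable slots with no parallel-universe job running
DEFRAG_REQUIREMENTS = PartitionableSlot && \
    (ChildJobUniverse is undefined || !member(11, ChildJobUniverse))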
(4) Is there a downside to using "use feature : GPUs" on non-GPU nodes?
As we have a mix of GPU and non-GPU hosts, right now writing
condor_status constraints is much more cumbersome if you need to allow
for nodes not having TotalGpus set. Doing this on a test node has not
really shown much of a downside, but maybe there is something hidden
which could become a roadblock later on?
The only downside to having "use feature: GPUs" on at all times that
I've seen is when you have machines with GPUs that you don't really want
to use, or can't use (like desktops with onboard GPUs).
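For the record, the constraint difference in question looks something
like this (a rough illustration, assuming the GPUs feature advertises
TotalGpus = 0 on hosts where no GPUs are detected):

# Mixed pool, feature only on GPU nodes: must guard against undefined
condor_status -constraint '(TotalGpus =?= undefined) || (TotalGpus == 0)'
# Feature enabled everywhere: TotalGpus is always defined
condor_status -constraint 'TotalGpus == 0'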
-greg