
Re: [HTCondor-users] peaceful DEFRAG_SCHEDULE?



Hi Greg, all

(embarrassingly only replying now after forgetting about it for too long)

On 16.08.21 17:30, Greg Thain wrote:
> SIGTERM won't be sent to any job whose runtime is less than MaxJobRetirementTime with a "graceful" shutdown/drain.

Which basically means I could achieve a "peaceful" defrag if I simply set MaxJobRetirementTime to near infinite, right?
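Concretely, roughly what I have in mind on those startds (untested; the value is simply "one year" in seconds):

  # effectively "never preempt a running job"
  MaxJobRetirementTime = 365 * 24 * 60 * 60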

In the end, I am still torn between users with extremely long-running jobs (many cores for 10-20 days) and users who want Condor to finally match their few-hour 80+ core jobs - for which I presumably need condor_defrag to free up slots of that size.

Then just four quick questions (as these deviate more and more from the original question asked, I can write up additional emails for the list/archives if wanted):

(1) The only middle ground with condor_defrag I currently see: take a number of large core-count machines, configure MaxJobRetirementTime to something we consider reasonable (along with MaxVacateTime for the few jobs which would react to it), let condor_defrag act only on these machines, and add a START expression so that they accept only jobs which set a certain flag in the submit file - just to prevent very long user jobs from matching there. A rough sketch follows below.
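Untested, and the attribute/flag names (DefragEligible, ShortRunningJob) are made up, but something along these lines:

  # on the designated large core-count machines:
  MaxJobRetirementTime = 24 * 60 * 60
  MaxVacateTime = 600
  DefragEligible = True
  STARTD_ATTRS = $(STARTD_ATTRS) DefragEligible
  # accept only jobs that declare themselves short-running
  START = ($(START)) && (TARGET.ShortRunningJob =?= True)

  # on the condor_defrag host: drain only those machines
  DEFRAG_REQUIREMENTS = PartitionableSlot && DefragEligible =?= True

with users putting "+ShortRunningJob = True" into their submit files.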

(2) Is there a way for condor_defrag to discover that a parallel universe job is running on a partitionable slot and then NOT consider that slot suitable for defrag? As DEFRAG_REQUIREMENTS "only" matches against the startd ad, I don't think it can look into the slots partitioned off - or can it?
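I guess a first step would be to check what the partitionable slot ad actually rolls up about its children, e.g. with

  condor_status -const PartitionableSlot -long | grep -i child

to see whether anything universe-related shows up there at all.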

(3) Only slightly related: when performing changes on the nodes/Condor configuration, we usually let a node run dry via condor_off -peaceful -startd. That works; however, since some jobs run for a really long time, we sometimes simply lose track of these nodes, and they vanish from condor_status once the startd has shut down. Is there already a mechanism by which such nodes could send a notification once they are empty?

Or would it be better to set START=FALSE via condor_config_val instead and then monitor for nodes where TotalCpus==Cpus?
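Roughly what I am picturing ("somenode" being a placeholder, and assuming runtime configuration is enabled on the startds):

  # stop matching new jobs, but keep the startd in the pool
  condor_config_val -name somenode -startd -set "START = False"
  condor_reconfig somenode

  # later: list nodes which have run completely dry
  condor_status -const 'PartitionableSlot && Cpus == TotalCpus' -af Machine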

(4) Is there a downside to using "use FEATURE : GPUs" on non-GPU nodes? As we have a mix of GPU and non-GPU hosts, writing condor_status constraints is currently much more cumbersome, since you have to allow for nodes which do not have TotalGpus set at all. Doing this on a test node has not really shown any downside, but maybe there is something hidden which could become a roadblock later on?

The goal is to get around the "undefined" for something as simple as
  condor_status -const PartitionableSlot -af 'TotalGpus == 0' | sort | uniq
false
true
undefined
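For completeness: on the query side the ClassAd "?:" operator should paper over the undefined (if I remember its semantics correctly), e.g.

  condor_status -const PartitionableSlot -af '(TotalGpus ?: 0) == 0' | sort | uniq

but that obviously does not make writing the constraints themselves any nicer.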

Cheers

Carsten

--
Dr. Carsten Aulbert, Max Planck Institute for Gravitational Physics,
Callinstraße 38, 30167 Hannover, Germany, Phone +49 511 762 17185

