
Re: [HTCondor-users] peaceful DEFRAG_SCHEDULE?



Hi Greg, all

(embarrassingly only replying now after forgetting about it for too long)

On 16.08.21 17:30, Greg Thain wrote:
> SIGTERM won't be sent to any job whose runtime is less than MaxJobRetirementTime with a "graceful" shutdown/drain.

Which basically means I could achieve a "peaceful" defrag if I simply set MaxJobRetirementTime to near infinite, right?
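Concretely, roughly what I have in mind on those startds (untested; the value is simply "one year" in seconds):

  # effectively "never preempt a running job"
  MaxJobRetirementTime = 365 * 24 * 60 * 60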

In the end, I am still torn between users with extremely long-running jobs (many cores for 10-20 days) and users who want Condor to finally match their few-hour 80+ core jobs - for which I presumably need condor_defrag to free up slots of that size.

Then just four quick questions (as these deviate more and more from the original question asked, I can write up additional emails for the list/archives if wanted):

(1) The only middle ground with condor_defrag I currently see: take a number of large core-count machines, configure MaxJobRetirementTime to something we consider reasonable (along with MaxVacateTime for the few jobs which would react to it), let condor_defrag act only on these machines, and add a START expression so that they accept only jobs which set a certain flag in the submit file - just to prevent very long user jobs from matching there. A rough sketch follows below.
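Untested, and the attribute/flag names (DefragEligible, ShortRunningJob) are made up, but something along these lines:

  # on the designated large core-count machines:
  MaxJobRetirementTime = 24 * 60 * 60
  MaxVacateTime = 600
  DefragEligible = True
  STARTD_ATTRS = $(STARTD_ATTRS) DefragEligible
  # accept only jobs that declare themselves short-running
  START = ($(START)) && (TARGET.ShortRunningJob =?= True)

  # on the condor_defrag host: drain only those machines
  DEFRAG_REQUIREMENTS = PartitionableSlot && DefragEligible =?= True

with users putting "+ShortRunningJob = True" into their submit files.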

(2) Is there a way for condor_defrag to discover that a parallel universe job is running on a partitionable slot and then NOT consider that slot suitable for defrag? As DEFRAG_REQUIREMENTS "only" matches against the startd ad, I don't think it can look into the slots partitioned off - or can it?
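I guess a first step would be to check what the partitionable slot ad actually rolls up about its children, e.g. with

  condor_status -const PartitionableSlot -long | grep -i child

to see whether anything universe-related shows up there at all.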

(3) Only slightly related: when performing changes on the nodes/Condor configuration, we usually let a node run dry via condor_off -peaceful -startd. That works; however, since some jobs run for a really long time, we sometimes simply lose track of these nodes, and they vanish from condor_status once the startd has shut down. Is there already a mechanism by which such nodes could send a notification once they are empty?

Or would it be better to set START=FALSE via condor_config_val instead and then monitor for nodes where TotalCpus==Cpus?
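Roughly what I am picturing ("somenode" being a placeholder, and assuming runtime configuration is enabled on the startds):

  # stop matching new jobs, but keep the startd in the pool
  condor_config_val -name somenode -startd -set "START = False"
  condor_reconfig somenode

  # later: list nodes which have run completely dry
  condor_status -const 'PartitionableSlot && Cpus == TotalCpus' -af Machine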

(4) Is there a downside to using "use FEATURE : GPUs" on non-GPU nodes? As we have a mix of GPU and non-GPU hosts, writing condor_status constraints is currently much more cumbersome, since you have to allow for nodes which do not have TotalGpus set at all. Doing this on a test node has not really shown any downside, but maybe there is something hidden which could become a roadblock later on?

The goal is to get around the "undefined" for something as simple as
  condor_status -const PartitionableSlot -af 'TotalGpus == 0' | sort | uniq
false
true
undefined
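For completeness: on the query side the ClassAd "?:" operator should paper over the undefined (if I remember its semantics correctly), e.g.

  condor_status -const PartitionableSlot -af '(TotalGpus ?: 0) == 0' | sort | uniq

but that obviously does not make writing the constraints themselves any nicer.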

Cheers

Carsten

--
Dr. Carsten Aulbert, Max Planck Institute for Gravitational Physics,
Callinstraße 38, 30167 Hannover, Germany, Phone +49 511 762 17185

