Oops, forgot to mention. This is for:

$CondorVersion: 9.0.17 May 27 2023 BuildID: 649540 PackageID: 9.0.17-3 $
$CondorPlatform: x86_64_Rocky8 $

Martin

From: Beaumont, Martin
Hi all,

Quick question: is it normal that the job ClassAd attribute CommittedTime stays at 0, even after job completion?

After a few quick tests:

It stays at 0 with parallel and vanilla jobs while using dynamic partitionable slots.
It stays at 0 with parallel jobs without dynamic partitionable slots.
“condor_history -long” finally shows something higher than 0 with serial jobs without dynamic partitionable slots.

I’m trying to find the best way to put a time limit on long jobs: put them on Hold temporarily to let other higher priority queued jobs get their chance, and then release the long jobs so they get back in the queue. Keep in mind I have parallel and serial jobs running simultaneously on all execute nodes, so normal pre-empting across all slots doesn’t work. Also, for MPI jobs, “save points” are the responsibility of the R&D software/wrapper/user to handle. The working dir and apps are all on NFS.

So far, I came up with this configuration (timings to be confirmed):

---------------------------------------------
# Prioritization using 2 groups
NEGOTIATOR_ALLOW_QUOTA_OVERSUBSCRIPTION = True
GROUP_NAMES = low_priority, high_priority
GROUP_QUOTA_low_priority = 1
GROUP_QUOTA_high_priority = 1000000

# Force submitters to use the prioritization groups
SUBMIT_REQUIREMENT_NAMES = accountinggroup
SUBMIT_REQUIREMENT_accountinggroup = IfThenElse( AccountingGroup Isnt UNDEFINED, IfThenElse( stringListMember( AcctGroup, "low_priority, high_priority"), TRUE, FALSE), FALSE)
SUBMIT_REQUIREMENT_accountinggroup_REASON = "accounting_group must be one of: low_priority, high_priority"

# Put jobs on Hold if running longer than 2 weeks
#SYSTEM_PERIODIC_HOLD = ( RemoteWallClockTime - CumulativeSuspensionTime ) > 1209600
SYSTEM_PERIODIC_HOLD = ( RemoteUserCpu / RequestCpus ) > 1209600
#SYSTEM_PERIODIC_HOLD = ( CommittedTime - CommittedSuspensionTime ) > 1209600

# Release Held jobs every 10mins for a maximum of 5 times
SYSTEM_PERIODIC_RELEASE = (JobRunCount < 5 && (time() - EnteredCurrentStatus) > 600 )

# Finally, remove jobs that have been put in Run state 5 times
SYSTEM_PERIODIC_REMOVE = (JobRunCount == 5)
---------------------------------------------

I can’t use RemoteWallClockTime since it accumulates and does not reset during the Hold/Release process. I can’t subtract
CommittedSuspensionTime since, like CommittedTime, it stays at 0. The Cumulative* ClassAd attributes don’t seem to update during job execution. I don’t understand how to use AllowedJobDuration as they don’t show up in my jobs’ ClassAds by default. I’d like to manage this from my side (config file), not the user’s (submit file).

The best workaround I found was to use RemoteUserCpu and RequestCpus, but doing so won’t catch a bugged job that is sitting there without using CPU time.

Any suggestions?

Thanks!
Martin
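
P.S. To make the intent of the three policy expressions concrete, here is a minimal sketch in plain Python. It is not HTCondor code and does not claim to reproduce how the schedd evaluates ClassAds. The function and constant names are made up for illustration. Only the attribute names and the thresholds (1209600 s, 600 s, 5 runs) come from the configuration in this message.

```python
# Illustrative model only: plain Python, not HTCondor.
# Argument names mirror the job ClassAd attributes used in the config;
# the function and constant names are hypothetical.

TWO_WEEKS = 14 * 24 * 60 * 60   # 1209600 seconds, the hold threshold
RELEASE_DELAY = 10 * 60         # 600 seconds spent in Hold before release
MAX_RUNS = 5                    # run-count limit used by release and remove


def should_hold(remote_user_cpu, request_cpus):
    """Model of SYSTEM_PERIODIC_HOLD: per-core CPU time above two weeks."""
    return (remote_user_cpu / request_cpus) > TWO_WEEKS


def should_release(job_run_count, now, entered_current_status):
    """Model of SYSTEM_PERIODIC_RELEASE: fewer than 5 runs and more than
    10 minutes since the job entered its current (Held) status."""
    return job_run_count < MAX_RUNS and (now - entered_current_status) > RELEASE_DELAY


def should_remove(job_run_count):
    """Model of SYSTEM_PERIODIC_REMOVE: the job has reached its 5th run."""
    return job_run_count == MAX_RUNS
```

With these definitions, an 8-core job that has accumulated just over two weeks of CPU time per core gives should_hold(8 * (TWO_WEEKS + 1), 8) == True, while a stalled 8-core job that has only burned one hour of CPU gives should_hold(3600, 8) == False no matter how long it has been sitting on the slot. That second case is the gap I describe with the RemoteUserCpu workaround.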