
Re: [HTCondor-users] Limiting jobs to two days



Running 25.7.2-1+ubu24 on Ubuntu 24 on the central manager, access point, and execute node.

I set a test job to 60 seconds and it's been running over an hour:

ActivationDuration = 5
ActivationExecutionDuration = 5
ActivationSetupDuration = 0
ActivationTeardownDuration = 0
AllowedExecuteDuration = 60
Args = "--cpu --onsig"
AuthTokenId = "da90ddd6609bb3f39bd86b7caf08dc30"
AuthTokenIssuer = "condor-mgr-lts.nmrbox.org"
AuthTokenSubject = "gweatherby@xxxxxxxxxx"
AutoClusterAttrs = "FirstUpdateUptimeGPUsSeconds,LastUpdateUptimeGPUsSeconds,Production,RequestCpus,RequestDisk,RequestGPUs,RequestMemory,StartOfJobUptimeGPUsSeconds,UptimeGPUsSeconds,ConcurrencyLimits,FlockTo,Rank,Requirements,ChtcProjects,ContainerImageSource,DockerImage,GlideinFactory,GPUs_Capability,GPUs_DeviceName,GPUs_DriverVersion,GPUs_GlobalMemoryMb,GPUs_MaxSupportedVersion,InteractiveJob,is_resumable,IsBuildJob,LongJob,Owner,PreventJobsReason,PrioritizedProjects,profiling,ProjectName,RequestIoHeavy,want_campus_pools,want_ospool,WantFlocking,WantGlidein,GPUJobLength,WantGPULab,HasRaddusHtcCephFS,FileSystemDomain,Machine,TransferInputSizeMB"
AutoClusterId = 3
BlockReadKbytes = 0
BlockReads = 0
BlockWriteKbytes = 0
BlockWrites = 0
BytesRecvd = 0.0
BytesSent = 0.0
ChtcProjects = undefined
ClusterId = 995417
Cmd = "/home/nmrbox/gweatherby/condor/signal_catcher.py"
CommittedSlotTime = 0
CommittedSuspensionTime = 0
CommittedTime = 0
CondorPlatform = "$CondorPlatform: X86_64-Ubuntu_24.04 $"
CondorVersion = "$CondorVersion: 25.7.2 2026-03-11 BuildID: 881773 PackageID: 25.7.2-1+ubu24 GitSHA: 433de0b3 $"
CpusProvisioned = 1
CpusUsage = 1.000184786082713
CumulativeRemoteSysCpu = 0.0
CumulativeRemoteUserCpu = 6067.0
CumulativeSlotTime = 2761.0
CumulativeSuspensionTime = 0
CurrentHosts = 1
DiskProvisioned = 1048576
DiskUsage = 2
DiskUsage_RAW = 2
EnteredCurrentStatus = 1774536434 (2026-03-26 10:47:14)
Environment = ""
Err = "/dev/null"
ExecutableSize = 2
ExecutableSize_RAW = 2
ExecuteDirWasEncrypted = false
ExitBySignal = false
ExitCode = 15
ExitReason = "died on signal 9 (Killed)"
ExitSignal = 9
ExitStatus = 0
FileSystemDomain = "nmrbox.org"
FirstJobMatchDate = 1774533302 (2026-03-26 09:55:02)
GlobalJobId = "condor-ap-lts.nmrbox.org#995417.0#1774533301"
GPUsProvisioned = 0
ImageSize = 7500
ImageSize_RAW = 5416
In = "/dev/null"
InitialWaitDuration = 1
Iwd = "/home/nmrbox/gweatherby/condor"
JobCurrentReconnectAttempt = undefined
JobCurrentStartDate = 1774536434 (2026-03-26 10:47:14)
JobCurrentStartExecutingDate = 1774536434 (2026-03-26 10:47:14)
JobLastStartDate = 1774536233 (2026-03-26 10:43:53)
JobLeaseDuration = 2400
JobNotification = 0
JobPrio = 0
JobRunCount = 3
JobStartDate = 1774533302 (2026-03-26 09:55:02)
JobStatus = 2
JobSubmitFile = "jobsig"
JobSubmitMethod = 0
JobUniverse = 5
LastJobLeaseRenewal = 1774540385 (2026-03-26 11:53:05)
LastJobStatus = 1
LastMatchTime = 1774536434 (2026-03-26 10:47:14)
LastPublicClaimId = "<155.37.253.100:9618?addrs=155.37.253.100-9618&alias=argon.nmrbox.org&noUDP&sock=startd_58990_cfd8>#1774536063#1#..."
LastRejMatchNegotiator = "condor-mgr-lts.nmrbox.org"
LastRejMatchReason = "no match found"
LastRejMatchTime = 1774536434 (2026-03-26 10:47:14)
LastRemoteHost = "slot1_1@xxxxxxxxxxxxxxxx"
LastRemoteWallClockTime = 6.0
LastSuspensionTime = 0
LastVacateTime = 1774536057 (2026-03-26 10:40:57)
LeaveJobInQueue = false
MachineAttrCpus0 = 1
MachineAttrSlotWeight0 = 1
MaxHosts = 1
MemoryProvisioned = 2048
MemoryUsage = ((ResidentSetSize + 1023) / 1024)
MinHosts = 1
MyType = "Job"
NumCkpts = 0
NumCkpts_RAW = 0
NumJobCompletions = 0
NumJobMatches = 3
NumJobStarts = 3
NumRestarts = 0
NumShadowStarts = 3
NumSystemHolds = 0
NumVacates = 2
NumVacatesByReason = [ StartdShutdown = 1; StartdPreemptExpression = 1 ]
OrigMaxHosts = 1
Out = "/home/nmrbox/gweatherby/condor/sig_995417.out"
Owner = "gweatherby"
ProcId = 0
Production = false
ProjectName = "NMRbox"
PublicClaimId = "<155.37.253.100:9618?addrs=155.37.253.100-9618&alias=argon.nmrbox.org&noUDP&sock=startd_71281_6a63>#1774536390#1#..."
QDate = 1774533301 (2026-03-26 09:55:01)
Rank = 0.0
RecentBlockReadKbytes = 0
RecentBlockReads = 0
RecentBlockWriteKbytes = 0
RecentBlockWrites = 0
RecentStatsLifetimeStarter = 1200
RemoteHost = "slot1_1@xxxxxxxxxxxxxxxx"
RemoteSlotID = 1
RemoteSysCpu = 0.0
RemoteUserCpu = 3311.0
RemoteWallClockTime = 2761.0
RequestCpus = 1
RequestDisk = MAX({ 1024,(TransferInputSizeMB + 1) * 1.25 }) * 1024
RequestMemory = 2048
Requirements = ((Machine == "argon.nmrbox.org")) && (TARGET.Arch == "X86_64") && (TARGET.OpSys == "LINUX") && (TARGET.Disk >= RequestDisk) && (TARGET.Memory >= RequestMemory) && ((TARGET.FileSystemDomain == MY.FileSystemDomain) || (TARGET.HasFileTransfer))
ResidentSetSize = 7500
ResidentSetSize_RAW = 5416
ServerTime = 1774540508 (2026-03-26 11:55:08)
ShadowBday = 1774536434 (2026-03-26 10:47:14)
ShouldTransferFiles = "IF_NEEDED"
StartdIpAddr = "<155.37.253.100:9618?addrs=155.37.253.100-9618&alias=argon.nmrbox.org&noUDP&sock=startd_71281_6a63>"
StartdPrincipal = "execute-side@matchsession/155.37.253.100"
StatsLifetimeStarter = 3315
StreamOut = false
TargetType = "Machine"
TotalSubmitProcs = 1
TotalSuspensions = 0
TransferErr = false
TransferIn = false
TransferInputFileCounts = [ CEDAR = 1 ]
TransferInputSizeMB = 1
TransferInputStats = [  ]
TransferOutputStats = [  ]
User = "gweatherby@xxxxxxxxxx"
UserLog = "/home/nmrbox/gweatherby/condor/sig_995417.log"
WhenToTransferOutput = "ON_EXIT"



From: Cole Bollig <cabollig@xxxxxxxx>
Date: Thursday, March 26, 2026 at 11:13 AM
To: Weatherby,Gerard <gweatherby@xxxxxxxx>, HTCondor Users <htcondor-users@xxxxxxxxxxx>
Subject: Re: Limiting jobs to two days

*** Attention: This is an external email. ***
Use caution responding, opening attachments or clicking on links.
 
Hi Gerard,

One possible reason is that AllowedExecuteDuration only applies to actual job execution time, not the entire running state, so input file transfer time is not counted toward the enforcement. This matters if you want to limit the total time a job spends on an EP as opposed to the total time the job is allowed to execute. EPs that don't know how to use this functionality shouldn't be a factor unless you have EPs running versions older than v9.4.1 or v9.5.0.
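For reference, the limit can also be requested per job in the submit description. A minimal sketch (the executable name is a placeholder borrowed from the ad above; the limit maps to the AllowedExecuteDuration job attribute):

```
# hypothetical submit description fragment
executable               = signal_catcher.py
# cap a single execution attempt at 60 seconds
allowed_execute_duration = 60
queue
```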

-Cole Bollig

From: Weatherby,Gerard <gweatherby@xxxxxxxx>
Sent: Thursday, March 26, 2026 9:49 AM
To: Cole Bollig <cabollig@xxxxxxxx>; HTCondor Users <htcondor-users@xxxxxxxxxxx>
Subject: Re: Limiting jobs to two days
 
Hi Cole,

That works great. AllowedExecuteDuration is being set as desired.

Our cluster does not seem to be consistently limiting the jobs to the specified duration. Are there value(s) that have to be set for HTCondor to monitor AllowedExecuteDuration?



From: Cole Bollig <cabollig@xxxxxxxx>
Date: Wednesday, March 25, 2026 at 2:16 PM
To: Weatherby,Gerard <gweatherby@xxxxxxxx>, HTCondor Users <htcondor-users@xxxxxxxxxxx>
Subject: Re: Limiting jobs to two days

Hi Gerard,

The issue here is that the submit transforms take place before the submit requirements check. Since the transform defines AllowedExecuteDuration with a default value, AllowedExecuteDuration is never UNDEFINED. I was able to achieve your desired behavior with the following configuration (complete):

JOB_TRANSFORM_NAMES = $(JOB_TRANSFORM_NAMES) SetTimeLimit
JOB_TRANSFORM_SetTimeLimit @=end
   REQUIREMENTS AllowedExecuteDuration =?= UNDEFINED
   EVALSET AllowedExecuteDuration (2 * 24 * 60 * 60)
   SET UsingDefaultMaxRuntime True
@end

SUBMIT_REQUIREMENT_NAMES = $(SUBMIT_REQUIREMENT_NAMES) AddCap MaxExecuteDuration

# Submit requirement: explicitly reject jobs that define a duration exceeding the site limit
SUBMIT_REQUIREMENT_MaxExecuteDuration = AllowedExecuteDuration <= (2 * 24 * 60 * 60)
SUBMIT_REQUIREMENT_MaxExecuteDuration_REASON = strcat("AllowedExecuteDuration of ", AllowedExecuteDuration, " seconds exceeds the site limit of ", (2 * 24 * 60 * 60), " seconds (2 days).")

# Submit warning: inform user of default max 2 day runtime
SUBMIT_REQUIREMENT_AddCap = UsingDefaultMaxRuntime =!= True
SUBMIT_REQUIREMENT_AddCap_REASON = "As of Mar 26th, maximum runtime is set to 2 days" 
SUBMIT_REQUIREMENT_AddCap_IS_WARNING = TRUE

Note: This also has the added benefit of allowing you to query the Schedd or history for the UsingDefaultMaxRuntime attribute to see how many jobs or what users are not explicitly setting a maximum runtime (if you desire to do some analysis in that field of information).
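As a concrete illustration of that query, something along these lines should work from the AP (an untested sketch; the attribute name comes from the transform above, and `-constraint`/`-af` are standard condor_q/condor_history options):

```
# jobs currently in the queue that fell back to the site default
condor_q -constraint 'UsingDefaultMaxRuntime =?= true' -af Owner ClusterId ProcId

# the same question against completed jobs in the history
condor_history -constraint 'UsingDefaultMaxRuntime =?= true' -af Owner
```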

Cheers,
Cole Bollig

From: Weatherby,Gerard <gweatherby@xxxxxxxx>
Sent: Wednesday, March 25, 2026 12:18 PM
To: Cole Bollig <cabollig@xxxxxxxx>; HTCondor Users <htcondor-users@xxxxxxxxxxx>
Subject: Re: Limiting jobs to two days
 
Cole,

Fantastic. We've been able to get it to work:

# Transform: only set the 2-day default when the user has NOT defined a duration
JOB_TRANSFORM_NAMES = $(JOB_TRANSFORM_NAMES) SetTimeLimit
JOB_TRANSFORM_SetTimeLimit @=end
   REQUIREMENTS AllowedExecuteDuration =?= UNDEFINED
   EVALSET AllowedExecuteDuration (2 * 24 * 60 * 60)
@end

# Submit requirement: explicitly reject jobs that define a duration exceeding the site limit
SUBMIT_REQUIREMENT_NAMES = $(SUBMIT_REQUIREMENT_NAMES) MaxExecuteDuration
SUBMIT_REQUIREMENT_MaxExecuteDuration = AllowedExecuteDuration <= (2 * 24 * 60 * 60)
SUBMIT_REQUIREMENT_MaxExecuteDuration_REASON = strcat( \
  "AllowedExecuteDuration of ", AllowedExecuteDuration, \
  " seconds exceeds the site limit of ", (2 * 24 * 60 * 60), " seconds (2 days).")


The last piece we'd like to do is notify the user of the transform setting the 2-day limit. I tried this, but it did not seem to work:

SUBMIT_REQUIREMENT_NAMES = AddCap $(SUBMIT_REQUIREMENT_NAMES)
SUBMIT_REQUIREMENT_AddCap = AllowedExecuteDuration =!= UNDEFINED
SUBMIT_REQUIREMENT_AddCap_REASON = "As of Mar 26th, maximum runtime is set to 2 days" 
SUBMIT_REQUIREMENT_AddCap_IS_WARNING = TRUE

From: Cole Bollig <cabollig@xxxxxxxx>
Date: Wednesday, March 25, 2026 at 9:40 AM
To: Weatherby,Gerard <gweatherby@xxxxxxxx>, HTCondor Users <htcondor-users@xxxxxxxxxxx>
Subject: Re: Limiting jobs to two days

Hi Gerard,

My colleague reminded me this morning of the first-class JDL command allowed_execute_duration (the maximum execution time of one job epoch). While this command is intended to be used by users, the AP can set the limit via a submit transform to ensure all jobs placed through that AP have a two-day limit:

JOB_TRANSFORM_NAMES = $(JOB_TRANSFORM_NAMES) SetTimeLimit
JOB_TRANSFORM_SetTimeLimit @=end
   # Set the 2-day limit for any job that doesn't define a max duration or defines one greater than the limit
   REQUIREMENTS AllowedExecuteDuration =?= UNDEFINED || AllowedExecuteDuration > (2 * 24 * 60 * 60)
   EVALSET AllowedExecuteDuration (2 * 24 * 60 * 60)
@end

This should cause non-checkpointing jobs to go on hold with a nice message while making checkpointing jobs go back into the queue for further matchmaking. Note that in this sample configuration I am overwriting any user-defined execute duration greater than the desired limit (2 days). If you wanted to make this behavior less silent, you could move the second clause of the requirements into an explicit submit requirement, inverted so that job placement fails if the user defined an allowed execute duration greater than the system's desired limit.

Cheers,
Cole Bollig

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Cole Bollig via HTCondor-users <htcondor-users@xxxxxxxxxxx>
Sent: Tuesday, March 24, 2026 3:27 PM
To: gweatherby@xxxxxxxx <gweatherby@xxxxxxxx>; HTCondor Users <htcondor-users@xxxxxxxxxxx>
Cc: Cole Bollig <cabollig@xxxxxxxx>
Subject: Re: [HTCondor-users] Limiting jobs to two days
 
Hi Gerard,

If you want to control this logic from the Access Point (AP), then you would want to use SYSTEM_PERIODIC_VACATE to kick any jobs exceeding the desired execute time and allow them to go back into the queue for matchmaking. Here in our local CHTC pool we enforce a maximum execution timeout on the Execution Point side of things. It would take some time to dig that configuration out and strip out CHTC pool specifics, but it is based on this 2015 HTC presentation.
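An AP-side variant along those lines might look like the following (an untested sketch: JobStatus == 2 means Running, and EnteredCurrentStatus is the timestamp visible in the job ad above; check the expression against your HTCondor version before deploying):

```
# vacate any job that has been in the Running state for more than 2 days
SYSTEM_PERIODIC_VACATE = (JobStatus == 2) && \
    ((time() - EnteredCurrentStatus) > (2 * 24 * 60 * 60))
```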

-Cole Bollig


From: Weatherby,Gerard <gweatherby@xxxxxxxx>
Sent: Tuesday, March 24, 2026 1:40 PM
To: Cole Bollig <cabollig@xxxxxxxx>; HTCondor Users <htcondor-users@xxxxxxxxxxx>
Subject: Re: Limiting jobs to two days
 
We want the job to release the scarce resource on the EP (the GPUs) and let other jobs that have been waiting have a turn. Ideally, the job would get back in line. (We will be urging our users to checkpoint their jobs).


From: Cole Bollig <cabollig@xxxxxxxx>
Date: Tuesday, March 24, 2026 at 1:51 PM
To: HTCondor Users <htcondor-users@xxxxxxxxxxx>
Cc: Weatherby,Gerard <gweatherby@xxxxxxxx>
Subject: Re: Limiting jobs to two days

Hi Gerard,


-Cole Bollig 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Weatherby,Gerard via HTCondor-users <htcondor-users@xxxxxxxxxxx>
Sent: Tuesday, March 24, 2026 12:24 PM
To: HTCondor Users <htcondor-users@xxxxxxxxxxx>
Cc: gweatherby@xxxxxxxx <gweatherby@xxxxxxxx>
Subject: [HTCondor-users] Limiting jobs to two days
 
We want to limit user jobs to two days to more fairly allocate resources. We're asking users to checkpoint their jobs if they are going to run longer than that.

It's not clear which SYSTEM_PERIODIC_* knob we should set to best implement this.


-----------------------------------

 

GERARD WEATHERBY

Application Architect

 

NMRhub

nmrhub.org

 
