From: Cole Bollig <cabollig@xxxxxxxx>
Date: Friday, March 27, 2026 at 9:39 AM
To: Weatherby,Gerard <gweatherby@xxxxxxxx>, HTCondor Users <htcondor-users@xxxxxxxxxxx>
Subject: Re: Limiting jobs to two days
*** Attention: This is an external email. ***
Use caution responding, opening attachments or clicking on links.
Hi Gerard,
If you check the AP's shadow log, do you see any messages containing 'The job exceeded allowed execute duration of'? Each should be followed by a 'Sending DEACTIVATE_CLAIM to startd' message. I just want to confirm the shadow is properly triggering the allowed execute duration. What is likely happening is that the EP is receiving the deactivate claim and then allowing the job to execute for some extra time due to a max vacate time. What do
condor_config_val -dump vacate and condor_config_val kill say on your EP?
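For reference, the EP-side knobs that usually govern this extra grace window are MachineMaxVacateTime and the startd KILL expression. A sketch of what they might look like (the values here are illustrative assumptions, not defaults or recommendations):

```
# How many seconds a job gets to exit gracefully after a vacate request
MachineMaxVacateTime = 600
# Startd policy expression; when it evaluates to True the graceful
# vacate window is skipped and the job is hard-killed immediately
KILL = false
```

With a nonzero vacate window, a job can legitimately run past its AllowedExecuteDuration by up to that many seconds.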
-Cole Bollig
From: Weatherby,Gerard <gweatherby@xxxxxxxx>
Sent: Thursday, March 26, 2026 10:56 AM
To: Cole Bollig <cabollig@xxxxxxxx>; HTCondor Users <htcondor-users@xxxxxxxxxxx>
Subject: Re: Limiting jobs to two days
Running 25.7.2-1+ubu24 on Ubuntu 24 on the central manager, access point, and execute node.
I set a test job to 60 seconds and it's been running over an hour:
ActivationDuration = 5
ActivationExecutionDuration = 5
ActivationSetupDuration = 0
ActivationTeardownDuration = 0
AllowedExecuteDuration = 60
Args = "--cpu --onsig"
AuthTokenId = "da90ddd6609bb3f39bd86b7caf08dc30"
AuthTokenIssuer = "condor-mgr-lts.nmrbox.org"
AuthTokenSubject = "gweatherby@xxxxxxxxxx"
AutoClusterAttrs = "FirstUpdateUptimeGPUsSeconds,LastUpdateUptimeGPUsSeconds,Production,RequestCpus,RequestDisk,RequestGPUs,RequestMemory,StartOfJobUptimeGPUsSeconds,UptimeGPUsSeconds,ConcurrencyLimits,FlockTo,Rank,Requirements,ChtcProjects,ContainerImageSource,DockerImage,GlideinFactory,GPUs_Capability,GPUs_DeviceName,GPUs_DriverVersion,GPUs_GlobalMemoryMb,GPUs_MaxSupportedVersion,InteractiveJob,is_resumable,IsBuildJob,LongJob,Owner,PreventJobsReason,PrioritizedProjects,profiling,ProjectName,RequestIoHeavy,want_campus_pools,want_ospool,WantFlocking,WantGlidein,GPUJobLength,WantGPULab,HasRaddusHtcCephFS,FileSystemDomain,Machine,TransferInputSizeMB"
AutoClusterId = 3
BlockReadKbytes = 0
BlockReads = 0
BlockWriteKbytes = 0
BlockWrites = 0
BytesRecvd = 0.0
BytesSent = 0.0
ChtcProjects = undefined
ClusterId = 995417
Cmd = "/home/nmrbox/gweatherby/condor/signal_catcher.py"
CommittedSlotTime = 0
CommittedSuspensionTime = 0
CommittedTime = 0
CondorPlatform = "$CondorPlatform: X86_64-Ubuntu_24.04 $"
CondorVersion = "$CondorVersion: 25.7.2 2026-03-11 BuildID: 881773 PackageID: 25.7.2-1+ubu24 GitSHA: 433de0b3 $"
CpusProvisioned = 1
CpusUsage = 1.000184786082713
CumulativeRemoteSysCpu = 0.0
CumulativeRemoteUserCpu = 6067.0
CumulativeSlotTime = 2761.0
CumulativeSuspensionTime = 0
CurrentHosts = 1
DiskProvisioned = 1048576
DiskUsage = 2
DiskUsage_RAW = 2
EnteredCurrentStatus = 1774536434 (2026-03-26 10:47:14)
Environment = ""
Err = "/dev/null"
ExecutableSize = 2
ExecutableSize_RAW = 2
ExecuteDirWasEncrypted = false
ExitBySignal = false
ExitCode = 15
ExitReason = "died on signal 9 (Killed)"
ExitSignal = 9
ExitStatus = 0
FileSystemDomain = "nmrbox.org"
FirstJobMatchDate = 1774533302 (2026-03-26 09:55:02)
GlobalJobId = "condor-ap-lts.nmrbox.org#995417.0#1774533301"
GPUsProvisioned = 0
ImageSize = 7500
ImageSize_RAW = 5416
In = "/dev/null"
InitialWaitDuration = 1
Iwd = "/home/nmrbox/gweatherby/condor"
JobCurrentReconnectAttempt = undefined
JobCurrentStartDate = 1774536434 (2026-03-26 10:47:14)
JobCurrentStartExecutingDate = 1774536434 (2026-03-26 10:47:14)
JobLastStartDate = 1774536233 (2026-03-26 10:43:53)
JobLeaseDuration = 2400
JobNotification = 0
JobPrio = 0
JobRunCount = 3
JobStartDate = 1774533302 (2026-03-26 09:55:02)
JobStatus = 2
JobSubmitFile = "jobsig"
JobSubmitMethod = 0
JobUniverse = 5
LastJobLeaseRenewal = 1774540385 (2026-03-26 11:53:05)
LastJobStatus = 1
LastMatchTime = 1774536434 (2026-03-26 10:47:14)
LastPublicClaimId = "<155.37.253.100:9618?addrs=155.37.253.100-9618&alias=argon.nmrbox.org&noUDP&sock=startd_58990_cfd8>#1774536063#1#..."
LastRejMatchNegotiator = "condor-mgr-lts.nmrbox.org"
LastRejMatchReason = "no match found"
LastRejMatchTime = 1774536434 (2026-03-26 10:47:14)
LastRemoteHost = "slot1_1@xxxxxxxxxxxxxxxx"
LastRemoteWallClockTime = 6.0
LastSuspensionTime = 0
LastVacateTime = 1774536057 (2026-03-26 10:40:57)
LeaveJobInQueue = false
MachineAttrCpus0 = 1
MachineAttrSlotWeight0 = 1
MaxHosts = 1
MemoryProvisioned = 2048
MemoryUsage = ((ResidentSetSize + 1023) / 1024)
MinHosts = 1
MyType = "Job"
NumCkpts = 0
NumCkpts_RAW = 0
NumJobCompletions = 0
NumJobMatches = 3
NumJobStarts = 3
NumRestarts = 0
NumShadowStarts = 3
NumSystemHolds = 0
NumVacates = 2
NumVacatesByReason = [ StartdShutdown = 1; StartdPreemptExpression = 1 ]
OrigMaxHosts = 1
Out = "/home/nmrbox/gweatherby/condor/sig_995417.out"
Owner = "gweatherby"
ProcId = 0
Production = false
ProjectName = "NMRbox"
PublicClaimId = "<155.37.253.100:9618?addrs=155.37.253.100-9618&alias=argon.nmrbox.org&noUDP&sock=startd_71281_6a63>#1774536390#1#..."
QDate = 1774533301 (2026-03-26 09:55:01)
Rank = 0.0
RecentBlockReadKbytes = 0
RecentBlockReads = 0
RecentBlockWriteKbytes = 0
RecentBlockWrites = 0
RecentStatsLifetimeStarter = 1200
RemoteHost = "slot1_1@xxxxxxxxxxxxxxxx"
RemoteSlotID = 1
RemoteSysCpu = 0.0
RemoteUserCpu = 3311.0
RemoteWallClockTime = 2761.0
RequestCpus = 1
RequestDisk = MAX({ 1024,(TransferInputSizeMB + 1) * 1.25 }) * 1024
RequestMemory = 2048
Requirements = ((Machine == "argon.nmrbox.org")) && (TARGET.Arch == "X86_64") && (TARGET.OpSys == "LINUX") && (TARGET.Disk >= RequestDisk) && (TARGET.Memory >= RequestMemory) && ((TARGET.FileSystemDomain == MY.FileSystemDomain) || (TARGET.HasFileTransfer))
ResidentSetSize = 7500
ResidentSetSize_RAW = 5416
ServerTime = 1774540508 (2026-03-26 11:55:08)
ShadowBday = 1774536434 (2026-03-26 10:47:14)
ShouldTransferFiles = "IF_NEEDED"
StartdIpAddr = "<155.37.253.100:9618?addrs=155.37.253.100-9618&alias=argon.nmrbox.org&noUDP&sock=startd_71281_6a63>"
StartdPrincipal = "execute-side@matchsession/155.37.253.100"
StatsLifetimeStarter = 3315
StreamOut = false
TargetType = "Machine"
TotalSubmitProcs = 1
TotalSuspensions = 0
TransferErr = false
TransferIn = false
TransferInputFileCounts = [ CEDAR = 1 ]
TransferInputSizeMB = 1
TransferInputStats = [ ]
TransferOutputStats = [ ]
User = "gweatherby@xxxxxxxxxx"
UserLog = "/home/nmrbox/gweatherby/condor/sig_995417.log"
WhenToTransferOutput = "ON_EXIT"
From: Cole Bollig <cabollig@xxxxxxxx>
Date: Thursday, March 26, 2026 at 11:13 AM
To: Weatherby,Gerard <gweatherby@xxxxxxxx>, HTCondor Users <htcondor-users@xxxxxxxxxxx>
Subject: Re: Limiting jobs to two days
Hi Gerard,
One possible reason is that AllowedExecuteDuration only applies to actual job execution time and not the entire running state, so input file transfer time is not accounted for in the enforcement. This matters if you want to limit the total time a job is on an EP as opposed to the total time a job spends executing. EPs running versions that don't know about this functionality should hopefully not be a factor unless you have EPs older than v9.4.1 or v9.5.0.
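As an illustration of that distinction, here is a toy Python model (not HTCondor code) of what the enforcement checks: only the execution phase counts against the limit, and transfer time is ignored.

```python
# Toy model of AllowedExecuteDuration enforcement: only time spent in the
# execution phase of an activation counts; transfer time is ignored.
def exceeds_allowed_execute_duration(execute_seconds: int,
                                     transfer_seconds: int,
                                     allowed_seconds: int) -> bool:
    # transfer_seconds is deliberately unused: input/output transfer
    # does not count toward the allowed execute duration
    return execute_seconds > allowed_seconds

# A job that transferred input for 10 hours but executed for 1 hour
# is still within a 2-hour allowed execute duration:
print(exceeds_allowed_execute_duration(3600, 36000, 7200))   # False
print(exceeds_allowed_execute_duration(7201, 0, 7200))       # True
```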
-Cole Bollig
From: Weatherby,Gerard <gweatherby@xxxxxxxx>
Sent: Thursday, March 26, 2026 9:49 AM
To: Cole Bollig <cabollig@xxxxxxxx>; HTCondor Users <htcondor-users@xxxxxxxxxxx>
Subject: Re: Limiting jobs to two days
Hi Cole,
That works great. AllowedExecuteDuration is being set as desired.
Our cluster does not seem to be consistently limiting jobs to the specified duration, though. Are there values that have to be set for HTCondor to monitor AllowedExecuteDuration?
From: Cole Bollig <cabollig@xxxxxxxx>
Date: Wednesday, March 25, 2026 at 2:16 PM
To: Weatherby,Gerard <gweatherby@xxxxxxxx>, HTCondor Users <htcondor-users@xxxxxxxxxxx>
Subject: Re: Limiting jobs to two days
Hi Gerard,
The issue here is that the submit transforms take place before the submit requirements check. Since the transform defines AllowedExecuteDuration with a default value, AllowedExecuteDuration is never UNDEFINED. I was able to achieve your desired behavior with
the following configuration (complete):
JOB_TRANSFORM_NAMES = $(JOB_TRANSFORM_NAMES) SetTimeLimit
JOB_TRANSFORM_SetTimeLimit @=end
  REQUIREMENTS AllowedExecuteDuration =?= UNDEFINED
  EVALSET AllowedExecuteDuration (2 * 24 * 60 * 60)
  SET UsingDefaultMaxRuntime True
@end
SUBMIT_REQUIREMENT_NAMES = $(SUBMIT_REQUIREMENT_NAMES) AddCap MaxExecuteDuration
# Submit requirement: explicitly reject jobs that define a duration exceeding the site limit
SUBMIT_REQUIREMENT_MaxExecuteDuration = AllowedExecuteDuration <= (2 * 24 * 60 * 60)
SUBMIT_REQUIREMENT_MaxExecuteDuration_REASON = strcat("AllowedExecuteDuration of ", AllowedExecuteDuration, " seconds exceeds the site limit of ", (2 * 24 * 60 * 60), " seconds (2 days).")
# Submit warning: inform user of default max 2 day runtime
SUBMIT_REQUIREMENT_AddCap = UsingDefaultMaxRuntime =!= True
SUBMIT_REQUIREMENT_AddCap_REASON = "As of Mar 26th, maximum runtime is set to 2 days"
SUBMIT_REQUIREMENT_AddCap_IS_WARNING = TRUE
Note: This also has the added benefit of letting you query the Schedd or the history for the UsingDefaultMaxRuntime attribute, to see how many jobs (or which users) are not explicitly setting a maximum runtime, if you want to do some analysis with that information.
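For example, queries along these lines should work (hypothetical constraints; adjust for your site):

```
# Currently queued/running jobs that fell back to the default limit
condor_q -allusers -constraint 'UsingDefaultMaxRuntime =?= true' -af Owner ClusterId ProcId

# Same question against completed jobs in the history
condor_history -constraint 'UsingDefaultMaxRuntime =?= true' -af Owner
```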
Cheers,
Cole Bollig
From: Weatherby,Gerard <gweatherby@xxxxxxxx>
Sent: Wednesday, March 25, 2026 12:18 PM
To: Cole Bollig <cabollig@xxxxxxxx>; HTCondor Users <htcondor-users@xxxxxxxxxxx>
Subject: Re: Limiting jobs to two days
Cole,
Fantastic. We've been able to get it to work:
# Transform: only set the 2-day default when the user has NOT defined a duration
JOB_TRANSFORM_NAMES = $(JOB_TRANSFORM_NAMES) SetTimeLimit
JOB_TRANSFORM_SetTimeLimit @=end
REQUIREMENTS AllowedExecuteDuration =?= UNDEFINED
EVALSET AllowedExecuteDuration (2 * 24 * 60 * 60)
@end
# Submit requirement: explicitly reject jobs that define a duration exceeding the site limit
SUBMIT_REQUIREMENT_NAMES = $(SUBMIT_REQUIREMENT_NAMES) MaxExecuteDuration
SUBMIT_REQUIREMENT_MaxExecuteDuration = AllowedExecuteDuration <= (2 * 24 * 60 * 60)
SUBMIT_REQUIREMENT_MaxExecuteDuration_REASON = strcat( \
"AllowedExecuteDuration of ", AllowedExecuteDuration, \
" seconds exceeds the site limit of ", (2 * 24 * 60 * 60), " seconds (2 days).")
The last piece we'd like to do is notify the user of the transform setting the 2-day limit. I tried this, but it did not seem to work:
SUBMIT_REQUIREMENT_NAMES = AddCap $(SUBMIT_REQUIREMENT_NAMES)
SUBMIT_REQUIREMENT_AddCap = AllowedExecuteDuration =!= UNDEFINED
SUBMIT_REQUIREMENT_AddCap_REASON = "As of Mar 26th, maximum runtime is set to 2 days"
SUBMIT_REQUIREMENT_AddCap_IS_WARNING = TRUE
From: Cole Bollig <cabollig@xxxxxxxx>
Date: Wednesday, March 25, 2026 at 9:40 AM
To: Weatherby,Gerard <gweatherby@xxxxxxxx>, HTCondor Users <htcondor-users@xxxxxxxxxxx>
Subject: Re: Limiting jobs to two days
Hi Gerard,
My colleague just reminded me this morning of the following first-class JDL command:
allowed_execute_duration (maximum execution time of one job epoch). While this command is intended to be used by the user, the AP can set this limit via a submit transform to ensure all jobs placed at that AP have a two-day limit:
JOB_TRANSFORM_NAMES = $(JOB_TRANSFORM_NAMES) SetTimeLimit
JOB_TRANSFORM_SetTimeLimit @=end
# Set 2 day limit for any jobs that don't define a max duration or define a duration greater than the two day limit
  REQUIREMENTS AllowedExecuteDuration =?= UNDEFINED || AllowedExecuteDuration > (2 * 24 * 60 * 60)
  EVALSET AllowedExecuteDuration (2 * 24 * 60 * 60)
@end
This should cause non-checkpointing jobs to go on hold with a nice message, while checkpointing jobs go back into the queue for further matchmaking. Note that in this sample configuration I am overwriting any user-defined execute duration greater than the desired limit (2 days). If you wanted this behavior to be less silent, you could move the second clause of the REQUIREMENTS into an explicit submit requirement, inverted, so that job placement fails if the user defines an allowed execute duration greater than the system's desired limit.
Cheers,
Cole Bollig
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Cole Bollig via HTCondor-users <htcondor-users@xxxxxxxxxxx>
Sent: Tuesday, March 24, 2026 3:27 PM
To: gweatherby@xxxxxxxx <gweatherby@xxxxxxxx>; HTCondor Users <htcondor-users@xxxxxxxxxxx>
Cc: Cole Bollig <cabollig@xxxxxxxx>
Subject: Re: [HTCondor-users] Limiting jobs to two days
Hi Gerard,
If you want to control this logic from the Access Point (AP), then you would want to use SYSTEM_PERIODIC_VACATE to kick any jobs exceeding the desired execute time and allow them to go back into the queue for matchmaking. Here in our local CHTC pool we do the max execution timeout on the Execution Point side of things. It would take some time to dig that configuration out and strip out the CHTC pool specifics, but it is based on this 2015 HTC presentation.
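A minimal AP-side sketch of that approach (the 2-day constant and the attribute choice are assumptions to adapt for your site; JobCurrentStartDate marks the start of the current activation):

```
# Vacate any running job whose current activation started more than 2 days ago
SYSTEM_PERIODIC_VACATE = (JobStatus == 2) && \
    ((time() - JobCurrentStartDate) > (2 * 24 * 60 * 60))
```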
-Cole Bollig
From: Weatherby,Gerard <gweatherby@xxxxxxxx>
Sent: Tuesday, March 24, 2026 1:40 PM
To: Cole Bollig <cabollig@xxxxxxxx>; HTCondor Users <htcondor-users@xxxxxxxxxxx>
Subject: Re: Limiting jobs to two days
We want the job to release the scarce resource on the EP (the GPUs) and let other jobs that have been waiting have a turn. Ideally, the job would get back in line. (We will be urging our users to checkpoint their jobs).
From: Cole Bollig <cabollig@xxxxxxxx>
Date: Tuesday, March 24, 2026 at 1:51 PM
To: HTCondor Users <htcondor-users@xxxxxxxxxxx>
Cc: Weatherby,Gerard <gweatherby@xxxxxxxx>
Subject: Re: Limiting jobs to two days
Hi Gerard,
- What do you mean by limiting jobs to two days? Do you mean you want to only allow user jobs to execute a max of two days on an EP?
- What do you want to happen when the max time is reached? Remove the job? Hold the job? Kick the job off the EP for a bit of time?
-Cole Bollig
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Weatherby,Gerard via HTCondor-users <htcondor-users@xxxxxxxxxxx>
Sent: Tuesday, March 24, 2026 12:24 PM
To: HTCondor Users <htcondor-users@xxxxxxxxxxx>
Cc: gweatherby@xxxxxxxx <gweatherby@xxxxxxxx>
Subject: [HTCondor-users] Limiting jobs to two days
We want to limit user jobs to two days to more fairly allocate resources. We're asking users to checkpoint their jobs if they are going to run longer than that.
It's not clear which SYSTEM_PERIODIC_ knob we should set to best implement this.
-----------------------------------
GERARD WEATHERBY
Application Architect
NMRhub
nmrhub.org
