From: Cole Bollig <cabollig@xxxxxxxx>
Date: Friday, March 27, 2026 at 9:39 AM
To: Weatherby,Gerard <gweatherby@xxxxxxxx>, HTCondor Users <htcondor-users@xxxxxxxxxxx>
Subject: Re: Limiting jobs to two days
*** Attention: This is an external email. ***
Use caution responding, opening attachments or clicking on links.
Hi Gerard,
If you check the AP's shadow log, do you see any messages containing 'The job exceeded allowed execute duration of'? Each should be followed by a 'Sending DEACTIVATE_CLAIM to startd' message. I just want to confirm the shadow is properly triggering the allowed execute duration. What is likely happening is that the EP is receiving the deactivate claim and then allowing the job to execute for some extra time due to a max vacate time. What do
condor_config_val -dump vacate and condor_config_val kill say on your EP?
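For reference, the EP-side knobs that usually govern this extra grace window are MachineMaxVacateTime and the startd KILL expression. A sketch of what they might look like (the values here are illustrative assumptions, not defaults or recommendations):

```
# How many seconds a job gets to exit gracefully after a vacate request
MachineMaxVacateTime = 600
# Startd policy expression; when it evaluates to True the graceful
# vacate window is skipped and the job is hard-killed immediately
KILL = false
```

With a nonzero vacate window, a job can legitimately run past its AllowedExecuteDuration by up to that many seconds.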
-Cole Bollig
From: Weatherby,Gerard <gweatherby@xxxxxxxx>
Sent: Thursday, March 26, 2026 10:56 AM
To: Cole Bollig <cabollig@xxxxxxxx>; HTCondor Users <htcondor-users@xxxxxxxxxxx>
Subject: Re: Limiting jobs to two days
Running 25.7.2-1+ubu24 on Ubuntu 24 on the central manager, access point, and execute node.
I set a test job to 60 seconds and it's been running over an hour:
ActivationDuration = 5
ActivationExecutionDuration = 5
ActivationSetupDuration = 0
ActivationTeardownDuration = 0
AllowedExecuteDuration = 60
Args = "--cpu --onsig"
AuthTokenId = "da90ddd6609bb3f39bd86b7caf08dc30"
AuthTokenIssuer = "condor-mgr-lts.nmrbox.org"
AuthTokenSubject = "gweatherby@xxxxxxxxxx"
AutoClusterAttrs = "FirstUpdateUptimeGPUsSeconds,LastUpdateUptimeGPUsSeconds,Production,RequestCpus,RequestDisk,RequestGPUs,RequestMemory,StartOfJobUptimeGPUsSeconds,UptimeGPUsSeconds,ConcurrencyLimits,FlockTo,Rank,Requirements,ChtcProjects,ContainerImageSource,DockerImage,GlideinFactory,GPUs_Capability,GPUs_DeviceName,GPUs_DriverVersion,GPUs_GlobalMemoryMb,GPUs_MaxSupportedVersion,InteractiveJob,is_resumable,IsBuildJob,LongJob,Owner,PreventJobsReason,PrioritizedProjects,profiling,ProjectName,RequestIoHeavy,want_campus_pools,want_ospool,WantFlocking,WantGlidein,GPUJobLength,WantGPULab,HasRaddusHtcCephFS,FileSystemDomain,Machine,TransferInputSizeMB"
AutoClusterId = 3
BlockReadKbytes = 0
BlockReads = 0
BlockWriteKbytes = 0
BlockWrites = 0
BytesRecvd = 0.0
BytesSent = 0.0
ChtcProjects = undefined
ClusterId = 995417
Cmd = "/home/nmrbox/gweatherby/condor/signal_catcher.py"
CommittedSlotTime = 0
CommittedSuspensionTime = 0
CommittedTime = 0
CondorPlatform = "$CondorPlatform: X86_64-Ubuntu_24.04 $"
CondorVersion = "$CondorVersion: 25.7.2 2026-03-11 BuildID: 881773 PackageID: 25.7.2-1+ubu24 GitSHA: 433de0b3 $"
CpusProvisioned = 1
CpusUsage = 1.000184786082713
CumulativeRemoteSysCpu = 0.0
CumulativeRemoteUserCpu = 6067.0
CumulativeSlotTime = 2761.0
CumulativeSuspensionTime = 0
CurrentHosts = 1
DiskProvisioned = 1048576
DiskUsage = 2
DiskUsage_RAW = 2
EnteredCurrentStatus = 1774536434 (2026-03-26 10:47:14)
Environment = ""
Err = "/dev/null"
ExecutableSize = 2
ExecutableSize_RAW = 2
ExecuteDirWasEncrypted = false
ExitBySignal = false
ExitCode = 15
ExitReason = "died on signal 9 (Killed)"
ExitSignal = 9
ExitStatus = 0
FileSystemDomain = "nmrbox.org"
FirstJobMatchDate = 1774533302 (2026-03-26 09:55:02)
GlobalJobId = "condor-ap-lts.nmrbox.org#995417.0#1774533301"
GPUsProvisioned = 0
ImageSize = 7500
ImageSize_RAW = 5416
In = "/dev/null"
InitialWaitDuration = 1
Iwd = "/home/nmrbox/gweatherby/condor"
JobCurrentReconnectAttempt = undefined
JobCurrentStartDate = 1774536434 (2026-03-26 10:47:14)
JobCurrentStartExecutingDate = 1774536434 (2026-03-26 10:47:14)
JobLastStartDate = 1774536233 (2026-03-26 10:43:53)
JobLeaseDuration = 2400
JobNotification = 0
JobPrio = 0
JobRunCount = 3
JobStartDate = 1774533302 (2026-03-26 09:55:02)
JobStatus = 2
JobSubmitFile = "jobsig"
JobSubmitMethod = 0
JobUniverse = 5
LastJobLeaseRenewal = 1774540385 (2026-03-26 11:53:05)
LastJobStatus = 1
LastMatchTime = 1774536434 (2026-03-26 10:47:14)
LastPublicClaimId = "<155.37.253.100:9618?addrs=155.37.253.100-9618&alias=argon.nmrbox.org&noUDP&sock=startd_58990_cfd8>#1774536063#1#..."
LastRejMatchNegotiator = "condor-mgr-lts.nmrbox.org"
LastRejMatchReason = "no match found"
LastRejMatchTime = 1774536434 (2026-03-26 10:47:14)
LastRemoteHost = "slot1_1@xxxxxxxxxxxxxxxx"
LastRemoteWallClockTime = 6.0
LastSuspensionTime = 0
LastVacateTime = 1774536057 (2026-03-26 10:40:57)
LeaveJobInQueue = false
MachineAttrCpus0 = 1
MachineAttrSlotWeight0 = 1
MaxHosts = 1
MemoryProvisioned = 2048
MemoryUsage = ((ResidentSetSize + 1023) / 1024)
MinHosts = 1
MyType = "Job"
NumCkpts = 0
NumCkpts_RAW = 0
NumJobCompletions = 0
NumJobMatches = 3
NumJobStarts = 3
NumRestarts = 0
NumShadowStarts = 3
NumSystemHolds = 0
NumVacates = 2
NumVacatesByReason = [ StartdShutdown = 1; StartdPreemptExpression = 1 ]
OrigMaxHosts = 1
Out = "/home/nmrbox/gweatherby/condor/sig_995417.out"
Owner = "gweatherby"
ProcId = 0
Production = false
ProjectName = "NMRbox"
PublicClaimId = "<155.37.253.100:9618?addrs=155.37.253.100-9618&alias=argon.nmrbox.org&noUDP&sock=startd_71281_6a63>#1774536390#1#..."
QDate = 1774533301 (2026-03-26 09:55:01)
Rank = 0.0
RecentBlockReadKbytes = 0
RecentBlockReads = 0
RecentBlockWriteKbytes = 0
RecentBlockWrites = 0
RecentStatsLifetimeStarter = 1200
RemoteHost = "slot1_1@xxxxxxxxxxxxxxxx"
RemoteSlotID = 1
RemoteSysCpu = 0.0
RemoteUserCpu = 3311.0
RemoteWallClockTime = 2761.0
RequestCpus = 1
RequestDisk = MAX({ 1024,(TransferInputSizeMB + 1) * 1.25 }) * 1024
RequestMemory = 2048
Requirements = ((Machine == "argon.nmrbox.org")) && (TARGET.Arch == "X86_64") && (TARGET.OpSys == "LINUX") && (TARGET.Disk >= RequestDisk) && (TARGET.Memory >= RequestMemory) && ((TARGET.FileSystemDomain == MY.FileSystemDomain) || (TARGET.HasFileTransfer))
ResidentSetSize = 7500
ResidentSetSize_RAW = 5416
ServerTime = 1774540508 (2026-03-26 11:55:08)
ShadowBday = 1774536434 (2026-03-26 10:47:14)
ShouldTransferFiles = "IF_NEEDED"
StartdIpAddr = "<155.37.253.100:9618?addrs=155.37.253.100-9618&alias=argon.nmrbox.org&noUDP&sock=startd_71281_6a63>"
StartdPrincipal = "execute-side@matchsession/155.37.253.100"
StatsLifetimeStarter = 3315
StreamOut = false
TargetType = "Machine"
TotalSubmitProcs = 1
TotalSuspensions = 0
TransferErr = false
TransferIn = false
TransferInputFileCounts = [ CEDAR = 1 ]
TransferInputSizeMB = 1
TransferInputStats = [ ]
TransferOutputStats = [ ]
User = "gweatherby@xxxxxxxxxx"
UserLog = "/home/nmrbox/gweatherby/condor/sig_995417.log"
WhenToTransferOutput = "ON_EXIT"
From: Cole Bollig <cabollig@xxxxxxxx>
Date: Thursday, March 26, 2026 at 11:13 AM
To: Weatherby,Gerard <gweatherby@xxxxxxxx>, HTCondor Users <htcondor-users@xxxxxxxxxxx>
Subject: Re: Limiting jobs to two days
Hi Gerard,
One possible reason is that AllowedExecuteDuration only applies to actual job execution time and not the entire running state, so input file transfer time is not accounted for in the enforcement. This matters if you want to limit the total time a job is on an EP as opposed to the total time a job spends executing. EPs running versions that don't know about this functionality should hopefully not be a factor unless you have EPs older than v9.4.1 or v9.5.0.
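As an illustration of that distinction, here is a toy Python model (not HTCondor code) of what the enforcement checks: only the execution phase counts against the limit, and transfer time is ignored.

```python
# Toy model of AllowedExecuteDuration enforcement: only time spent in the
# execution phase of an activation counts; transfer time is ignored.
def exceeds_allowed_execute_duration(execute_seconds: int,
                                     transfer_seconds: int,
                                     allowed_seconds: int) -> bool:
    # transfer_seconds is deliberately unused: input/output transfer
    # does not count toward the allowed execute duration
    return execute_seconds > allowed_seconds

# A job that transferred input for 10 hours but executed for 1 hour
# is still within a 2-hour allowed execute duration:
print(exceeds_allowed_execute_duration(3600, 36000, 7200))   # False
print(exceeds_allowed_execute_duration(7201, 0, 7200))       # True
```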
-Cole Bollig
From: Weatherby,Gerard <gweatherby@xxxxxxxx>
Sent: Thursday, March 26, 2026 9:49 AM
To: Cole Bollig <cabollig@xxxxxxxx>; HTCondor Users <htcondor-users@xxxxxxxxxxx>
Subject: Re: Limiting jobs to two days
Hi Cole,
That works great. AllowedExecuteDuration is being set as desired.
Our cluster does not seem to be consistently limiting jobs to the specified duration, though. Are there values that have to be set for HTCondor to monitor AllowedExecuteDuration?
From: Cole Bollig <cabollig@xxxxxxxx>
Date: Wednesday, March 25, 2026 at 2:16 PM
To: Weatherby,Gerard <gweatherby@xxxxxxxx>, HTCondor Users <htcondor-users@xxxxxxxxxxx>
Subject: Re: Limiting jobs to two days
Hi Gerard,
The issue here is that the submit transforms take place before the submit requirements check. Since the transform defines AllowedExecuteDuration with a default value, AllowedExecuteDuration is never UNDEFINED. I was able to achieve your desired behavior with
the following configuration (complete):
JOB_TRANSFORM_NAMES = $(JOB_TRANSFORM_NAMES) SetTimeLimit
JOB_TRANSFORM_SetTimeLimit @=end
  REQUIREMENTS AllowedExecuteDuration =?= UNDEFINED
  EVALSET AllowedExecuteDuration (2 * 24 * 60 * 60)
  SET UsingDefaultMaxRuntime True
@end
SUBMIT_REQUIREMENT_NAMES = $(SUBMIT_REQUIREMENT_NAMES) AddCap MaxExecuteDuration
# Submit requirement: explicitly reject jobs that define a duration exceeding the site limit
SUBMIT_REQUIREMENT_MaxExecuteDuration = AllowedExecuteDuration <= (2 * 24 * 60 * 60)
SUBMIT_REQUIREMENT_MaxExecuteDuration_REASON = strcat("AllowedExecuteDuration of ", AllowedExecuteDuration, " seconds exceeds the site limit of ", (2 * 24 * 60 * 60), " seconds (2 days).")
# Submit warning: inform user of default max 2 day runtime
SUBMIT_REQUIREMENT_AddCap = UsingDefaultMaxRuntime =!= True
SUBMIT_REQUIREMENT_AddCap_REASON = "As of Mar 26th, maximum runtime is set to 2 days"
SUBMIT_REQUIREMENT_AddCap_IS_WARNING = TRUE
Note: This also has the added benefit of letting you query the Schedd or the history for the UsingDefaultMaxRuntime attribute, to see how many jobs (or which users) are not explicitly setting a maximum runtime, if you want to do some analysis with that information.
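For example, queries along these lines should work (hypothetical constraints; adjust for your site):

```
# Currently queued/running jobs that fell back to the default limit
condor_q -allusers -constraint 'UsingDefaultMaxRuntime =?= true' -af Owner ClusterId ProcId

# Same question against completed jobs in the history
condor_history -constraint 'UsingDefaultMaxRuntime =?= true' -af Owner
```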
Cheers,
Cole Bollig
From: Weatherby,Gerard <gweatherby@xxxxxxxx>
Sent: Wednesday, March 25, 2026 12:18 PM
To: Cole Bollig <cabollig@xxxxxxxx>; HTCondor Users <htcondor-users@xxxxxxxxxxx>
Subject: Re: Limiting jobs to two days
Cole,
Fantastic. We've been able to get it to work:
# Transform: only set the 2-day default when the user has NOT defined a duration
JOB_TRANSFORM_NAMES = $(JOB_TRANSFORM_NAMES) SetTimeLimit
JOB_TRANSFORM_SetTimeLimit @=end
REQUIREMENTS AllowedExecuteDuration =?= UNDEFINED
EVALSET AllowedExecuteDuration (2 * 24 * 60 * 60)
@end
# Submit requirement: explicitly reject jobs that define a duration exceeding the site limit
SUBMIT_REQUIREMENT_NAMES = $(SUBMIT_REQUIREMENT_NAMES) MaxExecuteDuration
SUBMIT_REQUIREMENT_MaxExecuteDuration = AllowedExecuteDuration <= (2 * 24 * 60 * 60)
SUBMIT_REQUIREMENT_MaxExecuteDuration_REASON = strcat( \
"AllowedExecuteDuration of ", AllowedExecuteDuration, \
" seconds exceeds the site limit of ", (2 * 24 * 60 * 60), " seconds (2 days).")
The last piece we'd like to do is notify the user of the transform setting the 2-day limit. I tried this, but it did not seem to work:
SUBMIT_REQUIREMENT_NAMES = AddCap $(SUBMIT_REQUIREMENT_NAMES)
SUBMIT_REQUIREMENT_AddCap = AllowedExecuteDuration =!= UNDEFINED
SUBMIT_REQUIREMENT_AddCap_REASON = "As of Mar 26th, maximum runtime is set to 2 days"
SUBMIT_REQUIREMENT_AddCap_IS_WARNING = TRUE
From: Cole Bollig <cabollig@xxxxxxxx>
Date: Wednesday, March 25, 2026 at 9:40 AM
To: Weatherby,Gerard <gweatherby@xxxxxxxx>, HTCondor Users <htcondor-users@xxxxxxxxxxx>
Subject: Re: Limiting jobs to two days
Hi Gerard,
My colleague just reminded me this morning of the following first-class JDL command:
allowed_execute_duration (maximum execution time of one job epoch). While this command is intended to be used by the user, the AP can set this limit via a submit transform to ensure all jobs placed at that AP have a two-day limit:
JOB_TRANSFORM_NAMES = $(JOB_TRANSFORM_NAMES) SetTimeLimit
JOB_TRANSFORM_SetTimeLimit @=end
# Set 2 day limit for any jobs that don't define a max duration or define a duration greater than the two day limit
  REQUIREMENTS AllowedExecuteDuration =?= UNDEFINED || AllowedExecuteDuration > (2 * 24 * 60 * 60)
  EVALSET AllowedExecuteDuration (2 * 24 * 60 * 60)
@end
This should cause non-checkpointing jobs to go on hold with a nice message, while checkpointing jobs go back into the queue for further matchmaking. Note that in this sample configuration I am overwriting any user-defined execute duration greater than the desired limit (2 days). If you wanted this behavior to be less silent, you could move the second clause of the REQUIREMENTS into an explicit submit requirement, inverted, so that job placement fails if the user defines an allowed execute duration greater than the system's desired limit.
Cheers,
Cole Bollig
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Cole Bollig via HTCondor-users <htcondor-users@xxxxxxxxxxx>
Sent: Tuesday, March 24, 2026 3:27 PM
To: gweatherby@xxxxxxxx <gweatherby@xxxxxxxx>; HTCondor Users <htcondor-users@xxxxxxxxxxx>
Cc: Cole Bollig <cabollig@xxxxxxxx>
Subject: Re: [HTCondor-users] Limiting jobs to two days
Hi Gerard,
If you want to control this logic from the Access Point (AP), then you would want to use SYSTEM_PERIODIC_VACATE to kick any jobs exceeding the desired execute time and allow them to go back into the queue for matchmaking. Here in our local CHTC pool we do the max execution timeout on the Execution Point side of things. It would take some time to dig that configuration out and strip out the CHTC pool specifics, but it is based on this 2015 HTC presentation.
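A minimal AP-side sketch of that approach (the 2-day constant and the attribute choice are assumptions to adapt for your site; JobCurrentStartDate marks the start of the current activation):

```
# Vacate any running job whose current activation started more than 2 days ago
SYSTEM_PERIODIC_VACATE = (JobStatus == 2) && \
    ((time() - JobCurrentStartDate) > (2 * 24 * 60 * 60))
```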
-Cole Bollig
From: Weatherby,Gerard <gweatherby@xxxxxxxx>
Sent: Tuesday, March 24, 2026 1:40 PM
To: Cole Bollig <cabollig@xxxxxxxx>; HTCondor Users <htcondor-users@xxxxxxxxxxx>
Subject: Re: Limiting jobs to two days
We want the job to release the scarce resource on the EP (the GPUs) and let other jobs that have been waiting have a turn. Ideally, the job would get back in line. (We will be urging our users to checkpoint their jobs).
From: Cole Bollig <cabollig@xxxxxxxx>
Date: Tuesday, March 24, 2026 at 1:51 PM
To: HTCondor Users <htcondor-users@xxxxxxxxxxx>
Cc: Weatherby,Gerard <gweatherby@xxxxxxxx>
Subject: Re: Limiting jobs to two days
Hi Gerard,
- What do you mean by limiting jobs to two days? Do you mean you want to only allow user jobs to execute a max of two days on an EP?
- What do you want to happen when the max time is reached? Remove the job? Hold the job? Kick the job off the EP for a bit of time?
-Cole Bollig
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Weatherby,Gerard via HTCondor-users <htcondor-users@xxxxxxxxxxx>
Sent: Tuesday, March 24, 2026 12:24 PM
To: HTCondor Users <htcondor-users@xxxxxxxxxxx>
Cc: gweatherby@xxxxxxxx <gweatherby@xxxxxxxx>
Subject: [HTCondor-users] Limiting jobs to two days
We want to limit user jobs to two days to more fairly allocate resources. We're asking users to checkpoint their jobs if they are going to run longer than that.
It's not clear which SYSTEM_PERIODIC_ knob we should set to best implement this.
-----------------------------------
GERARD WEATHERBY
Application Architect
NMRhub
nmrhub.org
