[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [HTCondor-users] Jobs are reporting cgroups kills at strange limits

Date: Thu, 19 Feb 2026 14:44:37 +0100
From: "svatosm@xxxxxx" <svatosm@xxxxxx>
Subject: Re: [HTCondor-users] Jobs are reporting cgroups kills at strange limits

Hi Thomas,

this is affecting all jobs, not just ATLAS. So, it does not seem to berelated to the pilot.


Michal

On 19/02/2026 14:36, Thomas Hartmann wrote:

Hi Michal (off list),
can you check if the remove might be handled by the atlas pilots? Ifthe kernel cgroup would have triggered, the job should have"disappeared" for condor
The
Â RemoveReason = "via condor_rm (by user atlasprd001)"
is a bit odd - if it would have been the kernel OOM, then I would notexpect a condor_rm
Can you check in the atlas job cgroup dirs, if the pilot has set up adedicated payload scope? [1.a,1.a]
My suspicion would be that the atlas pilot has jailed its payload jobinto a dedicated (sub)scope and is monitoring its memory statistics(and that is actually the pilot or panda (?) which is removing the job)
Cheers,
ÂThomas

[1.a]
[root@batch1505 ~]# ls/sys/fs/cgroup/system.slice/condordesy.service/condorjob.slice/_var_lib_condor_execute_slot1_4@xxxxxxxxxxxxxxxxxxxxxxx/cgroup.controllersÂÂÂÂÂ cgroup.threadsÂ cpu.weightmemory.events.localÂ memory.statÂÂÂÂÂÂÂÂÂÂÂÂ pids.currentcgroup.eventsÂÂÂÂÂÂÂÂÂÂ cgroup.typeÂÂÂÂ cpu.weight.nice memory.highÂÂÂÂ memory.swap.currentÂÂÂÂ pids.eventscgroup.freezeÂÂÂÂÂÂÂÂÂÂ controlÂÂÂÂÂÂÂÂ io.bfq.weight memory.low ÂÂÂÂmemory.swap.eventsÂÂÂÂÂ pids.events.localcgroup.killÂÂÂÂÂÂÂÂÂÂÂÂ controllerÂÂÂÂÂ io.latency memory.max ÂÂÂÂmemory.swap.highÂÂÂÂÂÂÂ pids.maxcgroup.max.depthÂÂÂÂÂÂÂ cpu.idleÂÂÂÂÂÂÂ io.max memory.min ÂÂÂÂmemory.swap.maxÂÂÂÂÂÂÂÂ pids.peakcgroup.max.descendantsÂ cpu.maxÂÂÂÂÂÂÂÂ io.stat memory.numa_statÂÂÂÂmemory.swap.peakÂÂÂÂÂÂÂ subprocessescgroup.procsÂÂÂÂÂÂÂÂÂÂÂ cpu.max.burstÂÂ io.weight memory.oom.groupÂÂÂÂmemory.zswap.current_var_lib_condor_execute_slot1_4@xxxxxxxxxxxxxxxxxxxxxxxcgroup.statÂÂÂÂÂÂÂÂÂÂÂÂ cpu.statÂÂÂÂÂÂÂ memory.current memory.peakÂÂÂÂ memory.zswap.maxcgroup.subtree_controlÂ cpu.stat.localÂ memory.events memory.reclaimÂÂÂÂ memory.zswap.writeback
[1.b]
[root@batch1505 ~]# ls/sys/fs/cgroup/system.slice/condordesy.service/condorjob.slice/_var_lib_condor_execute_slot1_4@xxxxxxxxxxxxxxxxxxxxxxx/_var_lib_condor_execute_slot1_4@xxxxxxxxxxxxxxxxxxxxxxxcgroup.controllersÂÂÂÂÂ cgroup.procsÂÂÂÂÂÂÂÂÂÂÂ cpu.maxio.bfq.weightÂÂ memory.eventsÂÂÂÂÂÂÂ memory.numa_statmemory.swap.eventsÂÂÂ memory.zswap.writebackcgroup.eventsÂÂÂÂÂÂÂÂÂÂ cgroup.statÂÂÂÂÂÂÂÂÂÂÂÂ cpu.max.burstio.latencyÂÂÂÂÂ memory.events.localÂ memory.oom.groupmemory.swap.highÂÂÂÂÂ pids.currentcgroup.freezeÂÂÂÂÂÂÂÂÂÂ cgroup.subtree_controlÂ cpu.stat io.maxÂÂÂÂÂÂÂ memory.highÂÂÂÂÂÂÂÂÂ memory.peak memory.swap.max pids.eventscgroup.killÂÂÂÂÂÂÂÂÂÂÂÂ cgroup.threadsÂÂÂÂÂÂÂÂÂ cpu.stat.local io.statÂÂÂÂÂÂÂ memory.lowÂÂÂÂÂÂÂÂÂÂ memory.reclaim memory.swap.peakpids.events.localcgroup.max.depthÂÂÂÂÂÂÂ cgroup.typeÂÂÂÂÂÂÂÂÂÂÂÂ cpu.weightio.weightÂÂÂÂÂÂ memory.maxÂÂÂÂÂÂÂÂÂÂ memory.stat memory.zswap.currentÂpids.maxcgroup.max.descendantsÂ cpu.idleÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ cpu.weight.nicememory.currentÂ memory.minÂÂÂÂÂÂÂÂÂÂ memory.swap.currentmemory.zswap.maxÂÂÂÂÂ pids.peak
OnÂ 2026-02-19 13:12, svatosm@xxxxxx wrote:
Hi,
we would like to ask for advice with problem we see in our HTCondorinstallation in Prague. We set up cgroups to watch job memory limitsand kill jobs when they go over the requested amount + 10% extramargin. But when we are looking at the information about killed jobsin history files, we get strange messages in LastHoldReason like this:
Error from slot1_10@xxxxxxxxxxxxxxxxxxxxxxxx: Job has gone overcgroup memory limit of 16000 megabytes. Last measured usage: 1085megabytes. Consider resubmitting with a higher request_memory.
All the data from history files for that job are below. So, there arecouple of odd things in it. First, the RequestMemory is 1000 and thelast measured value is still below 110% of the request. Then theorigin of the 16000MB limit in the message. Would anyone know andexplanation for that?
thanks

Michal Svatos


AccountingGroup = "group_atlas.prod.atlasprd001"
AcctGroup = "group_atlas.prod"
AcctGroupUser = "atlasprd001"
ActivationDuration = 15628
ActivationSetupDuration = 2
ActivationTeardownDuration = 1769446065
Arguments = ""
BlockReadKbytes = 0
BlockReads = 0
BlockWriteKbytes = 0
BlockWrites = 0
BytesRecvd = 107197.0
BytesSent = 0.0
ClusterId = 4218714
Cmd = "/var/spool/arc/session/2bcb04c60a3e/condorjob.sh"
CommittedSlotTime = 0
CommittedSuspensionTime = 0
CommittedTime = 0
CondorPlatform = "$CondorPlatform: x86_64_AlmaLinux9 $"
CondorVersion = "$CondorVersion: 24.0.15 2025-12-12 BuildID: 856728PackageID: 24.0.15-1 GitSHA: b96c5ce3 $"
CpusProvisioned = 8
CumulativeRemoteSysCpu = 832.0
CumulativeRemoteUserCpu = 15572.0
CumulativeSlotTime = 125040.0
CumulativeSuspensionTime = 0
CurrentHosts = 0
DiskProvisioned = 167779148
DiskUsage = 225000
DiskUsage_RAW = 216908
EnteredCurrentStatus = 1770035531
Environment = ""
Err = "/var/spool/arc/session/2bcb04c60a3e.comment"
ExecutableSize = 22
ExecutableSize_RAW = 21
ExecuteDirWasEncrypted = false
ExitBySignal = false
ExitStatus = 0
GPUsProvisioned = 0
GlobalJobId = "arc1.farm.particle.cz#4218714.0#1769417681"
ImageSize = 1250000
ImageSize_RAW = 1111608
In = "/dev/null"
Iwd = "/var/spool/arc/session/2bcb04c60a3e"
JobCpuLimit = 345600
JobCurrentFinishTransferInputDate = 1769430438
JobCurrentReconnectAttempt = undefined
JobCurrentStartDate = 1769430436
JobCurrentStartExecutingDate = 1769430439
JobCurrentStartTransferInputDate = 1769430438
JobDescription = "gridjob"
JobFinishedHookDone = 1770035532
JobLeaseDuration = 2400
JobMemoryLimit = 1024000
JobNotification = 0
JobPrio = 0
JobRunCount = 1
JobStartDate = 1769430436
JobStatus = 3
JobSubmitFile = "/var/spool/arc/session/2bcb04c60a3e/condorjob.jdl"
JobSubmitMethod = 0
JobTimeLimit = 345600
JobUniverse = 5
LastHoldReason = "Error from slot1_10@xxxxxxxxxxxxxxxxxxxxxxxx: Jobhas gone over cgroup memory limit of 16000 megabytes. Last measuredusage: 1085 megabytes.Â Consider resubmitting with a higherrequest_memory."
LastHoldReasonCode = 34
LastHoldReasonSubCode = 102
LastJobLeaseRenewal = 1769446065
LastJobStatus = 5
LastMatchTime = 1769430436
LastPublicClaimId = "<172.16.17.4:9618?addrs=[2001-718-401-6017-20-0-17-4]-9618+172.16.17.4-9618&alias=turin04.farm.particle.cz&noUDP&sock=startd_4127_5b71>#1769425273#107#..."
LastRejMatchReason = "no match found "
LastRejMatchTime = 1769430402
LastRemoteHost = "slot1_10@xxxxxxxxxxxxxxxxxxxxxxxx"
LastRemoteWallClockTime = 15630.0
LastSuspensionTime = 0
LastVacateTime = 1769446065
LeaveJobInQueue = false
MATCH_EXP_MachineScalingFactorFZU = "2.174242424242424E+00"
MATCH_EXP_MachineScalingFactorHEPSPEC06 = "2.296000000000000E+01"
MATCH_EXP_MachineScalingSlotWeight = "8"
MachineAttrCpus0 = 8
MachineAttrScalingFactorFZU0 = 2.174242424242424
MachineAttrScalingFactorHEPSPEC060 = 22.96
MachineAttrSlotWeight0 = 8
MachineScalingFactorFZU ="$$([ifThenElse(isUndefined(ScalingFactorFZU), 1.00,ScalingFactorFZU)])"MachineScalingFactorHEPSPEC06 = "$$([ifThenElse(isUndefined(ScalingFactorHEPSPEC06), 10.56,ScalingFactorHEPSPEC06)])"MachineScalingSlotWeight = "$$([ifThenElse(isUndefined(SlotWeight),0.00, SlotWeight)])"
MaxHosts = 1
MemoryProvisioned = 16000
MemoryUsage = ((ResidentSetSize + 1023) / 1024)
MinHosts = 1
MyType = "Job"
NordugridQueue = "grid"
NumCkpts = 0
NumCkpts_RAW = 0
NumHolds = 1
NumHoldsByReason = [ JobOutOfResources = 1 ]
NumJobCompletions = 0
NumJobMatches = 1
NumJobStarts = 1
NumRestarts = 0
NumShadowStarts = 1
NumSystemHolds = 0
OrigMaxHosts = 1
Out = "/var/spool/arc/session/2bcb04c60a3e.comment"
Owner = "atlasprd001"
PeriodicRemove = (JobStatus == 1 && NumJobStarts > 0) ||RemoteUserCpu + RemoteSysCpu > JobCpuLimit || RemoteWallClockTime >JobTimeLimit || (JobStatus == 1 && NumJobStarts > 0)
ProcId = 0
QDate = 1769417681
Rank = 0.0
RecentBlockReadKbytes = 0
RecentBlockReads = 0
RecentBlockWriteKbytes = 0
RecentBlockWrites = 0
RecentStatsLifetimeStarter = 1200
RemoteSysCpu = 832.0
RemoteUserCpu = 15572.0
RemoteWallClockTime = 15630.0
RemoveReason = "via condor_rm (by user atlasprd001)"
RequestCpus = 1
RequestDisk = 20971520 * RequestCpus
RequestMemory = 1000
Requirements = ((NumJobStarts == 0) && (((Arch == "X86_64") && (OpSys=? = "LINUX") && ((OpSysName =?= "CentOS") || (OpSysName =?="AlmaLinux")) && (OpSysMajorVer =?= 9)))) && (TARGET.Disk >=RequestDisk) && (TARGET.Memory >= RequestMemory) &&(TARGET.HasFileTransfer) && (NumJobStarts == 0)
ResidentSetSize = 1000000
ResidentSetSize_RAW = 997532
ScratchDirFileCount = 2727
ShouldTransferFiles = "YES"
StartdPrincipal ="execute-side@matchsession/2001:718:401:6017:20:0:17:4"
StatsLifetimeStarter = 15626
StreamErr = false
StreamOut = false
TargetType = "Machine"
TotalSubmitProcs = 1
TotalSuspensions = 0
TransferIn = false
TransferInFinished = 1769430438
TransferInStarted = 1769430438
TransferInput = "/var/spool/arc/session/2bcb04c60a3e"
TransferInputSizeMB = 0
TransferInputStats = [ CedarFilesCountTotal = 9;CedarFilesCountLastRun = 9 ]
TransferOutputStats = [Â ]
User = "atlasprd001@xxxxxxxxxxxxxxxx"
UserLog = "/var/spool/arc/session/2bcb04c60a3e/log"
VacateReason = "Error from slot1_10@xxxxxxxxxxxxxxxxxxxxxxxx: Job hasgone over cgroup memory limit of 16000 megabytes. Last measuredusage: 1085 megabytes.Â Consider resubmitting with a higherrequest_memory."
VacateReasonCode = 34
VacateReasonSubCode = 102
WhenToTransferOutput = "ON_EXIT_OR_EVICT"
use_x509userproxy = true
x509UserProxyEmail = "atlas.pilot1@xxxxxxx"
x509UserProxyExpiration = 1769760905
x509UserProxyFQAN = "/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=atlpilo1/CN=614260/CN=Robot: ATLAS Pilot1,/atlas/Role=production/Capability=NULL,/atlas/Role=NULL/Capability=NULL,/atlas/usatlas/Role=NULL/Capability=NULL"
x509UserProxyFirstFQAN = "/atlas/Role=production/Capability=NULL"
x509UserProxyVOName = "atlas"
x509userproxy = "/var/spool/arc/session/2bcb04c60a3e/user.proxy"
x509userproxysubject = "/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=atlpilo1/CN=614260/CN=Robot: ATLAS Pilot1"*** Offset = 29744629 ClusterId = 4218714 ProcId = 0 Owner ="atlasprd001" CompletionDate = -1
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxxwith a
subject: Unsubscribe
The archives can be found at: https://www-auth.cs.wisc.edu/lists/htcondor-users/

References:
- [HTCondor-users] Jobs are reporting cgroups kills at strange limits
  - From: svatosm@xxxxxx
- Re: [HTCondor-users] Jobs are reporting cgroups kills at strange limits
  - From: Thomas Hartmann

Prev by Date: Re: [HTCondor-users] Jobs are reporting cgroups kills at strange limits
Next by Date: Re: [HTCondor-users] Jobs are reporting cgroups kills at strange limits
Previous by thread: Re: [HTCondor-users] Jobs are reporting cgroups kills at strange limits
Next by thread: [HTCondor-users] Lost slot claims
Index(es):
- Date
- Thread