Hi Thomas,

can you check if your jobs' cgroups have an OOM limit set in their cgroup limits rather than Condor's memory watchdog?
i.e., is there a limit set in a process's memory.limit_in_bytes? E.g., for us it looks like

  /sys/fs/cgroup/memory/system.slice/condor.service/condor_var_lib_condor_execute_slot1_25@xxxxxxxxxxxxxxxxx/memory.limit_in_bytes

but your Docker setup is probably on a different path. The path should be under the cgroup mount

  > mount | grep cgroup | grep memory

plus a job process's sub-path from

  > grep memory /proc/{PID}/cgroup
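Roughly, putting the two together could look like this (an untested sketch, assuming cgroup v1 with the memory controller mounted; PID is a placeholder for one of the job's processes, e.g. taken from the slot's process tree):

  # where the memory controller is mounted (usually /sys/fs/cgroup/memory)
  CGROOT=$(mount | grep cgroup | grep memory | awk '{print $3}')
  # the job process's memory cgroup sub-path, e.g. /system.slice/condor.service/condor_var_lib_condor_execute_slot...
  CGPATH=$(grep memory /proc/${PID}/cgroup | cut -d: -f3)
  # the effective hard limit in bytes; a very large value (~9.2e18) means no limit is set
  cat "${CGROOT}${CGPATH}/memory.limit_in_bytes"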
Cheers,
  Thomas

On 21/06/2023 12.41, Thomas Birkett - STFC UKRI via HTCondor-users wrote:

Hi Condor Community,

I have an odd issue with a small percentage of jobs we run. We have a small subset of jobs that go on hold due to resources being exceeded, for example:

  LastHoldReason = "Error from slot1_38@xxxxxxxxxxxxxxxxxxxxxxx: Docker job has gone over memory limit of 4100 Mb"

However, we haven't configured any resource limits to hold jobs. I also notice the only ClassAd that appears to match the memory limit is:

  MemoryProvisioned = 4100

These jobs are then removed by a SYSTEM_PERIODIC_REMOVE statement to clear down held jobs. My question to the community is: why is the job going on hold in the first place? The only configured removal limit / PeriodicRemove statement we configure is on a per-job level, shown below:

  PeriodicRemove = (JobStatus == 1 && NumJobStarts > 0) || ((ResidentSetSize =!= undefined ? ResidentSetSize : 0) > JobMemoryLimit)

I cannot replicate this behaviour in my testing, and I cannot find any reason why the job went on hold. Researching the relevant ClassAds, I see:

  MemoryProvisioned: The amount of memory in MiB allocated to the job. With statically-allocated slots, it is the amount of memory space allocated to the slot. With dynamically-allocated slots, it is based upon the job attribute RequestMemory, but may be larger due to the minimum given to a dynamic slot.

At our site we dynamically assign our slots, and the RequestMemory for this job is "RequestMemory = 4096". I find this even more perplexing as this is a very rare issue, with over 90% of the jobs working well and completing: same job type, same VO, same config. Any assistance debugging this issue will be gratefully received.

Many thanks,

Thomas Birkett
Senior Systems Administrator
Scientific Computing Department
Science and Technology Facilities Council (STFC)
Rutherford Appleton Laboratory, Chilton, Didcot OX11 0QX
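P.S.: To narrow it down from the Condor side, it might also be worth pulling the hold details and memory attributes for the affected jobs straight from the schedd, e.g. something along these lines (just a sketch; adjust the constraint and attribute list as needed, and use condor_history for jobs that have already left the queue):

  > condor_q -constraint 'JobStatus == 5' -af:j HoldReasonCode HoldReasonSubCode RequestMemory MemoryProvisioned HoldReason

If I remember correctly, a HoldReasonCode of 34 means the starter itself put the job on hold for exceeding its memory limit, i.e. independent of any PeriodicRemove / SYSTEM_PERIODIC_* policy.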