condor_ssh_to_job showed the python executables in the
interruptible wait state (per ps) at about 14 hours of runtime.
Now, at 20 hours, trying to ssh to the job puts it on hold with
the message: "Job has gone over cgroup memory limit of 4096
megabytes. Peak usage: 4097 megabytes. Consider resubmitting
with a higher request_memory."
Note that condor_ssh_to_job puts the
interactive job in the same cgroup as the job proper, so if the
job is close to hitting the OOM killer and running out of memory,
ssh'ing to the job might nudge it over the limit. Feels like
that's what is going on here.
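
If it really is the memory limit, the simplest way out is to
resubmit with a larger request_memory. A minimal sketch of the
submit-file change (the 6 GB figure is just an example,
comfortably above the reported 4097 MB peak):

    # in the submit file, before the queue statement
    request_memory = 6 GB    # the submit language understands unit suffixes
    queue
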
Problems:
- How can we deal with being stuck in the run state? We need to
  figure out why the jobs aren't making progress. Could they be
  blocked on I/O that's never coming back? (A diagnostic sketch
  follows this list.)
- When I try to qedit the RequestMemory classad, values like 6G,
  6GB, 6000M or 6000MB are accepted, but -better-analyze says:
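
For the first problem, one way to see whether the python
processes really are blocked on I/O is to ask the kernel what
they are sleeping on. A hedged sketch using standard Linux tools,
run on the execute machine (or via condor_ssh_to_job, if you can
get in without nudging the job over its memory limit); <PID> is
whatever ps reports for the python process:

    ps -o pid,stat,wchan:32,cmd -p <PID>   # STAT "D" = uninterruptible I/O wait
    cat /proc/<PID>/wchan                  # kernel function the process sleeps in
    ls -l /proc/<PID>/fd                   # open files/sockets it may be waiting on
    strace -p <PID> -e trace=read,write,poll,select   # watch the blocking syscall
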
Sadly, condor_qedit only knows about the classad language; it
doesn't know about the submit language, and the unit suffixes
only exist in the submit language.
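
So in the classad language RequestMemory is just an integer
number of megabytes, and a plain value should go through; a
sketch, with a made-up job id:

    condor_qedit 1234.0 RequestMemory 6000     # 6000 MB, no unit suffix here
    condor_q 1234.0 -af RequestMemory          # check the new value
    condor_release 1234.0                      # let the held job try again
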
-greg