Mailing List Archives Authenticated access	UW Madison Computer Sciences Department Computer Systems Lab

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[HTCondor-users] Issues with cgroup memory limits, qedit memory units and expiring scitokens

Date: Tue, 2 Apr 2024 08:37:28 -0700
From: Joseph Areeda <newsreply@xxxxxxxxxx>
Subject: [HTCondor-users] Issues with cgroup memory limits, qedit memory units and expiring scitokens

Hi all,

I apologize for the multiple issues in a single email. I have multiple jobs in similar condition and I hoping for enough insight to break this down into separate issues.

Situation:

I am working on creating a hveto job that runs on a weeks worth of data as well as the current 24 hrs. It is a 3 sequential job DAG.

The memory required by Job is data dependent.

Running 7 DAGs with RequestMemory=4G, 2 of them completed the other 5 are hung in the [condor] Running state. condor_ssh_to_job showed the python executables in the [ps] interruptible wait state at about 14hr runtime now at 20hrs trying to ssh to the job puts it on hold withe the message:

Job has gone over cgroup memory limit of 4096 megabytes. Peak usage: 4097 megabytes.Â Consider resubmitting with a higher request_memory.

the same thing happened when I used condor_vacate_job on one of them.

Problems:

How can we deal with being stuck in the run state?
When I try to qedit the RequestMemory classad values like 6G, 6GB, 6000M or 6000MB are accepted but -better-analyze says:

1102.000:Â Run analysis summary ignoring user priority.Â Of 550 machines,
ÂÂÂ 550 are rejected by your job's requirements
ÂÂÂÂÂ 0 reject your job because of their own requirements
ÂÂÂÂÂ 0 match and are already running your jobs
ÂÂÂÂÂ 0 match but are serving other users
ÂÂÂÂÂ 0 are able to run your job

WARNING:Â Be advised:
ÂÂ No machines matched the jobs's constraints

While setting it to 6000:

1102.000:Â Run analysis summary ignoring user priority.Â Of 550 machines,
ÂÂÂÂÂ 1 are rejected by your job's requirements
ÂÂÂÂ 14 reject your job because of their own requirements
ÂÂÂÂÂ 0 match and are already running your jobs
ÂÂÂÂÂ 0 match but are serving other users
ÂÂÂ 535 are able to run your job

When I get that resolved the

Follow-Ups:
- Re: [HTCondor-users] Issues with cgroup memory limits, qedit memory units and expiring scitokens
  - From: Greg Thain

Prev by Date: Re: [HTCondor-users] can't get flocking to submit jobs
Next by Date: [HTCondor-users] Compute topology(?) questions and resource matching
Previous by thread: Re: [HTCondor-users] can't get flocking to submit jobs
Next by thread: Re: [HTCondor-users] Issues with cgroup memory limits, qedit memory units and expiring scitokens
Index(es):
- Date
- Thread

Mailing List Archives

Authenticated access

[HTCondor-users] Issues with cgroup memory limits, qedit memory units and expiring scitokens

Situation:

Problems: