
Re: [HTCondor-users] periodic_remove of memory overuse job is not working



Dear Christoph

Thank you for the comment. What do you mean by `on the sched`?
FYI, we start an HTCondor worker on a SLURM compute node like this:

```
#!/bin/bash
#SBATCH --time=72:00:00 ... --job-name=jaws_condor_worker --exclusive...
...
export CONDOR_CONFIG=/.../config/htcondor_worker.conf
...
condor_master -f -n jaws_htcondor_worker_$HOSTNAME
```
and we have the original memory limit policy in the `htcondor_worker.conf` file.

How can we specify policies specifically for the SCHEDD?


Thank you!
Best,
Seung-Jin Sul, Ph.D.
pronouns: he / his
510-495-8456 | ssul@xxxxxxx
Staff Software Developer, Tech Lead
Advanced Analysis Group
DOE Joint Genome Institute
Lawrence Berkeley National Lab


On Fri, Feb 7, 2025 at 12:50 PM Beyer, Christoph <christoph.beyer@xxxxxxx> wrote:
Hi,

on the sched you could use something like:

HoldOverMem = (ifThenElse(ResidentSetSize =!= UNDEFINED, ResidentSetSize,1) > 1100 * RequestMemory)
HoldOverMemReason = "Memory usage higher than 1.1 x requested memory"
SYSTEM_PERIODIC_HOLD = $(SYSTEM_PERIODIC_HOLD) || $(HoldOverMem)
SYSTEM_PERIODIC_HOLD_REASON = ifThenElse($(HoldOverMem), $(HoldOverMemReason), \
    $(SYSTEM_PERIODIC_HOLD_REASON) )
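
A note on units for the expression above, offered as a sketch rather than a tested recipe: `ResidentSetSize` is reported in KiB while `RequestMemory` is in MiB, so the factor 1100 works out to roughly 1.07 x the request (1100 / 1024). `SYSTEM_PERIODIC_HOLD` is evaluated by the condor_schedd, so these lines would go in the configuration read on the host that runs the schedd, not in the worker's config. If an exact 1.1 x threshold is wanted, a variant (the name `HoldOverMemExact` is made up for illustration) might look like:

```
# Assumption: ResidentSetSize is in KiB, RequestMemory is in MiB (1 MiB = 1024 KiB).
# Hypothetical macro name; true when RSS exceeds exactly 1.1 x the request.
HoldOverMemExact = (ifThenElse(ResidentSetSize =!= UNDEFINED, ResidentSetSize, 1) > 1.1 * 1024 * RequestMemory)
```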

Best
christoph
--
Christoph Beyer
DESY Hamburg
IT-Department

Notkestr. 85
Building 02b, Room 009
22607 Hamburg

phone:+49-(0)40-8998-2317
mail: christoph.beyer@xxxxxxx


From: "Seung-Jin Sul" <ssul@xxxxxxx>
To: "HTCondor-Users Mail List" <htcondor-users@xxxxxxxxxxx>, "Ramani Kothadia" <ramanik@xxxxxxx>, "Daniela Cassol" <dcassol@xxxxxxx>
Sent: Friday, February 7, 2025 20:00:45
Subject: [HTCondor-users] periodic_remove of memory overuse job is not working

Hi,

We are using HTCondor with a SLURM glide-in and are trying to set a memory usage limit policy like the one below:
```
# Check jobs if they are using more than 10% over memory assigned to the slot
MEMORY_EXCEEDED_SLOT = ( isDefined(MemoryUsage) && isDefined(Memory) && MemoryUsage*1.1 > Memory )
# Check jobs if they are using more than 10% over the requested memory
MEMORY_EXCEEDED_REQ = ( isDefined(MemoryUsage) && isDefined(RequestMemory) && MemoryUsage > RequestMemory*1.1 )

MEMORY_EXCEEDED = ($(MEMORY_EXCEEDED_SLOT)) || ($(MEMORY_EXCEEDED_REQ))
# Return 137 for memory overuse
use POLICY : WANT_HOLD_IF(MEMORY_EXCEEDED, 137, memory usage exceeded available memory)

# If memory exceeded, evict job
PREEMPT = ($(PREEMPT)) || ($(MEMORY_EXCEEDED))

# Suspend and hold the job
WANT_SUSPEND = ($(WANT_SUSPEND)) && ($(MEMORY_EXCEEDED)) =!= TRUE
WANT_HOLD = ($(MEMORY_EXCEEDED))
# Message to Job's owner
WANT_HOLD_REASON = ifThenElse( $(WANT_HOLD), "Job exceeded available memory.", undefined )

```
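
One detail in the block above may be worth a second look: the comment on `MEMORY_EXCEEDED_SLOT` says "more than 10% over" the slot memory, but `MemoryUsage*1.1 > Memory` becomes true once usage passes about 91% of the slot memory. A sketch that matches the comment, written with the `=!= UNDEFINED` test that appears elsewhere in this thread, and assuming both attributes are in MiB:

```
# Sketch only: true when usage is more than 10% above the slot's Memory.
MEMORY_EXCEEDED_SLOT = ( MemoryUsage =!= UNDEFINED && Memory =!= UNDEFINED && MemoryUsage > Memory*1.1 )
```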

We apply this policy in the HTCondor configuration file for the STARTD of a SLURM worker compute node.
However, our test jobs that used more memory than their `request_memory` still ran to completion, so the policy appears to have had no effect.

We also tried adding the following line directly
```
periodic_remove = (MemoryUsage > request_memory * 1.1)
```
to the HTCondor job description file, but that had no effect either.
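
A possible cause worth ruling out, stated as an assumption rather than a diagnosis: `request_memory` is the submit command, while the attribute it sets in the job ClassAd is `RequestMemory` (in MiB). An expression that refers to `request_memory` would then see an undefined attribute and never evaluate to true. A minimal sketch of the same check using the attribute name:

```
# Sketch: compare MemoryUsage (MiB) with the RequestMemory job attribute.
# The =!= UNDEFINED guard keeps the expression false until usage is reported.
periodic_remove = (MemoryUsage =!= UNDEFINED) && (MemoryUsage > RequestMemory * 1.1)
```

As far as I know, `MemoryUsage` is only refreshed periodically, so a short test job may finish before the expression sees its peak usage.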

This is an example job description file:
```
JobBatchName = cromwell_9d037456_samtools
Iwd=/.../call-samtools/execution
+Owner=UNDEFINED
request_memory=2048.0
request_disk=25600.0
request_cpus=1
request_gpus = 0
error=/..../call-samtools/execution/stderr
output=/.../call-samtools/execution/stdout
log_xml=true
executable=/.../call-samtools/execution/dockerScript
log=/.../call-samtools/execution/execution.log
environment="JAWS_JOB_ID=$(CLUSTER), JAWS_CROMWELL_ID=9d037456-e221-4761-adf5-af7f08a21d34"
requirements=isUndefined(TARGET.AliveUntil) ? TRUE : TARGET.AliveUntil > time() + (10 * 60)
periodic_remove=(MemoryUsage > request_memory * 1.1)
queue
```


FYI, this is the memory usage output from one of the tests:
```
Requested memory size: 400 GB
Requested runtime memory: 120 GB
Total Memory: 503.49 GB
Used Memory: 32.82 GB
Available Memory: 466.00 GB
Allocating 400 GB of RAM.
Successfully allocated 400 GB of RAM.
Sleeping for 10 minutes.
Current memory usage: 399.95 GB
```
Here `request_memory` is 120 GB and the job consumed 400 GB, but it was not put on hold.

Did we miss anything? Any ideas?

Thank you!

Best,



Seung-Jin Sul, Ph.D.
pronouns: he / his
510-495-8456 | ssul@xxxxxxx
Staff Software Developer, Tech Lead
Advanced Analysis Group
DOE Joint Genome Institute
Lawrence Berkeley National Lab

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe

The archives can be found at: https://www-auth.cs.wisc.edu/lists/htcondor-users/