
[HTCondor-users] periodic_remove of memory overuse job is not working



Hi,

We are running HTCondor with a SLURM glide-in and are trying to enforce a memory usage limit with the policy below:
```
# Check jobs if they are using more than 10% over memory assigned to the slot
MEMORY_EXCEEDED_SLOT = ( isDefined(MemoryUsage) && isDefined(Memory) && MemoryUsage*1.1 > Memory )
# Check jobs if they are using more than 10% over the requested memory
MEMORY_EXCEEDED_REQ = ( isDefined(MemoryUsage) && isDefined(RequestMemory) && MemoryUsage > RequestMemory*1.1 )

MEMORY_EXCEEDED = ($(MEMORY_EXCEEDED_SLOT)) || ($(MEMORY_EXCEEDED_REQ))
# Return 137 for memory overuse
use POLICY : WANT_HOLD_IF(MEMORY_EXCEEDED, 137, memory usage exceeded available memory)

# If memory is exceeded, evict the job
PREEMPT = ($(PREEMPT)) || ($(MEMORY_EXCEEDED))

# Suspend and hold the job
WANT_SUSPEND = ($(WANT_SUSPEND)) && ($(MEMORY_EXCEEDED)) =!= TRUE
WANT_HOLD = ($(MEMORY_EXCEEDED))
# Message to Job's owner
WANT_HOLD_REASON = ifThenElse( $(WANT_HOLD), "Job exceeded available memory.", undefined )

```

We apply this policy in the HTCondor configuration file for the STARTD on a SLURM worker compute node.
However, test jobs that used more memory than their `request_memory` still ran to completion, so the policy had no effect.

We've also tried adding the following line
```
periodic_remove = (MemoryUsage > request_memory * 1.1)
```
to the HTCondor job description file, but that did not work either.

This is an example job description file:
```
JobBatchName = cromwell_9d037456_samtools
Iwd=/.../call-samtools/execution
+Owner=UNDEFINED
request_memory=2048.0
request_disk=25600.0
request_cpus=1
request_gpus = 0
error=/..../call-samtools/execution/stderr
output=/.../call-samtools/execution/stdout
log_xml=true
executable=/.../call-samtools/execution/dockerScript
log=/.../call-samtools/execution/execution.log
environment="JAWS_JOB_ID=$(CLUSTER), JAWS_CROMWELL_ID=9d037456-e221-4761-adf5-af7f08a21d34"
requirements=isUndefined(TARGET.AliveUntil) ? TRUE : TARGET.AliveUntil > time() + (10 * 60)
periodic_remove=(MemoryUsage > request_memory * 1.1)
queue
```

FYI, this is the memory usage output from one of our tests:
```
Requested memory size: 400 GB
Requested runtime memory: 120 GB
Total Memory: 503.49 GB
Used Memory: 32.82 GB
Available Memory: 466.00 GB
Allocating 400 GB of RAM.
Successfully allocated 400 GB of RAM.
Sleeping for 10 minutes.
Current memory usage: 399.95 GB
```
Here `request_memory` is 120 GB and the job consumed 400 GB, but the job was not put on hold.
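
For reference, here is a minimal Python sketch of the arithmetic we expected the 10% rule to apply to this test. It assumes both values are expressed in MB with 1 GB = 1024 MB; the helper name `exceeds_request` is hypothetical and only mirrors the comparison, it is not part of HTCondor.

```python
# Assumption: memory values are in MB, with 1 GB = 1024 MB.
MB_PER_GB = 1024


def exceeds_request(memory_usage_mb: float, request_memory_mb: float,
                    tolerance: float = 1.1) -> bool:
    """Mirror the intended check: usage greater than request * 1.1."""
    return memory_usage_mb > request_memory_mb * tolerance


# Values taken from the test output above.
request_memory_mb = 120 * MB_PER_GB  # requested runtime memory
memory_usage_mb = 400 * MB_PER_GB    # memory allocated by the test job
threshold_mb = request_memory_mb * 1.1

print(f"threshold: {threshold_mb:.0f} MB")
print(f"usage:     {memory_usage_mb} MB")
print(f"exceeded:  {exceeds_request(memory_usage_mb, request_memory_mb)}")
```

By this arithmetic the job is far above the 10% margin, so we expected the expression to evaluate to true.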

Did we miss anything? Any ideas?

Thank you!

Best,



Seung-Jin Sul, Ph.D.
pronouns: he / his
510-495-8456 | ssul@xxxxxxx
Staff Software Developer, Tech Lead
Advanced Analysis Group
DOE Joint Genome Institute
Lawrence Berkeley National Lab