
Re: [HTCondor-users] periodic_remove of memory overuse job is not working



Hi,

What does HTCondor itself record as the memory usage of the job in question? What does condor_history <jobid> -af MemoryUsage report?

Best
christoph


--
Christoph Beyer
DESY Hamburg
IT-Department

Notkestr. 85
Building 02b, Room 009
22607 Hamburg

phone:+49-(0)40-8998-2317
mail: christoph.beyer@xxxxxxx


From: "Seung-Jin Sul" <ssul@xxxxxxx>
To: "HTCondor-Users Mail List" <htcondor-users@xxxxxxxxxxx>
CC: "Ramani Kothadia" <ramanik@xxxxxxx>, "Daniela Cassol" <dcassol@xxxxxxx>
Sent: Monday, 10 February 2025 20:55:01
Subject: Re: [HTCondor-users] periodic_remove of memory overuse job is not working

Hi,

So I tested by adding this policy to both the HTCondor server configuration and the HTCondor worker configuration (which will be started on a SLURM compute node).
```
HoldOverMem = (ifThenElse(ResidentSetSize =!= UNDEFINED, ResidentSetSize,1) > 1100 * RequestMemory)
HoldOverMemReason = "Memory usage higher than 1.1 x requested memory"
SYSTEM_PERIODIC_HOLD = $(SYSTEM_PERIODIC_HOLD) || $(HoldOverMem)
SYSTEM_PERIODIC_HOLD_REASON = ifThenElse($(HoldOverMem), $(HoldOverMemReason), \
    $(SYSTEM_PERIODIC_HOLD_REASON) )
```
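
For reference, the unit arithmetic behind `HoldOverMem` can be sketched in Python. This is only an illustration, not HTCondor code: it assumes the documented units of the job ClassAd attributes (`ResidentSetSize` in KiB, `RequestMemory` in MiB), and the function names `hold_over_mem` and `effective_factor` are made up for the example.

```python
# Illustrative model of the HoldOverMem expression; not part of HTCondor.
# Assumed units (from the job ClassAd attribute documentation):
#   ResidentSetSize is reported in KiB, RequestMemory in MiB.
KIB_PER_MIB = 1024


def hold_over_mem(resident_set_size_kib, request_memory_mib):
    """Mirror ifThenElse(ResidentSetSize =!= UNDEFINED, ResidentSetSize, 1)
    > 1100 * RequestMemory, with None standing in for UNDEFINED."""
    rss = resident_set_size_kib if resident_set_size_kib is not None else 1
    return rss > 1100 * request_memory_mib


def effective_factor():
    """Multiple of the requested memory at which the hold would trigger."""
    return 1100 / KIB_PER_MIB


if __name__ == "__main__":
    request_mib = 2048  # request_memory from the example submit file
    threshold_kib = 1100 * request_mib
    print("threshold in KiB:", threshold_kib)
    print("effective factor:", effective_factor())
    print("undefined RSS triggers hold:", hold_over_mem(None, request_mib))
    print("RSS just above threshold triggers hold:",
          hold_over_mem(threshold_kib + 1, request_mib))
```

Under these assumptions the factor 1100 corresponds to roughly 1.07 times the requested memory rather than exactly 1.1, and the expression can only fire once `ResidentSetSize` has actually been updated in the job ad.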

I even tested with
```
USE_CGROUPS = True
CGROUP_MEMORY_LIMIT_POLICY = hard
```
but nothing changed. The job completed successfully without interruption.

I've confirmed that the process uses as much memory as we expected:
```
    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 ...
 109887 jaws      20   0   56.0g  53.9g   3380 S   0.0  85.8   0:20.06 python3
```


Is there anything else I can test?

Thank you!

On Fri, Feb 7, 2025 at 12:50 PM Beyer, Christoph <christoph.beyer@xxxxxxx> wrote:
Hi,

On the schedd you could use something like:

HoldOverMem = (ifThenElse(ResidentSetSize =!= UNDEFINED, ResidentSetSize,1) > 1100 * RequestMemory)
HoldOverMemReason = "Memory usage higher than 1.1 x requested memory"
SYSTEM_PERIODIC_HOLD = $(SYSTEM_PERIODIC_HOLD) || $(HoldOverMem)
SYSTEM_PERIODIC_HOLD_REASON = ifThenElse($(HoldOverMem), $(HoldOverMemReason), \
    $(SYSTEM_PERIODIC_HOLD_REASON) )

Best
christoph
--
Christoph Beyer
DESY Hamburg
IT-Department

Notkestr. 85
Building 02b, Room 009
22607 Hamburg

phone:+49-(0)40-8998-2317
mail: christoph.beyer@xxxxxxx


From: "Seung-Jin Sul" <ssul@xxxxxxx>
To: "HTCondor-Users Mail List" <htcondor-users@xxxxxxxxxxx>, "Ramani Kothadia" <ramanik@xxxxxxx>, "Daniela Cassol" <dcassol@xxxxxxx>
Sent: Friday, 7 February 2025 20:00:45
Subject: [HTCondor-users] periodic_remove of memory overuse job is not working

Hi,

We are using HTCondor with a SLURM glide-in and are trying to set a memory usage limit policy like the one below:
```
# Check jobs if they are using more than 10% over memory assigned to the slot
MEMORY_EXCEEDED_SLOT = ( isDefined(MemoryUsage) && isDefined(Memory) && MemoryUsage*1.1 > Memory )
# Check jobs if they are using more than 10% over the requested memory
MEMORY_EXCEEDED_REQ = ( isDefined(MemoryUsage) && isDefined(RequestMemory) && MemoryUsage > RequestMemory*1.1 )

MEMORY_EXCEEDED = ($(MEMORY_EXCEEDED_SLOT)) || ($(MEMORY_EXCEEDED_REQ))
# Return 137 for memory overuse
use POLICY : WANT_HOLD_IF(MEMORY_EXCEEDED, 137, memory usage exceeded available memory)

# If memory exceeded, evict job
PREEMPT = ($(PREEMPT)) || ($(MEMORY_EXCEEDED))

# Suspend and hold the job
WANT_SUSPEND = ($(WANT_SUSPEND)) && ($(MEMORY_EXCEEDED)) =!= TRUE
WANT_HOLD = ($(MEMORY_EXCEEDED))
# Message to Job's owner
WANT_HOLD_REASON = ifThenElse( $(WANT_HOLD), "Job exceeded available memory.", undefined )

```
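
To make the two thresholds concrete, here is a small Python sketch of the `MEMORY_EXCEEDED_SLOT` and `MEMORY_EXCEEDED_REQ` conditions exactly as written above. It only models the arithmetic: `None` stands in for an undefined attribute, the function names are invented for the illustration, and the sample values (including treating "GB" as 1024 MiB) are assumptions rather than measurements from a real slot.

```python
# Illustrative model of the two conditions; not HTCondor code.
# All values are assumed to be in MiB, None models an undefined attribute.

def memory_exceeded_slot(memory_usage, memory):
    """isDefined(MemoryUsage) && isDefined(Memory) && MemoryUsage*1.1 > Memory"""
    if memory_usage is None or memory is None:
        return False
    return memory_usage * 1.1 > memory


def memory_exceeded_req(memory_usage, request_memory):
    """isDefined(MemoryUsage) && isDefined(RequestMemory)
    && MemoryUsage > RequestMemory*1.1"""
    if memory_usage is None or request_memory is None:
        return False
    return memory_usage > request_memory * 1.1


if __name__ == "__main__":
    # Hypothetical slot with 1000 MiB: the slot check already fires at 950 MiB.
    print(memory_exceeded_slot(950, 1000))
    # Request of 120 GB with 400 GB in use (values taken as GiB, an assumption).
    print(memory_exceeded_req(400 * 1024, 120 * 1024))
    # With MemoryUsage undefined, neither condition can fire.
    print(memory_exceeded_slot(None, 1000), memory_exceeded_req(None, 2048))
```

As written, the slot condition becomes true once usage passes about 91% of the slot's `Memory` (usage greater than `Memory / 1.1`), while the request condition becomes true at 110% of `RequestMemory`. Both stay false for as long as `MemoryUsage` is undefined.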

We apply this policy in the HTCondor configuration file for the STARTD on a SLURM worker compute node.
However, our test jobs that used more memory than their `request_memory` still ran to completion, so the policy appears to have had no effect.

We've also tried adding the following directly
```
periodic_remove = (MemoryUsage > request_memory * 1.1)
```
to the HTCondor job description file, but that had no effect either.

This is an example job description file
```
JobBatchName = cromwell_9d037456_samtools
Iwd=/.../call-samtools/execution
+Owner=UNDEFINED
request_memory=2048.0
request_disk=25600.0
request_cpus=1
request_gpus = 0
error=/..../call-samtools/execution/stderr
output=/.../call-samtools/execution/stdout
log_xml=true
executable=/.../call-samtools/execution/dockerScript
log=/.../call-samtools/execution/execution.log
environment="JAWS_JOB_ID=$(CLUSTER), JAWS_CROMWELL_ID=9d037456-e221-4761-adf5-af7f08a21d34"
requirements=isUndefined(TARGET.AliveUntil) ? TRUE : TARGET.AliveUntil > time() + (10 * 60)
periodic_remove=(MemoryUsage > request_memory * 1.1)
queue
```

 

FYI, this is the memory usage output from one of the tests:
```
Requested memory size: 400 GB
Requested runtime memory: 120 GB
Total Memory: 503.49 GB
Used Memory: 32.82 GB
Available Memory: 466.00 GB
Allocating 400 GB of RAM.
Successfully allocated 400 GB of RAM.
Sleeping for 10 minutes.
Current memory usage: 399.95 GB
```
Here the `request_memory` is 120 GB and the memory consumed is 400 GB, but the job was not put on hold.

Did we miss anything? Any ideas?

Thank you!

Best,



Seung-Jin Sul, Ph.D.
pronouns: he / his
510-495-8456     |    ssul@xxxxxxx
Staff Software Developer, Tech Lead
Advanced Analysis Group
DOE Joint Genome Institute
Lawrence Berkeley National Lab

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe

The archives can be found at: https://www-auth.cs.wisc.edu/lists/htcondor-users/