Hi,

what does condor think about the memory usage of the job in question? What does condor_history <jobid> -af memoryusage report?

Best
christoph
--
Christoph Beyer
DESY Hamburg
IT-Department
Notkestr. 85
Building 02b, Room 009
22607 Hamburg
phone:+49-(0)40-8998-2317
mail: christoph.beyer@xxxxxxx

From: "Seung-Jin Sul" <ssul@xxxxxxx>
To: "HTCondor-Users Mail List" <htcondor-users@xxxxxxxxxxx>
CC: "Ramani Kothadia" <ramanik@xxxxxxx>, "Daniela Cassol" <dcassol@xxxxxxx>
Sent: Monday, 10 February 2025 20:55:01
Subject: Re: [HTCondor-users] periodic_remove of memory overuse job is not working

Hi,

So I tested by adding this policy to both the HTCondor server configuration and the HTCondor worker configuration (which will be started on a SLURM compute node).

```
HoldOverMem = (ifThenElse(ResidentSetSize =!= UNDEFINED, ResidentSetSize,1) > 1100 * RequestMemory)
HoldOverMemReason = "Memory usage higher than 1.1 x requested memory"
SYSTEM_PERIODIC_HOLD = $(SYSTEM_PERIODIC_HOLD) || $(HoldOverMem)
SYSTEM_PERIODIC_HOLD_REASON = ifThenElse($(HoldOverMem), $(HoldOverMemReason), \
$(SYSTEM_PERIODIC_HOLD_REASON) )
```

I even tested with

```
USE_CGROUPS = True
CGROUP_MEMORY_LIMIT_POLICY = hard
```

but nothing changed. The job completed successfully without interruption.

I've confirmed that the process uses as much memory as we expected.

```
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
...
109887 jaws 20 0 56.0g 53.9g 3380 S 0.0 85.8 0:20.06 python3
```

Is there anything else I can test?

Thank you!

On Fri, Feb 7, 2025 at 12:50 PM Beyer, Christoph <christoph.beyer@xxxxxxx> wrote:

Hi,

on the schedd you could use something like:

HoldOverMem = (ifThenElse(ResidentSetSize =!= UNDEFINED, ResidentSetSize,1) > 1100 * RequestMemory)
HoldOverMemReason = "Memory usage higher than 1.1 x requested memory"
SYSTEM_PERIODIC_HOLD = $(SYSTEM_PERIODIC_HOLD) || $(HoldOverMem)
SYSTEM_PERIODIC_HOLD_REASON = ifThenElse($(HoldOverMem), $(HoldOverMemReason), \
$(SYSTEM_PERIODIC_HOLD_REASON) )

Best
christoph
--
Christoph Beyer
DESY Hamburg
IT-Department
Notkestr. 85
Building 02b, Room 009
22607 Hamburg
phone:+49-(0)40-8998-2317
mail: christoph.beyer@xxxxxxx

From: "Seung-Jin Sul" <ssul@xxxxxxx>
To: "HTCondor-Users Mail List" <htcondor-users@xxxxxxxxxxx>, "Ramani Kothadia" <ramanik@xxxxxxx>, "Daniela Cassol" <dcassol@xxxxxxx>
Sent: Friday, 7 February 2025 20:00:45
Subject: [HTCondor-users] periodic_remove of memory overuse job is not working

Hi,

We are using HTCondor with a SLURM glide-in and are trying to set a memory usage limit policy like the one below:

```
# Check jobs if they are using more than 10% over memory assigned to the slot
MEMORY_EXCEEDED_SLOT = ( isDefined(MemoryUsage) && isDefined(Memory) && MemoryUsage*1.1 > Memory )
# Check jobs if they are using more than 10% over the requested memory
MEMORY_EXCEEDED_REQ = ( isDefined(MemoryUsage) && isDefined(RequestMemory) && MemoryUsage > RequestMemory*1.1 )
MEMORY_EXCEEDED = ($(MEMORY_EXCEEDED_SLOT)) || ($(MEMORY_EXCEEDED_REQ))
# Return 137 for memory overuse
use POLICY : WANT_HOLD_IF(MEMORY_EXCEEDED, 137, memory usage exceeded available memory)
# If memory is exceeded, evict the job
PREEMPT = ($(PREEMPT)) || ($(MEMORY_EXCEEDED))
# Suspend and hold the job
WANT_SUSPEND = ($(WANT_SUSPEND)) && ($(MEMORY_EXCEEDED)) =!= TRUE
WANT_HOLD = ($(MEMORY_EXCEEDED))
# Message to Job's owner
WANT_HOLD_REASON = ifThenElse( $(WANT_HOLD), "Job exceeded available memory.", undefined )
```

We apply this policy in the HTCondor configuration file for the STARTD of a SLURM worker compute node.

However, our test jobs that used more memory than the `request_memory` still ran to completion, so we concluded that the policy was ineffective.

We've also tried adding the line below

```
periodic_remove = (MemoryUsage > request_memory * 1.1)
```

directly to the HTCondor job description file, but that did not work either.

This is an example job description file:

```
JobBatchName = cromwell_9d037456_samtools
Iwd=/.../call-samtools/execution
+Owner=UNDEFINED
request_memory=2048.0
request_disk=25600.0
request_cpus=1
request_gpus = 0
error=/..../call-samtools/execution/stderr
output=/.../call-samtools/execution/stdout
log_xml=true
executable=/.../call-samtools/execution/dockerScript
log=/.../call-samtools/execution/execution.log
environment="JAWS_JOB_ID=$(CLUSTER), JAWS_CROMWELL_ID=9d037456-e221-4761-adf5-af7f08a21d34"
requirements=isUndefined(TARGET.AliveUntil) ? TRUE : TARGET.AliveUntil > time() + (10 * 60)
periodic_remove=(MemoryUsage > request_memory * 1.1)
queue
```

FYI, this is the memory usage output from one of the tests:

```
Requested memory size: 400 GB
Requested runtime memory: 120 GB
Total Memory: 503.49 GB
Used Memory: 32.82 GB
Available Memory: 466.00 GB
Allocating 400 GB of RAM.
Successfully allocated 400 GB of RAM.
Sleeping for 10 minutes.
Current memory usage: 399.95 GB
```

Here the `request_memory` is 120 GB and the consumed memory is 400 GB, but the job was not put on hold.

Did we miss anything? Any ideas?

Thank you!

Best,
Seung-Jin Sul, Ph.D.
Advanced Analysis Group
DOE Joint Genome Institute
Lawrence Berkeley National Lab
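
For anyone checking the thresholds discussed in this thread by hand, here is a minimal Python sketch of the two comparisons. It assumes the units given in the HTCondor job ClassAd documentation (`ResidentSetSize` in KiB, `MemoryUsage` and `RequestMemory` in MiB) and treats the "GB" figures from the test above as GiB. The function names are hypothetical and only mirror the expressions quoted here; this is not HTCondor code.

```python
# Unit assumptions: ResidentSetSize is in KiB; MemoryUsage and
# RequestMemory are in MiB. "GB" in the test output is treated as GiB.
KIB_PER_MIB = 1024


def exceeds_hold_over_mem(resident_set_size_kib, request_memory_mib):
    """Mirror of the HoldOverMem expression quoted in this thread.

    An undefined ResidentSetSize is modelled as None and replaced by 1,
    like ifThenElse(ResidentSetSize =!= UNDEFINED, ResidentSetSize, 1).
    """
    rss = 1 if resident_set_size_kib is None else resident_set_size_kib
    return rss > 1100 * request_memory_mib


def exceeds_request(memory_usage_mib, request_memory_mib, factor=1.1):
    """Mirror of MemoryUsage > RequestMemory * 1.1 (both values in MiB)."""
    return memory_usage_mib > request_memory_mib * factor


# Figures from the test in this thread: 120 GB requested, about 400 GB used.
request_mib = 120 * 1024
used_mib = 400 * 1024
used_kib = used_mib * KIB_PER_MIB

print(exceeds_request(used_mib, request_mib))        # True
print(exceeds_hold_over_mem(used_kib, request_mib))  # True
print(exceeds_hold_over_mem(None, request_mib))      # False

# Effective multiplier of "1100 *" once KiB are converted to MiB.
print(round(1100 / KIB_PER_MIB, 3))                  # 1.074
```

With these numbers both comparisons come out true, so the thresholds themselves would be crossed if the attributes carry the values seen in `top`. Note also that `1100 * RequestMemory` works out to roughly 1.07 x the request rather than exactly 1.1 x, because a MiB is 1024 KiB.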
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
The archives can be found at: https://www-auth.cs.wisc.edu/lists/htcondor-users/