Re: [HTCondor-users] problems with jobs requiring more then 2GB memory



If that is the submit file, then yes.
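
As a minimal sketch of the change discussed below, the relevant line in the job's submit description might look like this. The value 3072 is only an assumption for illustration and is not taken from this thread. A value without a unit is interpreted as megabytes, and it has to stay at or below the 4096 MB configured for the slot (SLOT_TYPE_1=cpus=8, memory=4096), otherwise the job cannot match that slot.

```
# Hypothetical value: ask for 3 GB instead of the 2048 MB the job currently requests
request_memory = 3072
```

How submit-condor-job derives this value from the ARC job description is not shown in this thread, so check the generated submit file to confirm the value that actually reaches HTCondor.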

From: Mihai Ciubancan <ciubancan@xxxxxxxx>
Sent: Monday, June 2, 2025 5:14 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: John M Knoeller <johnkn@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] problems with jobs requiring more then 2GB memory
 
Hello,

Thank you TJ for your answer!
Taking into consideration that the jobs are submitted through
submit-condor-job (ARC-CE), this is the file that should be modified to
allow jobs that need more than 2 GB of memory, right?

Best,
Mihai


On 2025-05-30 20:04, John M Knoeller via HTCondor-users wrote:
> The job is running out of memory because it requests only 2 GB of
> RAM but then uses more than that.
>
>  SLOT_TYPE_1_PARTITIONABLE=TRUE
>
>  means that, when the AP decides to run a job, a slot is created with
> the CPUs and memory that the job requested, up to a maximum of 8 CPUs
> and 4 GB, because
>
>  SLOT_TYPE_1=cpus=8, memory=4096
>
>  To fix this, you need to change request_memory in the job's
> submit file so that the job requests more memory.
>
>  -tj
>
> -------------------------
>
>  From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf
> of Mihai Ciubancan <ciubancan@xxxxxxxx>
> Sent: Friday, May 30, 2025 2:28 AM
> To: htcondor-users@xxxxxxxxxxx <htcondor-users@xxxxxxxxxxx>
> Subject: [HTCondor-users] problems with jobs requiring more then 2GB
> memory
>
> Hello,
>
> I am encountering issues with LHCb jobs that require more than 2 GB of
> memory per job. The jobs are failing with the following error:
>
> LastHoldReason = "Error from reserved-LHCb2_5@xxxxxxxxxxxxxx: Job has
> gone over cgroup memory limit of 2048 megabytes. Last measured usage:
> 2033 megabytes.  Consider resubmitting with a higher request_memory."
>
> I have configured partitionable slots:
>
> CLAIM_WORKLIFE=3600
> CONTINUE=TRUE
> JOB_RENICE_INCREMENT=10
> KILL=FALSE
> NUM_SLOTS=4
> NUM_SLOTS_TYPE_1=4
> SLOT_TYPE_1_PARTITIONABLE=TRUE
> SLOT_TYPE_1=cpus=8, memory=4096
> SLOT_TYPE_1_START=Owner=="pillhcb01"
> SLOT_TYPE_1_NAME_PREFIX=reserved-LHCb
> PREEMPT=FALSE
> RANK=0
> SUSPEND=FALSE
> SLOT_TYPE_1_CONSUMPTION_POLICY=False
> CONSUMPTION_POLICY=False
> CLAIM_PARTITIONABLE_LEFTOVERS=False
>
> A cgroup policy is also enabled:
>
> BASE_CGROUP = /system.slice/condor.service
> CGROUP_MEMORY_LIMIT_POLICY = soft
> MAXJOBRETIREMENTTIME = $(HOUR) * 24 * 7
> SYSTEM_PERIODIC_REMOVE =  ResidentSetSize > 3000*RequestMemory
>
> Any suggestions would be highly appreciated!
>
> Best,
> Mihai
>
> _______________________________________________
> HTCondor-users mailing list
> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx
> with a
> subject: Unsubscribe
>
> Join us in June at Throughput Computing 25:
> https://urldefense.com/v3/__https://osg-htc.org/htc25__;!!Mak6IKo!K29kgDu3KqY-v0JvPE9cVXxO9hKbX4vVgC2pMuc85_5TCTwv4huZH_KU-ElZEvUc6BvAtLM_1S1Sk8MicXaY$
> [1]
>
> The archives can be found at:
> https://www-auth.cs.wisc.edu/lists/htcondor-users/  [2]
>
>
> Links:
> ------
> [1]
> https://urldefense.com/v3/__https://osg-htc.org/htc25__;!!Mak6IKo!K29kgDu3KqY-v0JvPE9cVXxO9hKbX4vVgC2pMuc85_5TCTwv4huZH_KU-ElZEvUc6BvAtLM_1S1Sk8MicXaY$
> [2] https://www-auth.cs.wisc.edu/lists/htcondor-users/