Hi Hermann,
As an alternative:
RHEL6 or later nodes can use kernel cgroups to police RAM usage. This can be enabled with:
BASE_CGROUPS=/condor
(assuming you have cgroups enabled and mounted on your system). Then, you can add:
MEMORY_LIMIT=hard
to have the kernel kill the job when the memory limit is hit. The recipe you have below is still poll-based: it checks the memory limits every X minutes, which leaves plenty of time for a job to take down the node.
Additionally, there's a patch posted to have HTCondor put the job on hold instead of letting the kernel kill the process. Unfortunately, it's languishing in review :(
Enjoy!
Brian
Hello
After some further trial and error, here is the solution I have come up with.
So far it works without problems.
This criterion looks at the maximum amount of RAM used by a job. If the job uses more RAM than requested, it is put on hold.
It does not take into account the virtual memory.
#Only used RAM
MEMORY_USED_BY_JOB_MB = ResidentSetSize/1024
MEMORY_EXCEEDED = ifThenElse(isUndefined(ResidentSetSize), False, ( ($(MEMORY_USED_BY_JOB_MB)) > Memory ))
PREEMPT = ($(PREEMPT)) || ($(MEMORY_EXCEEDED))
WANT_SUSPEND = ($(WANT_SUSPEND)) && ($(MEMORY_EXCEEDED)) =!= TRUE
WANT_HOLD = ( $(MEMORY_EXCEEDED) )
WANT_HOLD_REASON = \
ifThenElse( $(MEMORY_EXCEEDED), \
"Your job exceeded the amount of requested memory on this machine.", \
undefined )
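A worked example of the check above, as a sketch. It assumes ResidentSetSize is reported in KiB and Memory in MB; the Memory value is the one from the log quoted further down, and the ResidentSetSize value is made up for illustration.

```
# Illustrative values only (the ResidentSetSize value is hypothetical):
#   ResidentSetSize = 5242880 (KiB)  ->  5242880/1024 = 5120 (MB)
#   Memory          = 4752 (MB)
#   5120 > 4752                      ->  MEMORY_EXCEEDED = True, job is put on hold
#
# Job evicted or not yet running:
#   ResidentSetSize = UNDEFINED      ->  isUndefined(ResidentSetSize) = True
#                                    ->  MEMORY_EXCEEDED = False, PREEMPT can still be evaluated
```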
I hope you find this useful.
Best regards,
Hermann
On Fri, 2013-03-08 at 11:20 +0100, Hermann Fuchs wrote:
Hello
Thank you very much for your tip.
It seems the problem appears only when a job was evicted/not running.
When a job was evicted, the residentSetSize was undefined, which caused the startd to crash...
From the log:
Classad debug: Activity --> Busy
03/08/13 11:15:04 Classad debug: KeyboardIdle --> 447
03/08/13 11:15:04 Classad debug: CpuBusyTime --> 0
03/08/13 11:15:04 Classad debug: SUSPEND --> FALSE
03/08/13 11:15:04 Classad debug: ResidentSetSize --> UNDEFINED
03/08/13 11:15:04 Classad debug: Memory --> 4752
03/08/13 11:15:04 Classad debug: ResidentSetSize --> UNDEFINED
03/08/13 11:15:04 Classad debug: ( ( ( ( ( Activity == "Suspended" ) && ( ( CurrentTime - EnteredCurrentActivity ) > 30 * 60 ) ) || ( SUSPEND ) ) ) || ( ( ResidentSetSize ) > Memory && ResidentSetSize isnt "UNDEFINED" ) ) --> UNDEFINED
03/08/13 11:15:04 slot1_1: Can't evaluate PREEMPT in the context of following ads
Are there any ideas how I could solve this?
Something like an if condition would be great,
e.g. if ResidentSetSize is undefined, do not use it in this ClassAd expression.
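Reading the last debug line above, the attempted guard likely fails because "UNDEFINED" in quotes is a string rather than the UNDEFINED keyword. A sketch of the evaluation, assuming the usual three-valued ClassAd semantics, followed by one possible guard (untested here):

```
# ResidentSetSize > Memory           -> UNDEFINED (one operand is UNDEFINED)
# ResidentSetSize isnt "UNDEFINED"   -> TRUE      (UNDEFINED is not identical to a string)
# UNDEFINED && TRUE                  -> UNDEFINED
# FALSE || UNDEFINED                 -> UNDEFINED, so PREEMPT cannot be evaluated
#
# Comparing against the keyword first makes the left side FALSE,
# and FALSE && UNDEFINED evaluates to FALSE:
MEMORY_EXCEEDED = (ResidentSetSize isnt UNDEFINED) && ((ResidentSetSize/1024) > Memory)
```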
Best regards,
Hermann
On Thu, 2013-03-07 at 09:09 -0500, Tim St Clair wrote:
Add debug() around your expressions and set STARTD_DEBUG=D_FULLDEBUG
and it will yield better insight into how the expression is being evaluated and what could be going wrong.
My guess is RSS is not in the Machine.Ad, but the Job.Ad.
ref: http://research.cs.wisc.edu/htcondor/manual/v7.9/4_1HTCondor_s_ClassAd.html#39930
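A minimal sketch of that suggestion, assuming STARTD_DEBUG is the knob that sets the startd's log verbosity and reusing the MEMORY_USED_BY_JOB_MB macro from the original post:

```
# Raise the startd log level and trace the attribute lookups
# inside the wrapped expression
STARTD_DEBUG = D_FULLDEBUG
MEMORY_EXCEEDED = debug( ($(MEMORY_USED_BY_JOB_MB)) > Memory )
```

Each evaluation should then appear in the StartLog as "Classad debug:" lines showing what every referenced attribute resolved to.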
Cheers,
Tim
From: "Hermann Fuchs" <hermann.fuchs@xxxxxxxxxxxxxxxx>
To: htcondor-users@xxxxxxxxxxx
Sent: Thursday, March 7, 2013 4:20:26 AM
Subject: [HTCondor-users] Problem with memory enforcement -> crashing Startd
Hi
I am trying to enforce a memory limit on our cluster.
What I want is for condor to compare the actually used RAM with the requested Memory.
If the job uses more RAM than requested (Memory) the job is to be put on hold.
Starting from the example from the manual I came up with:
#Only Resident Memory (RAM) taken into account
MEMORY_USED_BY_JOB_MB = ResidentSetSize/1024
MEMORY_EXCEEDED = ($(MEMORY_USED_BY_JOB_MB)) > Memory
PREEMPT = ($(PREEMPT)) || ($(MEMORY_EXCEEDED))
WANT_SUSPEND = ($(WANT_SUSPEND)) && ($(MEMORY_EXCEEDED)) =!= TRUE
WANT_HOLD = ( $(MEMORY_EXCEEDED) )
WANT_HOLD_REASON = \
ifThenElse( $(MEMORY_EXCEEDED), \
"Your job exceeded the amount of requested memory on this machine.", \
undefined )
However, this leads to a crashing Startd.
From the StartLog: it starts with
03/07/13 11:16:42 slot1_1: Can't evaluate PREEMPT in the context of following ads
followed by the ClassAds of the running job (omitted here)
followed by:
03/07/13 11:16:42 ERROR "Can't evaluate PREEMPT" at line 1615 in file /slots/01/dir_12130/userdir/src/condor_startd.V6/Resource.cpp
03/07/13 11:16:42 slot1_1: Changing state and activity: Claimed/Busy -> Preempting/Killing
03/07/13 11:16:42 startd exiting because of fatal exception.
03/07/13 11:17:23 Setting maximum accepts per cycle 8.
03/07/13 11:17:23 ******************************************************
03/07/13 11:17:23 ** condor_startd (CONDOR_STARTD) STARTING UP
03/07/13 11:17:23 ** /usr/sbin/condor_startd
03/07/13 11:17:23 ** SubsystemInfo: name=STARTD type=STARTD(7) class=DAEMON(1)
03/07/13 11:17:23 ** Configuration: subsystem:STARTD local:<NONE> class:DAEMON
03/07/13 11:17:23 ** $CondorVersion: 7.8.7 Dec 12 2012 BuildID: 86173 $
03/07/13 11:17:23 ** $CondorPlatform: x86_64_deb_6.0 $
03/07/13 11:17:23 ** PID = 1224
03/07/13 11:17:23 ** Log last touched 3/7 11:16:42
03/07/13 11:17:23 ******************************************************
I would greatly appreciate any help in this regard.
Greetings from Austria,
Hermann
--
-------------
DI Hermann Fuchs
Christian Doppler Laboratory for Medical Radiation Research for Radiation Oncology
Department of Radiation Oncology
Medical University Vienna
Währinger Gürtel 18-20
A-1090 Wien
Tel. + 43 / 1 / 40 400 7271
Mail. hermann.fuchs@xxxxxxxxxxxxxxxx
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/