
Re: [HTCondor-users] Problem with memory enforcement -> crashing Startd



Hello

After some further trial and error, here is the solution I have come up with.
So far it works without problems.

This criterion looks at the maximum amount of RAM used by a job. If the job uses more RAM than it requested, it is put on hold.
It does not take virtual memory into account.

#Only used RAM
MEMORY_USED_BY_JOB_MB = ResidentSetSize/1024
MEMORY_EXCEEDED = ifThenElse(isUndefined(ResidentSetSize), False, ( ($(MEMORY_USED_BY_JOB_MB)) > Memory ))

PREEMPT = ($(PREEMPT)) || ($(MEMORY_EXCEEDED))

WANT_SUSPEND = ($(WANT_SUSPEND)) && ($(MEMORY_EXCEEDED)) =!= TRUE
WANT_HOLD = ( $(MEMORY_EXCEEDED) )
WANT_HOLD_REASON = \
        ifThenElse( $(MEMORY_EXCEEDED), \
                "Your job exceeded the amount of requested memory on this machine.", \
                undefined )
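
To illustrate how the guard behaves, here is a worked evaluation of MEMORY_EXCEEDED for two cases. The numbers in the second case are made up for illustration. It assumes ResidentSetSize is reported in KiB and Memory in MB, which is what the division by 1024 implies.

```
# Case 1: no job running, ResidentSetSize is UNDEFINED
#   isUndefined(ResidentSetSize)        -> TRUE
#   ifThenElse(TRUE, False, ...)        -> FALSE
#   MEMORY_EXCEEDED                     -> FALSE (PREEMPT stays evaluable)
#
# Case 2: hypothetical job, ResidentSetSize = 6291456, Memory = 4752
#   isUndefined(ResidentSetSize)        -> FALSE
#   6291456/1024                        -> 6144
#   6144 > 4752                         -> TRUE
#   MEMORY_EXCEEDED                     -> TRUE (job is preempted and put on hold)
```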

I hope you find this useful.

Best regards,
Hermann

On Fri, 2013-03-08 at 11:20 +0100, Hermann Fuchs wrote:
Hello

Thank you very much for your tip.

It seems the problem appears only when a job has been evicted or is not running.
When a job was evicted, ResidentSetSize was undefined, which caused the startd to crash...

From the log:
Classad debug: Activity --> Busy
03/08/13 11:15:04 Classad debug: KeyboardIdle --> 447
03/08/13 11:15:04 Classad debug: CpuBusyTime --> 0
03/08/13 11:15:04 Classad debug: SUSPEND --> FALSE
03/08/13 11:15:04 Classad debug: ResidentSetSize --> UNDEFINED
03/08/13 11:15:04 Classad debug: Memory --> 4752
03/08/13 11:15:04 Classad debug: ResidentSetSize --> UNDEFINED
03/08/13 11:15:04 Classad debug: ( ( ( ( ( Activity == "Suspended" ) && ( ( CurrentTime - EnteredCurrentActivity ) > 30 * 60 ) ) || ( SUSPEND ) ) ) || ( ( ResidentSetSize ) > Memory && ResidentSetSize isnt "UNDEFINED" ) ) --> UNDEFINED
03/08/13 11:15:04 slot1_1: Can't evaluate PREEMPT in the context of following ads
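
Reading the debug output, the guard in this attempt compares ResidentSetSize with the string "UNDEFINED" rather than with the UNDEFINED value, so it does not seem to filter anything. Below is a step-by-step sketch of how the expression would evaluate under the usual ClassAd three-valued rules. This is an interpretation of the log lines above, not additional startd output.

```
# Values taken from the log: Activity = "Busy", SUSPEND = FALSE,
# ResidentSetSize = UNDEFINED, Memory = 4752
#
#   ( Activity == "Suspended" ) && ( ... )     -> FALSE
#   FALSE || ( SUSPEND )                       -> FALSE
#   ( ResidentSetSize ) > Memory               -> UNDEFINED
#   ResidentSetSize isnt "UNDEFINED"           -> TRUE  (an undefined value is not a string)
#   UNDEFINED && TRUE                          -> UNDEFINED
#   FALSE || UNDEFINED                         -> UNDEFINED
#
# PREEMPT therefore ends up UNDEFINED instead of TRUE or FALSE.
```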

Does anyone have an idea how I could solve this?
Something like an if condition would be great,
e.g. if ResidentSetSize is undefined, do not use it in this ClassAd expression.

Best regards,
Hermann
On Thu, 2013-03-07 at 09:09 -0500, Tim St Clair wrote:
Add debug() around your expressions and set STARTD_DEBUG=D_FULLDEBUG, and it will yield better insight into how the expression is being evaluated and what could be going wrong.


My guess is RSS is not in the Machine.Ad, but the Job.Ad.  


ref: http://research.cs.wisc.edu/htcondor/manual/v7.9/4_1HTCondor_s_ClassAd.html#39930
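
A minimal sketch of that suggestion, assuming the startd's verbosity is controlled by STARTD_DEBUG (STARTD_LOG is the knob that names the log file) and that the expression to inspect is the PREEMPT line from the original post:

```
# Raise the startd log verbosity, as suggested above
STARTD_DEBUG = D_FULLDEBUG

# Wrap the whole expression in debug() so that each attribute lookup
# and the final result are written to the StartLog
PREEMPT = debug( ($(PREEMPT)) || ($(MEMORY_EXCEEDED)) )
```

After a condor_reconfig, the StartLog should then show "Classad debug:" lines for each attribute referenced by the expression, followed by the value of the whole expression.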


Cheers,
Tim




From: "Hermann Fuchs" <hermann.fuchs@xxxxxxxxxxxxxxxx>
To: htcondor-users@xxxxxxxxxxx
Sent: Thursday, March 7, 2013 4:20:26 AM
Subject: [HTCondor-users] Problem with memory enforcement -> crashing Startd

Hi

I am trying to enforce a memory limit on our cluster.
What I want is for condor to compare the actually used RAM with the requested Memory.
If the job uses more RAM than requested (Memory) the job is to be put on hold.

Starting from the example from the manual I came up with:

#Only Resident Memory (RAM) taken into account
MEMORY_USED_BY_JOB_MB = ResidentSetSize/1024
MEMORY_EXCEEDED = ($(MEMORY_USED_BY_JOB_MB)) > Memory
PREEMPT = ($(PREEMPT)) || ($(MEMORY_EXCEEDED))
WANT_SUSPEND = ($(WANT_SUSPEND)) && ($(MEMORY_EXCEEDED)) =!= TRUE
WANT_HOLD = ( $(MEMORY_EXCEEDED) )

WANT_HOLD_REASON = \
       ifThenElse( $(MEMORY_EXCEEDED), \
               "Your job exceeded the amount of requested memory on this machine.", \
               undefined )

However, this leads to a crashing Startd.

From the StartLog: it starts with
03/07/13 11:16:42 slot1_1: Can't evaluate PREEMPT in the context of following ads
followed by the ClassAds of the running job (omitted here),
followed by:
03/07/13 11:16:42 ERROR "Can't evaluate PREEMPT" at line 1615 in file /slots/01/dir_12130/userdir/src/condor_startd.V6/Resource.cpp
03/07/13 11:16:42 slot1_1: Changing state and activity: Claimed/Busy -> Preempting/Killing
03/07/13 11:16:42 startd exiting because of fatal exception.
03/07/13 11:17:23 Setting maximum accepts per cycle 8.
03/07/13 11:17:23 ******************************************************
03/07/13 11:17:23 ** condor_startd (CONDOR_STARTD) STARTING UP
03/07/13 11:17:23 ** /usr/sbin/condor_startd
03/07/13 11:17:23 ** SubsystemInfo: name=STARTD type=STARTD(7) class=DAEMON(1)
03/07/13 11:17:23 ** Configuration: subsystem:STARTD local:<NONE> class:DAEMON
03/07/13 11:17:23 ** $CondorVersion: 7.8.7 Dec 12 2012 BuildID: 86173 $
03/07/13 11:17:23 ** $CondorPlatform: x86_64_deb_6.0 $
03/07/13 11:17:23 ** PID = 1224
03/07/13 11:17:23 ** Log last touched 3/7 11:16:42
03/07/13 11:17:23 ******************************************************

I would greatly appreciate any help in this regard.

Greetings from Austria,
Hermann
-- 
-------------
DI Hermann Fuchs
Christian Doppler Laboratory for Medical Radiation Research for Radiation Oncology
Department of Radiation Oncology
Medical University Vienna
Währinger Gürtel 18-20
A-1090 Wien

Tel.  + 43 / 1 / 40 400 7271
Mail. hermann.fuchs@xxxxxxxxxxxxxxxx

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/

