Wrap your expressions in debug() and set STARTD_DEBUG = D_FULLDEBUG.
The StartLog will then show how each expression is being evaluated and what could be going wrong.
My guess is that ResidentSetSize (RSS) is not in the machine ad but in the job ad.
ref: http://research.cs.wisc.edu/htcondor/manual/v7.9/4_1HTCondor_s_ClassAd.html#39930
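A minimal sketch of what I mean, reusing the macro names from your message. The debug level is set through STARTD_DEBUG, and debug() is the ClassAd function described at the link above. The `=?= TRUE` guard is only my assumption about a way to keep PREEMPT from evaluating to UNDEFINED while ResidentSetSize is missing; I have not tested it on your pool.

```
# Write full debug output to the StartLog
STARTD_DEBUG = D_FULLDEBUG

# Only resident memory (RAM) is taken into account
MEMORY_USED_BY_JOB_MB = ResidentSetSize/1024

# debug() logs how the wrapped expression is evaluated
MEMORY_EXCEEDED = debug(($(MEMORY_USED_BY_JOB_MB)) > Memory)

# Assumption: =?= TRUE maps an UNDEFINED comparison to FALSE,
# so PREEMPT always evaluates to a boolean
PREEMPT = ($(PREEMPT)) || (($(MEMORY_EXCEEDED)) =?= TRUE)
```

After a condor_reconfig, the evaluation of MEMORY_EXCEEDED should show up in the StartLog each time PREEMPT is checked.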
Cheers,
Tim
From: "Hermann Fuchs" <hermann.fuchs@xxxxxxxxxxxxxxxx>
To: htcondor-users@xxxxxxxxxxx
Sent: Thursday, March 7, 2013 4:20:26 AM
Subject: [HTCondor-users] Problem with memory enforcement -> crashing Startd
Hi
I am trying to enforce a memory limit on our cluster.
What I want is for Condor to compare the RAM actually used with the requested Memory.
If the job uses more RAM than requested (Memory) the job is to be put on hold.
Starting from the example from the manual I came up with:
#Only Resident Memory (RAM) taken into account
MEMORY_USED_BY_JOB_MB = ResidentSetSize/1024
MEMORY_EXCEEDED = ($(MEMORY_USED_BY_JOB_MB)) > Memory
PREEMPT = ($(PREEMPT)) || ($(MEMORY_EXCEEDED))
WANT_SUSPEND = ($(WANT_SUSPEND)) && ($(MEMORY_EXCEEDED)) =!= TRUE
WANT_HOLD = ( $(MEMORY_EXCEEDED) )
WANT_HOLD_REASON = \
ifThenElse( $(MEMORY_EXCEEDED), \
"Your job exceeded the amount of requested memory on this machine.", \
undefined )
However, this leads to a crashing Startd.
From the StartLog: it starts with
03/07/13 11:16:42 slot1_1: Can't evaluate PREEMPT in the context of following ads
followed by the ClassAds of the running job (omitted here),
followed by:
03/07/13 11:16:42 ERROR "Can't evaluate PREEMPT" at line 1615 in file /slots/01/dir_12130/userdir/src/condor_startd.V6/Resource.cpp
03/07/13 11:16:42 slot1_1: Changing state and activity: Claimed/Busy -> Preempting/Killing
03/07/13 11:16:42 startd exiting because of fatal exception.
03/07/13 11:17:23 Setting maximum accepts per cycle 8.
03/07/13 11:17:23 ******************************************************
03/07/13 11:17:23 ** condor_startd (CONDOR_STARTD) STARTING UP
03/07/13 11:17:23 ** /usr/sbin/condor_startd
03/07/13 11:17:23 ** SubsystemInfo: name=STARTD type=STARTD(7) class=DAEMON(1)
03/07/13 11:17:23 ** Configuration: subsystem:STARTD local:<NONE> class:DAEMON
03/07/13 11:17:23 ** $CondorVersion: 7.8.7 Dec 12 2012 BuildID: 86173 $
03/07/13 11:17:23 ** $CondorPlatform: x86_64_deb_6.0 $
03/07/13 11:17:23 ** PID = 1224
03/07/13 11:17:23 ** Log last touched 3/7 11:16:42
03/07/13 11:17:23 ******************************************************
I would greatly appreciate any help in this regard.
Greetings from Austria,
Hermann
--
-------------
DI Hermann Fuchs
Christian Doppler Laboratory for Medical Radiation Research for Radiation Oncology
Department of Radiation Oncology
Medical University Vienna
Währinger Gürtel 18-20
A-1090 Wien
Tel. + 43 / 1 / 40 400 7271
Mail. hermann.fuchs@xxxxxxxxxxxxxxxx
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/