I first tried adding the settings to the startd config and running
condor_reconfig, but the overtime job wasn't removed. I then tried
the same on the schedd, with the same result.
When I had only the time rule, it was on the schedd. The CPU and
memory rules, which end up conflicting with the schedd's
SYSTEM_PERIODIC_HOLD, come from the startd.
So maybe my mistake is the attempt to combine them in one
location.
Any ideas?
David
On Thu, Aug 19, 2021 at 11:40 PM Jaime Frey <jfrey@xxxxxxxxxxx> wrote:
I commend you on your advanced use of ClassAd operators.
You will need to use $() when referencing the TooMuch*
parameters in your SYSTEM_PERIODIC_HOLD and
SYSTEM_PERIODIC_HOLD_REASON values. Since the TooMuch*
parameters are config file macros and not ClassAd
attributes in the job ads, they need to be expanded at
config file parsing/lookup time.
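
As a rough illustration of the difference, here is a minimal sketch. The thread does not show how the TooMuch* macros were defined, so the definition of TooMuchMemory below is an assumption that borrows the memory check from the startd configuration quoted further down.

```
# Assumed definition: a config file macro, not a job ad attribute.
TooMuchMemory = (ResidentSetSize > 1024 * RequestMemory)

# Without $(), the name is left as-is and would be looked up as a
# job ad attribute called TooMuchMemory, which the jobs do not have:
# SYSTEM_PERIODIC_HOLD = TooMuchMemory

# With $(), the macro is replaced by its value when the config is read:
SYSTEM_PERIODIC_HOLD = $(TooMuchMemory)
```

Running condor_config_val SYSTEM_PERIODIC_HOLD afterwards should print the expanded expression rather than the macro name, which is a quick way to check that the substitution happened.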
MyHoldReason = {"", "TooMuchDisk eval to True", "TooMuchTime eval to True", "TooMuchMemory eval to True", "TooMuchImg eval to True"}
SYSTEM_PERIODIC_HOLD_REASON = $(MyHoldReason)[max({int(TooMuchDisk), int(TooMuchTime)*2, int(TooMuchMemory)*3})]
The idea is to define MyHoldReason as an array of strings, and set
SYSTEM_PERIODIC_HOLD_REASON as one string from the array, whose
index comes from the boolean values of the checks.
I think this should work, provided that int(True) == 1 and
int(False) == 0, but have not yet tested it.
Stefano
On 19/08/21 18:47, Jaime Frey wrote:
It's a little cumbersome to have multiple hold triggers with
distinct reason messages. You need to chain them together
manually. Here's a pattern to follow to keep it from becoming too
confusing:
If you have more than two hold expressions, you may need to add
some parentheses to the SYSTEM_PERIODIC_HOLD_REASON expression to
ensure the nested ?: operators evaluate properly.
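
The pattern itself did not survive in this copy of the message, so the following is only a sketch of one way to chain three triggers, not necessarily what was originally posted. The macro names TooMuchTime, TooMuchMemory and AdminHold are illustrative; the expressions are the ones quoted elsewhere in this thread, and the reason strings are shortened.

```
# Illustrative macro names; each holds one trigger expression.
TooMuchTime   = ((time() - JobCurrentStartDate) > IfThenElse(HiMemUser && (RequestMemory > 40*1024), 120*3600, 72*3600))
TooMuchMemory = (ResidentSetSize > 1024 * RequestMemory)
AdminHold     = (AdminHoldJob =?= True)

# Hold the job if any trigger fires.
SYSTEM_PERIODIC_HOLD = $(TooMuchTime) || $(TooMuchMemory) || $(AdminHold)

# Pick the reason for the first trigger that is true. Each nested ?:
# is wrapped in its own parentheses so the grouping is explicit.
SYSTEM_PERIODIC_HOLD_REASON = ( ($(TooMuchTime) =?= True) ? "Job is running over time" : \
                                ( ($(TooMuchMemory) =?= True) ? "Memory usage too high" : \
                                  "Held by AdminHoldJob test flag" ) )
```

The =?= True comparisons are a defensive choice: if a trigger evaluates to undefined for some job (for example, because JobCurrentStartDate is missing), the test yields False and the chain falls through to the next reason instead of making the whole reason expression undefined.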
It returned 6 jobs that were held due to high memory usage, not
for running over time.
That indicated that the following, from the startd configuration,
is causing the conflict:
SYSTEM_PERIODIC_HOLD = ( ResidentSetSize > 1024 * RequestMemory )
SYSTEM_PERIODIC_HOLD_REASON = "Memory usage too high (Trying to use more then requested-memory)"
What is the proper way to create multiple SYSTEM_PERIODIC_HOLD
expressions without them conflicting with each other?
Looking at a job that should have been put on hold:
HiMemUser = 0
RequestMemory = 5120
JobCurrentStartDate = 1628598643   ## Time() - 1628598643 > 72*3600
(assuming Time() is working properly and returning the time as an
epoch value).
The error seems to indicate a typo, but I cannot figure it out.
All the arguments that need to be evaluated are present and have
the expected values.
On Wed, Aug 18, 2021 at 12:03 AM Jaime Frey <jfrey@xxxxxxxxxxx> wrote:
I can't think of anything that would normally cause a periodic
hold expression to stop working.
Here are a couple of ideas for debugging the problem...
When there's a job in the queue that you think should be affected
by the periodic hold expression, try running this command:
condor_q -all -nobatch -constraint "`condor_config_val SYSTEM_PERIODIC_HOLD`"
If that doesn't display the problematic job(s), try altering the
expression (removing or adjusting terms) to see what's needed to
make the jobs appear. That can reveal differences between what
you're checking for and what's in the job ads.
To ensure the schedd is evaluating the periodic job expressions on
a timely basis, you can try amending the expression to always hold
special test jobs. For example, you can add this to the end of
your config files:
SYSTEM_PERIODIC_HOLD = ($(SYSTEM_PERIODIC_HOLD)) || AdminHoldJob =?= true
Then, submit a test job with the following line in the submit
file:
+AdminHoldJob=True
Then, wait and see if the job gets held.
 - Jaime
> On Aug 17, 2021, at 5:09 AM, David Cohen <cdavid@xxxxxxxxxxxxxxxxxxxxxx> wrote:
>
> Hi,
> A SYSTEM_PERIODIC_HOLD, configured on the schedd, that used to
> work is ignored lately:
>
> SYSTEM_PERIODIC_HOLD = (Time() - JobCurrentStartDate) > IfthenElse(HiMemUser && (RequestMemory > 40*1024), 120*3600 , 72*3600)
> SYSTEM_PERIODIC_HOLD_Reason = "Job Is Running over time"
> SYSTEM_PERIODIC_REMOVE = JobStatus == 5 && (Time() - EnteredCurrentStatus) > 600
>
> I could find no reference to that in the system's log.
> How can I debug that?
>
> Best,
> David