Hi David, condor_reconfig was enough in my case; the syntax is very "delicate", I think; I had similar problems until things started working as expected.
My "take-home" experience is that when writing the condition, it is essential to prevent it from evaluating to Undefined.
For example, consider the expression for CPU_EXCEEDED: when applied to a running job it should yield only a True/False value. The problem here is that Cpus is not a job ClassAd attribute and always evaluates to Undefined. No running job has a value for Cpus:
[root@ce06-htc ~]# condor_q -glob -all -cons '(jobstatus == 2)
&& (Cpus =!= undefined)' -af:j Owner
[root@ce06-htc ~]#
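One way to make such a condition robust is sketched below; this is untested and the attribute names are illustrative (CpusUsage is assumed here; substitute whatever usage attribute your jobs actually carry):

```
# Hypothetical sketch: guard each attribute with =!= undefined so the
# whole condition collapses to a definite True/False for every job,
# never Undefined. CpusUsage and the 1.2 factor are illustrative.
CPU_EXCEEDED = (CpusUsage =!= undefined) && (RequestCpus =!= undefined) && \
               (CpusUsage > 1.2 * RequestCpus)
```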
Stefano
On 24/08/21 09:44, David Cohen wrote:
Hi,
I changed SYSTEM_PERIODIC_HOLD_REASON to "all at once" as you suggested.
It seems that condor_reconfig is not enough to apply those changes, not to running jobs or even new ones (test jobs still get the old hold reason).
Is there a way, other than draining and restarting the startd, to apply these changes?
I first tried adding it to the startd config and running condor_reconfig; the overtime job wasn't removed. Then I tried the schedd, with the same result.
When I had only the time rule, it was on the schedd. The CPU and memory rules, which end up conflicting with the schedd's SYSTEM_PERIODIC_HOLD, are from the startd. So maybe my failure is the attempt to combine them in one location.
Any ideas?
David
On Thu, Aug 19, 2021
at 11:40 PM Jaime Frey <jfrey@xxxxxxxxxxx>
wrote:
I commend you on your advanced use of ClassAd operators.
You will need to use $() when referencing the TooMuch* parameters in your SYSTEM_PERIODIC_HOLD and SYSTEM_PERIODIC_HOLD_REASON values. Since the TooMuch* parameters are config file macros and not ClassAd attributes in the job ads, they need to be expanded at config file parsing/lookup time.
MyHoldReason = {"", "TooMuchDisk eval to True", "TooMuchTime eval to True", "TooMuchMemory eval to True", "TooMuchImg eval to True"}
SYSTEM_PERIODIC_HOLD_REASON = $(MyHoldReason)[max({int($(TooMuchDisk)), int($(TooMuchTime))*2, int($(TooMuchMemory))*3, int($(TooMuchImg))*4})]
The idea is to define MyHoldReason as an array of strings, and to set SYSTEM_PERIODIC_HOLD_REASON to one string from the array, whose index comes from the boolean values of the checks.
I think this should work, provided that int(True) == 1 and int(False) == 0, but I have not yet tested it.
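A small sketch (in Python, not ClassAds) of the index arithmetic this pattern relies on: each check maps to a weighted integer, max() picks the highest-weighted reason that fired, and index 0 means "no hold".

```python
# Python model of the MyHoldReason indexing: weights 1..4 mirror the
# positions of the reason strings in the array; max() resolves ties
# in favor of the highest-weighted check.
def hold_reason_index(too_much_disk, too_much_time, too_much_memory, too_much_img):
    return max(int(too_much_disk) * 1,
               int(too_much_time) * 2,
               int(too_much_memory) * 3,
               int(too_much_img) * 4)

reasons = ["",
           "TooMuchDisk eval to True",
           "TooMuchTime eval to True",
           "TooMuchMemory eval to True",
           "TooMuchImg eval to True"]
```

Note that if several checks fire at once, only the highest-weighted reason is reported.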
Stefano
On 19/08/21 at 18:47, Jaime Frey wrote:
It's a little cumbersome to have multiple hold triggers with distinct reason messages. You need to chain them together manually. Here's a pattern to follow to keep it from becoming too confusing:
If you have more than two hold expressions, you may need to add some parentheses to the SYSTEM_PERIODIC_HOLD_REASON expression to ensure the nested ?: operators evaluate properly.
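A minimal sketch of that chaining pattern, using two illustrative trigger macros (names and thresholds are assumptions, untested):

```
# Hypothetical triggers (names and thresholds illustrative):
HoldOvertime = (Time() - JobCurrentStartDate) > 72*3600
HoldHighMem = ResidentSetSize > 1024 * RequestMemory

SYSTEM_PERIODIC_HOLD = ($(HoldOvertime)) || ($(HoldHighMem))
# Nested ?: picks the message for whichever trigger fired; the extra
# parentheses keep the evaluation order explicit.
SYSTEM_PERIODIC_HOLD_REASON = ($(HoldOvertime)) ? "Job is running over time" : \
                              (($(HoldHighMem)) ? "Memory usage too high" : "Unknown")
```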
It returned 6 jobs that were held due to high memory usage, not for running over time. That indicated that the following, from the startd configuration, is causing the conflict:
SYSTEM_PERIODIC_HOLD = ( ResidentSetSize > 1024 * RequestMemory )
SYSTEM_PERIODIC_HOLD_REASON = "Memory usage too high (trying to use more than the requested memory)"
What is the proper way to create multiple SYSTEM_PERIODIC_HOLD rules without them conflicting with each other?
Looking at a job that should have been put on hold:
HiMemUser = 0
RequestMemory = 5120
JobCurrentStartDate = 1628598643    ## Time() - 1628598643 > 72*3600
(assuming Time() is working properly and returning the time as an epoch value).
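As a quick arithmetic check (using a hypothetical "now" value, since Time() depends on when it runs), the job above is indeed well past the 72-hour limit:

```python
# Sanity-checking the arithmetic from the job ad above (epoch seconds).
job_start = 1628598643        # JobCurrentStartDate from the job ad
limit = 72 * 3600             # the 72-hour branch of the hold expression
now = 1629190000              # assumed Time() value, roughly mid-Aug 2021
overtime = (now - job_start) > limit
```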
The error seems to indicate a typo, but I cannot figure it out. All the attributes that need to be evaluated are present and have the expected values.
On
Wed, Aug 18,
2021 at 12:03
AM Jaime Frey
<jfrey@xxxxxxxxxxx> wrote:
I can't think of anything that would normally cause a periodic hold expression to stop working. Here are a couple of ideas for debugging the problem…
When there's a job in the queue that you think should be affected by the periodic hold expression, try running this command:
condor_q -all -nobatch -constraint "`condor_config_val SYSTEM_PERIODIC_HOLD`"
If that doesn't display the problematic job(s), try altering the expression (removing or adjusting terms) to see what's needed to make the jobs appear. That can reveal differences between what you're checking for and what's in the job ads.
To ensure the schedd is evaluating the periodic job expressions on a timely basis, you can try amending the expression to always hold special test jobs. For example, you can add this to the end of your config files:
SYSTEM_PERIODIC_HOLD = ($(SYSTEM_PERIODIC_HOLD)) || AdminHoldJob =?= True
Then, submit a test job with the following line in the submit file:
+AdminHoldJob = True
Then wait and see if the job gets held.
- Jaime
> On Aug 17, 2021, at 5:09 AM, David Cohen <cdavid@xxxxxxxxxxxxxxxxxxxxxx> wrote:
>
> Hi,
> A SYSTEM_PERIODIC_HOLD, configured on the schedd, that used to work is ignored lately:
>
> SYSTEM_PERIODIC_HOLD = (Time() - JobCurrentStartDate) > IfthenElse(HiMemUser && (RequestMemory > 40*1024), 120*3600, 72*3600)
> SYSTEM_PERIODIC_HOLD_Reason = "Job Is Running over time"
> SYSTEM_PERIODIC_REMOVE = JobStatus == 5 && (Time() - EnteredCurrentStatus) > 600
>
> I could find no reference to that in the system's log.
> How can I debug that?
>
> Best,
> David