Re: [HTCondor-users] SYSTEM_PERIODIC

Mailing List Archives Authenticated access	UW Madison Computer Sciences Department Computer Systems Lab

On Mon, Aug 23, 2021 at 12:27 PM Stefano Dal Pra <stefano.dalpra@xxxxxxxxxxxx> wrote:

Hello,

i finally tested the method, and it turns out that it only works by defining SYSTEM_PERIODIC_HOLD_REASON
all at once, this way:

SYSTEM_PERIODIC_HOLD_REASON = {"","message1", "message2", "..."}[max(int($(condition1)), int($(condition2)), int($(condition3)))]

The initial plan of defining MyHoldReason = {"", ... } and then

SYSTEM_PERIODIC_HOLD_REASON = $MyHoldReason[max(...)]
does not work. Not sure if some syntactic adjustment could help.

Stefano

On 22/08/21 13:01, David Cohen wrote:
Hi,

I followed Stefano's example (with Jaime's correction), and created the following:

CPU_EXCEEDED = (CpusUsage > Cpus + 0.8 + (Cpus * 0.1))
MEMORY_EXCEEDED = ( ResidentSetSize > 1024 * RequestMemory )
TIME_EXCEEDED = (Time() - JobCurrentStartExecutingDate) > IfthenElse(HiMemUser && (RequestMemory > 40*1024), 120*3600 , 72*3600)
MyHoldReason = {"","MEMORY_EXCEEDED eval to True", "TIME_EXCEEDED eval to True", "CPU_EXCEEDED eval to True"}
SYSTEM_PERIODIC_HOLD = $(SYSTEM_PERIODIC_HOLD:False) || $(MEMORY_EXCEEDED) || $(TIME_EXCEEDED) || $(CPU_EXCEEDED)
SYSTEM_PERIODIC_HOLD_REASON = $(MyHoldReason)[max({int($(MEMORY_EXCEEDED))*1,int($(TIME_EXCEEDED))*2,int($(CPU_EXCEEDED))*3})]

I first tried to add to the startd config and run condor_reconfig, the overtime job wasn't removed, then on the schedd with the same result.

When I had only Time Rule it was on the schedd. The CPU and Mem rules, that end up conflicting with the schedd SYSTEM_PERIODIC_HOLD are from the startd.

So maybe my failure is the attempt to combine them in one location.

Any ideas?

David
On Thu, Aug 19, 2021 at 11:40 PM Jaime Frey <jfrey@xxxxxxxxxxx> wrote:
I commend you on your advanced use of ClassAds operators.
You will need to use $() when referencing the TooMuch* parameters in yourÂSYSTEM_PERIODIC_HOLD andÂSYSTEM_PERIODIC_HOLD_REASON values. Since the TooMuch* parameters are config file macros and not ClassAd attributes in the job ads, they need to be expanded at config file parsing/lookup time.
Â- Jaime
On Aug 19, 2021, at 2:57 PM, Stefano Dal Pra <stefano.dalpra@xxxxxxxxxxxx> wrote:
Hello, i was about to test something similar, by defining the following checks:

TooMuchDiskÂÂ = (DiskUsage_raw > 20 * CpusProvisioned * 1024000)
TooMuchTimeÂÂ = (ServerTime - JobStartDate > 86400 * 7)
TooMuchMemory = (MemoryProvisioned > 6000)
TooMuchImg = ImageSize_RAW/1e6 > 35 * CpusProvisioned

SYSTEM_PERIODIC_HOLD = $(SYSTEM_PERIODIC_HOLD:False) || TooMuchDisk || TooMuchTime || TooMuchMemory || TooMuchImg

MyHoldReason = {"","TooMuchDisk eval to True", "TooMuchTime eval to True", "TooMuchMemory eval to True", "TooMuchImg eval to True"}
SYSTEM_PERIODIC_HOLD_REASON = $(MyHoldReason)[max({int(TooMuchDisk),int(TooMuchTime)*2,int(TooMuchMemory)*3}]

The idea is to define MyHoldReason as an array of strings, and set
SYSTEM_PERIODIC_HOLD_REASON as one string from the array, whose index comes from the boolean values of the checks.

I think this should work, provided that int(True) == 1 Â and int(False) == 0, but have not yet tested it.
Stefano

Il 19/08/21 18:47, Jaime Frey ha scritto:
Itâs a little cumbersome to have multiple hold triggers with distinct reason messages. You need to chain them together manually. Hereâs a pattern to follow to keep it from becoming too confusing:

HOLD_CLAUSE_1 =Â( ResidentSetSize > 1024 * RequestMemory )

HOLD_REASON_1 =Â"Memory usage too high (Trying to use more then requested-memory)â

HOLD_CLAUSE_2 = (Time() - JobCurrentStartDate) > IfthenElse(HiMemUser && (RequestMemory > 40*1024), 120*3600 , 72*3600)
HOLD_REASON_2 =Â"Job Is Running over timeâ

SYSTEM_PERIODIC_HOLD = $(HOLD_CLAUSE_1)) || $(HOLD_CLAUSE_2)

SYSTEM_PERIODIC_HOLD_REASON = $(HOLD_CLAUSE_1) ? $(HOLD_REASON_1) : $(HOLD_REASON_2)

If you have more than two hold expressions, you may to add some parentheses to the SYSTEM_PERIODIC_HOLD_REASON _expression_ to ensure the nested ?: operators evaluate properly.

Â- Jaime

On Aug 19, 2021, at 7:10 AM, David Cohen <cdavid@xxxxxxxxxxxxxxxxxxxxxx> wrote:

Thanks Christoph,

<quote><doublequote><tick>

didn't return results, but:

<doublequote><tick>

did the trick.

It returned 6 jobs that were held due to high memory usage, not for running over time.
That indicated that the following, from the startd configuration is causing the conflict:
SYSTEM_PERIODIC_HOLD = ( ResidentSetSize > 1024 * RequestMemory )
SYSTEM_PERIODIC_HOLD_REASON = "Memory usage too high (Trying to use more then requested-memory)"

What is the proper way to create multiple SYSTEM_PERIODIC_HOLD without them conflicting with each other?

On Thu, Aug 19, 2021 at 2:16 PM Beyer, Christoph <christoph.beyer@xxxxxxx> wrote:

Hi,

try

condor_q -all -nobatch -constraint '"`condor_config_val SYSTEM_PERIODIC_HOLD`"'

(<quote><doublequote><tick>)

--
Christoph Beyer
DESY Hamburg
IT-Department

Notkestr. 85
Building 02b, Room 009
22607 Hamburg

phone:+49-(0)40-8998-2317
mail: christoph.beyer@xxxxxxx

Von: "David Cohen" <cdavid@xxxxxxxxxxxxxxxxxxxxxx>
An: "htcondor-users" <htcondor-users@xxxxxxxxxxx>
Gesendet: Donnerstag, 19. August 2021 12:59:15
Betreff: Re: [HTCondor-users] SYSTEM_PERIODIC_HOLD ignored

Thanks Jaime for your reply,

condor_q -all -nobatch -constraint `condor_config_val SYSTEM_PERIODIC_HOLD`

-- Parse error in constraint _expression_ "("

Looking at a job that should have been put on hold:

HiMemUser = 0
RequestMemory = 5120
JobCurrentStartDate = 1628598643 Â Â## ÂTime() - 1628598643 > 72*3600 - Assuming Time() is working properly and returning the time as Epoch value.

The error seems to indicate a typo error, but I cannot figure it out.

All the arguments that need to be evaluated are present and have the expected values.

On Wed, Aug 18, 2021 at 12:03 AM Jaime Frey <jfrey@xxxxxxxxxxx> wrote:

I canât think of anything that would normally cause a periodic hold _expression_ to stop working.
Here are a couple of ideas for debugging the problemâ

When thereâs a job in the queue that you think should be affected by the periodic hold _expression_, try running this command:
condor_q -all -nobatch -constraint `condor_config_val SYSTEM_PERIODIC_HOLD`

If that doesnât display the problematic job(s), try altering the _expression_ (removing or adjusting terms) to see whatâs needed to make the jobs appear. That can reveal differences between what youâre checking for and whatâs in the job ads.

To ensure the schedd is evaluating the periodic job expressions on a timely basis, you can try amending the _expression_ to always hold special test jobs. For example, you can add this to the end of your config files:
SYSTEM_PERIODIC_HOLD = ($SYSTEM_PERIODIC_HOLD) || AdminHoldJob=?=true

Then, submit a test job with the following line in the submit file:
+AdminHoldJob=True

Then, wait and see if the job gets held.

Â- Jaime

> On Aug 17, 2021, at 5:09 AM, David Cohen <cdavid@xxxxxxxxxxxxxxxxxxxxxx> wrote:
>
> Hi,
> A SYSTEM_PERIODIC_HOLD, configure on the schedd, that used to work is ignored lately:
>
> SYSTEM_PERIODIC_HOLD = (Time() - JobCurrentStartDate) > IfthenElse(HiMemUser && (RequestMemory > 40*1024), 120*3600 , 72*3600)
> SYSTEM_PERIODIC_HOLD_Reason = "Job Is Running over time"
> SYSTEM_PERIODIC_REMOVE = JobStatus == 5 && (Time() - EnteredCurrentStatus) > 600
>
> I could find no reference to that in the system's log.
> How can I debug that?
>
> Best,
> David

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/

Mailing List Archives

Authenticated access

Re: [HTCondor-users] SYSTEM_PERIODIC_HOLD ignored