
Re: [HTCondor-users] MODIFY_REQUEST_EXPR "error" and persistent dynamic slot



Hi Carsten,

If you add D_MATCH:2 to STARTD_DEBUG on the execute node, 8.8 will print the full job and slot ClassAds when it hits the "Job no longer matches partitionable slot after..." case. You can then save those ads to files and run them through condor_q -better-analyze, using the -jobads and -slotads arguments to pass the job and slot files.

STARTD_DEBUG = $(STARTD_DEBUG) D_CAT D_MATCH:2

If you can upgrade the execute node to 9.0.x or 9.x, it will do that sort of matchmaking analysis inside the schedd and print the analysis. (This feature was added in 8.9.7.)

-tj

-----Original Message-----
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Carsten Aulbert
Sent: Tuesday, October 11, 2022 6:22 AM
To: htcondor-users@xxxxxxxxxxx
Subject: [HTCondor-users] MODIFY_REQUEST_EXPR "error" and persistent dynamic slot

Hi,

once again we have a problem that I cannot figure out on my own, and I
have not found much about it in the archives.

We are still on the 8.8 branch (8.8.9-1 to be precise; as things just
worked, we were too lazy to upgrade, but the same happens with 8.8.17
on the execute node).

A user submitted large jobs into our "parallel" pool, i.e. we only want 
JobUniverse 11 there with the sole submit host being the DedicatedScheduler.

The job itself wants to run alone on a 128 logical core machine with 
512GByte RAM and after I added a "fresh" box to the pool, the job 
matches to the partitionable slot but something goes awry afterwards:

Job 151235.000 defines the following attributes:
     DiskUsage = 2
     FileSystemDomain = "atlas.local"
     RequestCpus = 64
     RequestDisk = DiskUsage
     RequestMemory = 495616

In the StartLog I find

10/11/22 08:36:23 slot1_1: New machine resource of type -1 allocated
10/11/22 08:36:23 Setting up slot pairings
10/11/22 08:36:23 slot1_1: Request accepted.
10/11/22 08:36:23 slot1_1: Remote owner is USER@xxxxxxxxxxx
10/11/22 08:36:23 slot1_1: State change: claiming protocol successful
10/11/22 08:36:23 slot1_1: Changing state: Owner -> Claimed
10/11/22 08:36:23 Job no longer matches partitionable slot after MODIFY_REQUEST_EXPR_ edits, retrying w/o edits
10/11/22 08:36:23 slot1: Partitionable slot can't be split to allocate a dynamic slot large enough for the claim
10/11/22 08:36:23 slot1: State change: claiming protocol failed
10/11/22 08:36:23 slot1: Changing state: Unclaimed -> Owner
10/11/22 08:36:23 slot1: State change: IS_OWNER is false
10/11/22 08:36:23 slot1: Changing state: Owner -> Unclaimed

10/11/22 08:46:51 Can't read ClaimId
10/11/22 08:46:51 condor_write(): Socket closed when trying to write 29 bytes to <10.20.30.23:24729>, fd is 9
10/11/22 08:46:51 Buf::write(): condor_write() failed

But then it leaves behind a dynamic slot with the requested sizes:

condor_status a7301.atlas.local -af Cpus Memory State Activity Name
64 12026 Unclaimed Idle slot1@xxxxxxxxxxxxxxxxx
64 495616 Claimed Idle slot1_1@xxxxxxxxxxxxxxxxx

On the execute node, the MODIFY rules are:
condor_config_val -dump|grep MODIFY_REQUEST_EXPR
MODIFY_REQUEST_EXPR_REQUESTCPUS = quantize(RequestCpus,{1})
MODIFY_REQUEST_EXPR_REQUESTDISK = quantize(RequestDisk,{1024})
MODIFY_REQUEST_EXPR_REQUESTMEMORY = quantize(RequestMemory,{128})
MUST_MODIFY_REQUEST_EXPRS = false
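
For reference, here is a minimal Python sketch of what these quantize()
rules do to the job's request values. It follows the documented ClassAd
behaviour for a list argument (return the first list entry that is at
least as large as the value, otherwise round up to a multiple of the last
entry). The Python function and the `job` dictionary are illustrative
only and are not HTCondor code; DiskUsage = 2 is substituted for
RequestDisk as in the job ad above.

```python
def quantize(value, steps):
    """Approximate ClassAd quantize(value, {steps...}) for positive integers."""
    # First list entry that is already >= value wins.
    for step in steps:
        if step >= value:
            return step
    # Otherwise round up to the next multiple of the last entry.
    last = steps[-1]
    return -(-value // last) * last  # integer ceiling division


# Request values taken from job 151235.000 (RequestDisk = DiskUsage = 2).
job = {"RequestCpus": 64, "RequestDisk": 2, "RequestMemory": 495616}

edited = {
    "RequestCpus": quantize(job["RequestCpus"], [1]),
    "RequestDisk": quantize(job["RequestDisk"], [1024]),
    "RequestMemory": quantize(job["RequestMemory"], [128]),
}

for name, value in edited.items():
    print(name, job[name], "->", value)
```

Under these assumptions only RequestDisk changes (2 becomes 1024);
RequestCpus and RequestMemory are already exact multiples of their
quantum and stay at 64 and 495616.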

So nothing really surprising here. I've tried using condor_qedit on
various of the request lines, but so far to no avail :(.

Does anyone have an idea why these jobs don't start and how I could fix
the problem?

Cheers and thanks in advance!

Carsten

-- 
Dr. Carsten Aulbert, Max Planck Institute for Gravitational Physics,
Callinstraße 38, 30167 Hannover, Germany, Phone +49 511 762 17185