
[HTCondor-users] MODIFY_REQUEST_EXPR "error" and persistent dynamic slot



Hi,

We once again have a problem that I cannot figure out on my own, and I have not found much about it in the archives.

We are still on the 8.8 branch (8.8.9-1 to be precise; since things just worked, we were too lazy to upgrade, but the same thing happens with 8.8.17 on the execute node).

A user submitted large jobs into our "parallel" pool, i.e. we only want JobUniverse 11 (parallel) jobs there, with the sole submit host being the DedicatedScheduler.
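
(For reference, the usual dedicated-resource recipe for such a setup looks roughly like the following on the execute nodes; the scheduler host name below is just a placeholder:)

DedicatedScheduler = "DedicatedScheduler@submit.atlas.local"
STARTD_ATTRS = $(STARTD_ATTRS), DedicatedScheduler
START = (TARGET.JobUniverse == 11) && (Scheduler =?= $(DedicatedScheduler))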

The job itself wants to run alone on a 128-logical-core machine with 512 GByte of RAM. After I added a "fresh" box to the pool, the job matches the partitionable slot, but something goes awry afterwards:

Job 151235.000 defines the following attributes:
    DiskUsage = 2
    FileSystemDomain = "atlas.local"
    RequestCpus = 64
    RequestDisk = DiskUsage
    RequestMemory = 495616
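
I don't have the user's exact submit file at hand, but judging from those attributes it should look roughly like the sketch below (the executable name and machine_count are my guesses; RequestDisk = DiskUsage is, as far as I know, just condor_submit's default when no request_disk is given):

    universe       = parallel
    machine_count  = 1
    request_cpus   = 64
    request_memory = 495616
    executable     = run_job.sh
    queue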

In the StartLog I find

10/11/22 08:36:23 slot1_1: New machine resource of type -1 allocated
10/11/22 08:36:23 Setting up slot pairings
10/11/22 08:36:23 slot1_1: Request accepted.
10/11/22 08:36:23 slot1_1: Remote owner is USER@xxxxxxxxxxx
10/11/22 08:36:23 slot1_1: State change: claiming protocol successful
10/11/22 08:36:23 slot1_1: Changing state: Owner -> Claimed
10/11/22 08:36:23 Job no longer matches partitionable slot after MODIFY_REQUEST_EXPR_ edits, retrying w/o edits
10/11/22 08:36:23 slot1: Partitionable slot can't be split to allocate a dynamic slot large enough for the claim
10/11/22 08:36:23 slot1: State change: claiming protocol failed
10/11/22 08:36:23 slot1: Changing state: Unclaimed -> Owner
10/11/22 08:36:23 slot1: State change: IS_OWNER is false
10/11/22 08:36:23 slot1: Changing state: Owner -> Unclaimed

10/11/22 08:46:51 Can't read ClaimId
10/11/22 08:46:51 condor_write(): Socket closed when trying to write 29 bytes to <10.20.30.23:24729>, fd is 9
10/11/22 08:46:51 Buf::write(): condor_write() failed

But then it leaves behind a dynamic slot with the requested sizes:

condor_status a7301.atlas.local -af Cpus Memory State Activity Name
64 12026 Unclaimed Idle slot1@xxxxxxxxxxxxxxxxx
64 495616 Claimed Idle slot1_1@xxxxxxxxxxxxxxxxx
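
(I assume I can clear the stale slot by restarting the startd on that node, e.g. "condor_restart -startd a7301.atlas.local", but that obviously does not address the underlying problem.)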

On the execute node, the MODIFY rules are:
condor_config_val -dump|grep MODIFY_REQUEST_EXPR
MODIFY_REQUEST_EXPR_REQUESTCPUS = quantize(RequestCpus,{1})
MODIFY_REQUEST_EXPR_REQUESTDISK = quantize(RequestDisk,{1024})
MODIFY_REQUEST_EXPR_REQUESTMEMORY = quantize(RequestMemory,{128})
MUST_MODIFY_REQUEST_EXPRS = false
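
If I do the arithmetic myself (assuming quantize() simply rounds up to the next multiple of the given value), the edits should be harmless for this job:

quantize(RequestCpus,   {1})    = quantize(64, {1})       = 64
quantize(RequestDisk,   {1024}) = quantize(2, {1024})     = 1024
quantize(RequestMemory, {128})  = quantize(495616, {128}) = 495616   (= 3872 * 128)

So at most RequestDisk should change (from 2 to 1024 KiB), which does not look like it should make the claim too large for the partitionable slot.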

So nothing really surprising here. I've tried condor_qediting various of the request attributes, but so far to no avail :(.
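
For example, edits along these lines (values just illustrative):

condor_qedit 151235.0 RequestDisk 1024
condor_qedit 151235.0 RequestMemory 495616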

Does anyone have an idea why these jobs don't start and how I could fix the problem?

Cheers and thanks in advance!

Carsten

--
Dr. Carsten Aulbert, Max Planck Institute for Gravitational Physics,
Callinstraße 38, 30167 Hannover, Germany, Phone +49 511 762 17185