[HTCondor-users] MODIFY_REQUEST_EXPR "error" and persistent dynamic slot
- Date: Tue, 11 Oct 2022 13:22:20 +0200
- From: Carsten Aulbert <carsten.aulbert@xxxxxxxxxx>
- Subject: [HTCondor-users] MODIFY_REQUEST_EXPR "error" and persistent dynamic slot
Hi,
we once again have a problem that I cannot figure out on my own, and I have not
found much about it in the archives.
We are still on the 8.8 branch (8.8.9-1 to be precise; as things just
worked, we were too lazy to upgrade), but the same happens with
8.8.17 on the execute node.
A user submitted large jobs into our "parallel" pool, i.e. we only allow
JobUniverse 11 there, with the sole submit host being the DedicatedScheduler.
The job wants to run alone on a machine with 128 logical cores and
512 GByte RAM. After I added a "fresh" box to the pool, the job
matches the partitionable slot, but something goes awry afterwards:
Job 151235.000 defines the following attributes:
DiskUsage = 2
FileSystemDomain = "atlas.local"
RequestCpus = 64
RequestDisk = DiskUsage
RequestMemory = 495616
In the StartLog I find:
10/11/22 08:36:23 slot1_1: New machine resource of type -1 allocated
10/11/22 08:36:23 Setting up slot pairings
10/11/22 08:36:23 slot1_1: Request accepted.
10/11/22 08:36:23 slot1_1: Remote owner is USER@xxxxxxxxxxx
10/11/22 08:36:23 slot1_1: State change: claiming protocol successful
10/11/22 08:36:23 slot1_1: Changing state: Owner -> Claimed
10/11/22 08:36:23 Job no longer matches partitionable slot after
MODIFY_REQUEST_EXPR_ edits, retrying w/o edits
10/11/22 08:36:23 slot1: Partitionable slot can't be split to allocate a
dynamic slot large enough for the claim
10/11/22 08:36:23 slot1: State change: claiming protocol failed
10/11/22 08:36:23 slot1: Changing state: Unclaimed -> Owner
10/11/22 08:36:23 slot1: State change: IS_OWNER is false
10/11/22 08:36:23 slot1: Changing state: Owner -> Unclaimed
10/11/22 08:46:51 Can't read ClaimId
10/11/22 08:46:51 condor_write(): Socket closed when trying to write 29
bytes to <10.20.30.23:24729>, fd is 9
10/11/22 08:46:51 Buf::write(): condor_write() failed
But then it leaves behind a dynamic slot with the requested sizes:
condor_status a7301.atlas.local -af Cpus Memory State Activity Name
64 12026 Unclaimed Idle slot1@xxxxxxxxxxxxxxxxx
64 495616 Claimed Idle slot1_1@xxxxxxxxxxxxxxxxx
On the execute node, the MODIFY rules are:
condor_config_val -dump | grep MODIFY_REQUEST_EXPR
MODIFY_REQUEST_EXPR_REQUESTCPUS = quantize(RequestCpus,{1})
MODIFY_REQUEST_EXPR_REQUESTDISK = quantize(RequestDisk,{1024})
MODIFY_REQUEST_EXPR_REQUESTMEMORY = quantize(RequestMemory,{128})
MUST_MODIFY_REQUEST_EXPRS = false
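For reference, if I understand the ClassAd quantize() function correctly, it just rounds each request up to the next multiple of the quantum, so the edits should be harmless for this job. Here is a minimal Python sketch of my understanding (my own approximation, not HTCondor's actual implementation):

```python
def quantize(value, quantum_list):
    # Approximation of ClassAd quantize(x, {q}): round x up to the
    # next multiple of q (assuming a single-element quantum list).
    q = quantum_list[0]
    return ((value + q - 1) // q) * q

# Applied to the job's requests from above:
print(quantize(64, [1]))        # RequestCpus   -> 64 (unchanged)
print(quantize(2, [1024]))      # RequestDisk   -> 1024
print(quantize(495616, [128]))  # RequestMemory -> 495616 (already a multiple of 128)
```

So only RequestDisk gets bumped (2 -> 1024 KiB), which should still fit easily, yet the StartLog claims the job "no longer matches" after the edits.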
So nothing really surprising here. I've tried condor_qedit-ing various of
the request lines, but so far to no avail :(.
Does anyone have an idea why these jobs don't start, and how I could fix
the problem?
Cheers and thanks in advance!
Carsten
--
Dr. Carsten Aulbert, Max Planck Institute for Gravitational Physics,
Callinstraße 38, 30167 Hannover, Germany, Phone +49 511 762 17185