
[HTCondor-users] MODIFY_REQUEST_EXPR "error" and persistent dynamic slot



Hi,

We once again have a problem that I cannot figure out on my own, and I have not found much about it in the archives.

We are still on the 8.8 branch (8.8.9-1 to be precise; since things just worked, we were too lazy to upgrade, but the same thing happens with 8.8.17 on the execute node).

A user submitted large jobs into our "parallel" pool, i.e. we only want JobUniverse 11 (parallel) jobs there, with the sole submit host being the DedicatedScheduler.
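
(For reference, the usual dedicated-resource recipe for such a setup looks roughly like the following on the execute nodes; the scheduler host name below is just a placeholder:)

DedicatedScheduler = "DedicatedScheduler@submit.atlas.local"
STARTD_ATTRS = $(STARTD_ATTRS), DedicatedScheduler
START = (TARGET.JobUniverse == 11) && (Scheduler =?= $(DedicatedScheduler))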

The job itself wants to run alone on a 128-logical-core machine with 512 GByte of RAM. After I added a "fresh" box to the pool, the job matches the partitionable slot, but something goes awry afterwards:

Job 151235.000 defines the following attributes:
    DiskUsage = 2
    FileSystemDomain = "atlas.local"
    RequestCpus = 64
    RequestDisk = DiskUsage
    RequestMemory = 495616
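
I don't have the user's exact submit file at hand, but judging from those attributes it should look roughly like the sketch below (the executable name and machine_count are my guesses; RequestDisk = DiskUsage is, as far as I know, just condor_submit's default when no request_disk is given):

    universe       = parallel
    machine_count  = 1
    request_cpus   = 64
    request_memory = 495616
    executable     = run_job.sh
    queue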

In the StartLog I find

10/11/22 08:36:23 slot1_1: New machine resource of type -1 allocated
10/11/22 08:36:23 Setting up slot pairings
10/11/22 08:36:23 slot1_1: Request accepted.
10/11/22 08:36:23 slot1_1: Remote owner is USER@xxxxxxxxxxx
10/11/22 08:36:23 slot1_1: State change: claiming protocol successful
10/11/22 08:36:23 slot1_1: Changing state: Owner -> Claimed
10/11/22 08:36:23 Job no longer matches partitionable slot after MODIFY_REQUEST_EXPR_ edits, retrying w/o edits
10/11/22 08:36:23 slot1: Partitionable slot can't be split to allocate a dynamic slot large enough for the claim
10/11/22 08:36:23 slot1: State change: claiming protocol failed
10/11/22 08:36:23 slot1: Changing state: Unclaimed -> Owner
10/11/22 08:36:23 slot1: State change: IS_OWNER is false
10/11/22 08:36:23 slot1: Changing state: Owner -> Unclaimed

10/11/22 08:46:51 Can't read ClaimId
10/11/22 08:46:51 condor_write(): Socket closed when trying to write 29 bytes to <10.20.30.23:24729>, fd is 9
10/11/22 08:46:51 Buf::write(): condor_write() failed

But then it leaves behind a dynamic slot with the requested sizes:

condor_status a7301.atlas.local -af Cpus Memory State Activity Name
64 12026 Unclaimed Idle slot1@xxxxxxxxxxxxxxxxx
64 495616 Claimed Idle slot1_1@xxxxxxxxxxxxxxxxx
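
(I assume I can clear the stale slot by restarting the startd on that node, e.g. "condor_restart -startd a7301.atlas.local", but that obviously does not address the underlying problem.)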

On the execute node, the MODIFY rules are:
condor_config_val -dump|grep MODIFY_REQUEST_EXPR
MODIFY_REQUEST_EXPR_REQUESTCPUS = quantize(RequestCpus,{1})
MODIFY_REQUEST_EXPR_REQUESTDISK = quantize(RequestDisk,{1024})
MODIFY_REQUEST_EXPR_REQUESTMEMORY = quantize(RequestMemory,{128})
MUST_MODIFY_REQUEST_EXPRS = false
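
If I do the arithmetic myself (assuming quantize() simply rounds up to the next multiple of the given value), the edits should be harmless for this job:

quantize(RequestCpus,   {1})    = quantize(64, {1})       = 64
quantize(RequestDisk,   {1024}) = quantize(2, {1024})     = 1024
quantize(RequestMemory, {128})  = quantize(495616, {128}) = 495616   (= 3872 * 128)

So at most RequestDisk should change (from 2 to 1024 KiB), which does not look like it should make the claim too large for the partitionable slot.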

So nothing really surprising here. I've tried condor_qediting various of the request attributes, but so far to no avail :(.
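
For example, edits along these lines (values just illustrative):

condor_qedit 151235.0 RequestDisk 1024
condor_qedit 151235.0 RequestMemory 495616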

Does anyone have an idea why these jobs don't start and how I could fix the problem?

Cheers and thanks in advance!

Carsten

--
Dr. Carsten Aulbert, Max Planck Institute for Gravitational Physics,
Callinstraße 38, 30167 Hannover, Germany, Phone +49 511 762 17185