On 5/27/2022 8:08 AM, Carles Acosta wrote:
Dear all,
We have a test execute machine with 100 partitionable slots of
48 cores and 2 GB RAM per core. Everything is fake, just for
testing purposes, and only runs sleep commands. We are using
HTCondor 9.0.12 in this test environment.
We are doing different tests, submitting a batch of 100 jobs
and changing the number of requested CPUs on this execute
machine. When we submit 100 jobs requesting 48 cores, all 100
jobs start in the first negotiator cycle. When we submit 100
jobs of 24 cores, 75 jobs start in the first cycle; with 8
cores, 31 jobs start in the first cycle; and with 1 core, only
6 jobs start in the first cycle.
We have been looking in the manual for a negotiator, schedd,
startd, or any other configuration variable that explains this
behavior, but we were not lucky. Is there any way to enforce,
for instance, that all the jobs start in the first negotiator
cycle even if each one requests a single CPU? Our guess is that
there is some configuration or timeout regarding the creation of
the dynamic slots, etc., that is affecting this case, or maybe
this is related to the auto clustering on the negotiator side?
Thank you in advance and have a nice weekend!
Carles
Hi Carles,
Something certainly seems amiss in your setup, or your test execute
machine is having resource contention problems...
I tried to reproduce your environment by installing minicondor
v9.0.13 into a CentOS 7 docker container (with 6 cores and 12 GB
RAM), configuring the startd with 100 pslots of 48 cores each, and
submitting 100 jobs each requesting 1 core. In my test, all 100
jobs got matched in the first negotiation cycle. Details of how I
did my test are below (*).
In my test, almost all configuration knobs were just using the
default settings. It may help to run "condor_config_val -summary" on
your central manager and perhaps your access point (submit
machine). This command will output all your customized config
changes, i.e., settings that differ from the defaults. This may
give a clue. If you are willing to
share the output of this command here (maybe sanitize hostnames if
desired), we could also look it over for anything suspicious.
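For example, on the central manager something along these lines
(the exact output will of course vary from pool to pool):
$ condor_config_val -summary
should print just the settings that have been changed away from
the defaults.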
Hope this helps,
Todd
(*) Pithy testing procedure I used to try to reproduce the problem:
( fire up a minimal empty Centos 7 container for testing; all
subsequent commands are in the container )
$ docker run --rm -it centos:7
( next install minicondor from the stable (v9.0.x) channel in the
container )
# curl -fsSL https://get.htcondor.org | /bin/bash -s -- --no-dry-run --channel stable
( configure startd with 100 48-core pslots, and set
NEGOTIATOR_INTERVAL to be huge so only one negotiation cycle
will take place after submitting jobs )
# cat - > /etc/condor/config.d/05-test.conf
# pretend the machine has 100 x 48 = 4800 cores and 100 x 2048 MB of memory
NUM_CPUS = 4800
MEMORY = 100 * 2048
# carve it into 100 partitionable slots of 48 cores and 2048 MB each
SLOT_TYPE_1 = memory=2048, cpus=48
SLOT_TYPE_1_PARTITIONABLE = true
NUM_SLOTS_TYPE_1 = 100
# huge interval, so only one negotiation cycle happens during the test
NEGOTIATOR_INTERVAL = 5000
<CTRL-D>
(start up htcondor)
# condor_master
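(optionally, after giving the daemons a few seconds to report,
one can confirm the 100 partitionable slots are visible with
something like)
# condor_status -total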
(become a non-root user to submit test sleep jobs)
# adduser tannenba
# su - tannenba
(submit 100 1-core sleep jobs)
$ condor_submit executable=/usr/bin/sleep arguments=120 request_cpus=1 -queue 100
Submitting
job(s)....................................................................................................
100 job(s) submitted to cluster 3.
(after a few seconds and one negotiation cycle, all jobs are
running... )
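(if impatient, one could probably also nudge the negotiator into a
cycle right away with condor_reschedule rather than waiting)
$ condor_reschedule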
$ condor_q
-- Schedd: a5896b768045 : <127.0.0.1:9618?... @ 05/27/22 18:18:17
OWNER    BATCH_NAME    SUBMITTED   DONE   RUN   IDLE  TOTAL JOB_IDS
tannenba ID: 3         5/27 18:18     _    100     _    100 3.0-99
Total for query: 100 jobs; 0 completed, 0 removed, 0 idle, 100
running, 0 held, 0 suspended
Total for tannenba: 100 jobs; 0 completed, 0 removed, 0 idle, 100
running, 0 held, 0 suspended
Total for all users: 100 jobs; 0 completed, 0 removed, 0 idle, 100
running, 0 held, 0 suspended