
[HTCondor-users] preemption problems on particular execution nodes



Hi,

We currently have HTCondor 8.6.5 installed. Our sandbox system has
- one submit host
- one master node working as the collector and negotiator
- and two different execution nodes a3010 (128 cores) and a3001 (32 cores)
  with similar configurations.

We use partitionable slots on the execution nodes:
SLOT_TYPE_1 = ram=438065, swap=0%, cpus=100%
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1_PARTITIONABLE = True

Additionally, we have two users, 1 and 2, with very different effective
user priorities (EUP_1 >> EUP_2, i.e. user 2 has the better priority),
and we want to test preemption.

However, preemption does not always work.

I)
We perform the following experiment:
- User 1 starts single-core jobs, which are scheduled on node a3010.
- All jobs are running.
- User 2, who has a much better EUP, submits multicore jobs
  that ought to preempt the running jobs of user 1 on node a3010.

However, even though ALLOW_PSLOT_PREEMPTION = True, this does not happen.

II)
We perform the same experiment on node a3001 and see a different result.
The jobs of user 1 are now being preempted and the slots are occupied
by the jobs of user 2.

The negotiator logs with D_FULLDEBUG for experiment I) can be
downloaded here:
https://www.atlas.aei.uni-hannover.de/~fehrmann/Condor/NegotiatorLog_a3010.gz

The negotiator logs for experiment II):
https://www.atlas.aei.uni-hannover.de/~fehrmann/Condor/NegotiatorLog_a3001.gz

The configuration of the collector and negotiator node:
https://www.atlas.aei.uni-hannover.de/~fehrmann/Condor/negotiator_collector.txt.gz

The configuration of the schedd:
https://www.atlas.aei.uni-hannover.de/~fehrmann/Condor/schedd.txt.gz

The configuration of execution node a3010:
https://www.atlas.aei.uni-hannover.de/~fehrmann/Condor/startd_a3010.txt.gz

The configuration of execution node a3001:
https://www.atlas.aei.uni-hannover.de/~fehrmann/Condor/startd_a3001.txt.gz

A snippet of the negotiator configuration with the settings we think are relevant:

PREEMPTION_REQUIREMENTS = True
ALLOW_PSLOT_PREEMPTION = True
PREEMPTION_RANK = (RemoteUserPrio * 1000000) - ifThenElse(isUndefined(TotalJobRunTime), 0, TotalJobRunTime)
NEGOTIATOR_CONSIDER_EARLY_PREEMPTION = True
NEGOTIATOR_CONSIDER_PREEMPTION = true
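
To spell out how we read the PREEMPTION_RANK expression above, here is a small Python sketch of the arithmetic. It is only an illustration, not HTCondor code: the function name, the slot names, and all priority and run-time values are made up, and we assume that the negotiator prefers the candidate with the highest rank.

```python
def preemption_rank(remote_user_prio, total_job_run_time=None):
    """Mirror of our PREEMPTION_RANK expression (illustration only)."""
    # ifThenElse(isUndefined(TotalJobRunTime), 0, TotalJobRunTime)
    run_time = 0 if total_job_run_time is None else total_job_run_time
    return remote_user_prio * 1000000 - run_time


# Hypothetical claimed slots: the EUP of the user running there and
# the run time of the job in seconds (None means undefined).
candidates = {
    "slot1_1": preemption_rank(500.0, 3600),
    "slot1_2": preemption_rank(500.0, 60),
    "slot1_3": preemption_rank(0.5, None),
}

# Assumption: the candidate with the highest rank is preempted first.
best = max(candidates, key=candidates.get)
print(best, candidates[best])
```

With these made-up numbers the factor of 1000000 makes the user priority dominate, and the run time only breaks ties between jobs of the same user, so the shortest-running job of the user with the worst EUP would be chosen first.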

Interestingly enough, if we set ALLOW_PSLOT_PREEMPTION to False,
scenario I) works, provided that user 2 submits with request_cpus = 1.

Thank you in advance for feedback.

Cheers,
the Atlas team.