
Re: [HTCondor-users] Usage of JobStart or ActivationTimer in startd RANK expression?



Hi Jaime,

On 4/1/20 9:32 PM, Jaime Frey wrote:
> JobStart and $(ActivationTimer) are only valid once a job starts running (i.e. the slot is in the Busy activity).
> Thus, they are not suitable for use in a RANK expression.
> $(ActivityTimer) is always valid, and marks the time since the slot changed its Activity value (Busy, Idle, etc).
> Have you looked at the MAXJOBRETIREMENTTIME parameter? This seems like an ideal use for it.
We looked at it, but we may have misunderstood it: we initially wanted
to "solve" all our preemption solely via the negotiator, and assumed
MAXJOBRETIREMENTTIME was relevant only to the startd.
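If MAXJOBRETIREMENTTIME is indeed the right knob, I guess a minimal sketch on a host with a 5-hour guarantee (the 5 hours being just one of our example values) would look like this - untested:

```
# startd config on a CPU host: let a running job retire gracefully for
# up to 5 hours of wall-clock time before a preemption takes effect
MAXJOBRETIREMENTTIME = 5 * 3600
```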

If I may, here are the "rules" we are trying to implement and currently
failing at miserably - do you think these rules can be put into an 8.8
or 8.9 pool configuration? We are currently unsure which of them belong
in the startd configuration and which in the negotiator...

Over the years our pool grew from a single type of machine to a large
collection of different hosts, which we will only differentiate into
"CPU" hosts (ranging from 4 physical cores without HT/SMT to two-socket
machines with 64+64 logical cores) and "GPU" hosts (ranging from small
nodes with ancient single GT640/GTX750 card set-ups to recently added
systems with 8 state-of-the-art GPUs).

CPU hosts:

* we guarantee a minimum runtime for each job (wall-clock time!), which
should always be honored
* the actual time is locally configured, may differ between hosts and
types of CPU-only hosts, and we will try to adjust it to allow smooth
operation for all users (e.g. 50% with 5 hours, 25% with 10 hours, 15%
with 20 hours and the rest being unlimited).
* users have to add their expected run time to the submit file so the
job can be matched to one of those groups - which in itself may already
be hard given the variety of CPUs available
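For the submit-file side, we currently imagine a site-local custom attribute along these lines (the attribute name EstimatedRunHours is our own invention, not a standard HTCondor one):

```
# submit file sketch: user declares the expected wall-clock runtime
# (EstimatedRunHours is a hypothetical site-defined attribute)
executable = my_job.sh
+EstimatedRunHours = 8

queue
```

The startd or negotiator expressions would then have to reference TARGET.EstimatedRunHours to pick the matching guarantee class.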

GPU hosts:

* again, we guarantee a minimum runtime for GPU jobs (wall-clock
time!), which should always be honored
* as above, a locally configured minimum run time
* if CPU resources are available on a GPU host, a CPU job can be
started there, but its runtime is only guaranteed against other
CPU-only jobs
* a GPU job is always allowed to preempt a CPU-only job regardless of
runtime (obviously, if a GPU resource is available)
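For the GPU-over-CPU preference, something like a startd RANK on the GPU hosts is what we have in mind; a rough, untested sketch:

```
# startd config on GPU hosts: rank jobs requesting GPUs above CPU-only
# jobs, so a newly matched GPU job may preempt a running CPU-only job
RANK = 100 * (TARGET.RequestGpus =!= undefined && TARGET.RequestGpus > 0)
```

Whether this plays nicely with the runtime guarantees above (i.e. whether the retirement time must be disabled for this particular preemption) is exactly the part we are unsure about.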

To further complicate the matter:

* interactive jobs:
  * for testing purposes, users may submit interactive jobs (can the
number and run time be limited?)
  * these may request a number of GPU and CPU resources (obviously
within the limits of available machines)
  * due to their interactive nature, these jobs should be matched as
quickly as possible, without breaking the guaranteed runtime rules above
  * CPU-only jobs should only match against non-GPU hosts
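For limiting the number of concurrent interactive jobs, we wonder whether we could (mis)use concurrency limits; a sketch, assuming a limit name "interactive" that we would define ourselves and that users would have to request in their submit files:

```
# central manager config: at most 10 concurrent "interactive" tokens
INTERACTIVE_LIMIT = 10

# and in each interactive job's submit file:
#   concurrency_limits = interactive
```

This obviously relies on users (or a submit transform) actually adding the limit to interactive jobs.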

* Overall preemption by the negotiator should be governed by the simple
(default) rule of a 20% better effective user priority.
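For this 20% rule, I assume the classic negotiator expression from the manual is all we need:

```
# negotiator config: allow preemption only when the candidate user's
# effective priority is at least 20% better than the running user's
PREEMPTION_REQUIREMENTS = RemoteUserPrio > TARGET.SubmitterUserPrio * 1.2
```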

* dedicated scheduler/parallel universe:
  * rarely needed by our users
  * but if needed, high priority, i.e. a short waiting time, possibly
achieved by manually adjusting the defrag daemon?
  * how to integrate all this with 1-2 dedicated schedulers for subsets
of machines?
  * once matched, these jobs should run until done, i.e. no preemption
by other jobs

* defrag daemon (seems to work more or less orthogonally to all these
rules)


Now the big question: is this too complex to implement, or should it be
possible (and if so, how)?

Cheers and thanks a lot in advance!

Carsten

PS: Obviously, once we get this set-up up and running, we will document
it as a complex example configuration if you want to re-use it!
-- 
Dr. Carsten Aulbert, Max Planck Institute for Gravitational Physics,
Callinstraße 38, 30167 Hannover, Germany
Phone: +49 511 762 17185
