Hi Jaime,

On 4/1/20 9:32 PM, Jaime Frey wrote:
> JobStart and $(ActivationTimer) are only valid once a job starts
> running (i.e. the slot is in the Busy activity). Thus, they are not
> suitable for use in a RANK expression.
> $(ActivityTimer) is always valid, and marks the time since the slot
> changed its Activity value (Busy, Idle, etc).
> Have you looked at the MAXJOBRETIREMENTTIME parameter? This seems
> like an ideal use for it.

We looked at it, but we may have misunderstood it: we initially wanted to "solve" all our preemption via the negotiator alone and thought of MAXJOBRETIREMENTTIME as relevant only to the startd.

If I may, here are the "rules" we are trying to implement and currently fail at miserably. Do you think these rules can be put into an 8.8 or 8.9 pool configuration? We are currently unsure which of these rules should go into the startd configuration and which into the negotiator.

Over the years our pool went from a single type of machine to a large collection of different hosts, which we differentiate only into "CPU" hosts (ranging from 4 physical cores without HT/SMT to two-socket machines with 64+64 logical cores) and "GPU" hosts (ranging from small nodes with ancient single GT640/GTX750 card set-ups to recently added systems with 8 state-of-the-art GPUs).

CPU hosts:
* we guarantee a minimal runtime for each job (wall-clock time!), which should always be honored
* the actual time is configured locally, may differ between hosts and types of CPU-only hosts, and we will try to adjust it to allow smooth operation for all users (e.g. 50% with 5 hours, 25% with 10 hours, 15% with 20 hours and the rest unlimited)
* users have to add their expected run time to the submit file to be matched against those groups - which in itself may already be hard given the variety of CPUs available

GPU hosts:
* again, we guarantee a minimal runtime for GPU jobs (wall-clock time!), which should always be honored
* as above, the minimal run time is configured locally
* if CPU resources are available on a GPU host, a CPU job can be started there, but it only has a guaranteed runtime against other CPU-only jobs
* a GPU job is always allowed to preempt a CPU-only job regardless of runtime (obviously, only if a GPU resource is available)

To further complicate the matter:

* interactive jobs:
  * for testing purposes, users may submit interactive jobs (can their number and run time be limited?)
  * these may request a number of GPU and CPU resources (obviously within the limits of the available machines)
  * due to their interactive nature, these jobs should be matched as quickly as possible, without breaking the guaranteed runtime rules above
* CPU-only jobs should only match against non-GPU hosts
* overall preemption by the negotiator should be governed by the simple (default) rule of a 20% better effective user priority
* dedicated scheduler/parallel universe:
  * rarely needed by our users
  * but if needed, these jobs should get high priority, i.e. not a long waiting time, possibly by manually adjusting the defrag daemon?
  * how do we integrate all this with 1-2 dedicated schedulers for subsets of machines?
  * once matched, these jobs should run until done, i.e. no preemption by other jobs
* defrag daemon (it seems to work more or less orthogonally to all these rules)

Now the big question: is this too complex to implement, or should it be possible (and if so, how)? A very rough sketch of what we have in mind so far is below the PS.

Cheers and thanks a lot in advance!

Carsten

PS: Obviously, once we get this set-up up and running, we will document it as a complex example configuration if you want to re-use it!
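PPS: For concreteness, here is the direction we have been experimenting with. All values are placeholders and the expressions are our own untested guesses (attribute names taken from the manual as far as we understood it), so please read this as an illustration of intent rather than as a working configuration:

  # Execute nodes: guarantee a minimal wall-clock runtime by letting a
  # running job retire instead of being killed right away.  The value
  # would be set locally per host group; 5 hours shown as an example.
  MAXJOBRETIREMENTTIME = 5 * 3600

  # GPU hosts only: prefer jobs that request GPUs, so a GPU job could
  # displace a CPU-only job via startd RANK preemption when a GPU is free.
  RANK = ifThenElse(TARGET.RequestGpus =!= undefined && TARGET.RequestGpus > 0, 1000, 0)

  # Negotiator: only preempt when the candidate user's effective
  # priority is at least 20% better than that of the running job's user.
  PREEMPTION_REQUIREMENTS = RemoteUserPrio > SubmitterUserPrio * 1.2

We are especially unsure whether the "a GPU job may always preempt a CPU-only job" rule and the guaranteed runtime can coexist like this, i.e. whether the retirement time would also delay RANK-based preemption.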
--
Dr. Carsten Aulbert, Max Planck Institute for Gravitational Physics,
Callinstraße 38, 30167 Hannover, Germany
Phone: +49 511 762 17185