Hi Jeff,
I think the "good news" is that the particular symptom you describe cannot come from any autoclustering issues. So, we can definitely go down the route of debugging your autoclustering setup (but it's not going to fix the described issue). FWIW -- other
than the most extreme scales (>500k), I can't think of any reason why you'd want to manually tweak autoclusters.
The reason I make the above statement is that, after the negotiator provides the match to the schedd process, both the schedd and startd will re-evaluate the requirements expressions in the job. Hence, if the negotiator "gets it wrong", that's not sufficient
to get the job started on the node.
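To make that argument concrete, here is a toy sketch in plain Python. It is not HTCondor code: the dictionaries stand in for ClassAds, the attribute name NikhefPool and all values are hypothetical, and the negotiator logic is simplified to "evaluate one job per autocluster and reuse the verdict". It only illustrates why a wrong autocluster grouping should not, on its own, get a job started on a node whose requirements it fails.

```python
# Toy model only: plain Python stand-ins for ClassAds, not HTCondor code.
# The attribute name "NikhefPool" and all values below are hypothetical.

def autocluster_key(job, significant_attrs):
    """Group jobs by the values of the attributes treated as significant."""
    return tuple(job.get(attr) for attr in significant_attrs)

def requirements_ok(job, machine):
    """Stand-in for evaluating the job's Requirements against a machine ad."""
    return job["requirements"](machine)

def negotiator_match(jobs, machine, significant_attrs):
    """Evaluate one representative per autocluster and reuse the verdict."""
    verdicts = {}
    matched = []
    for job in jobs:
        key = autocluster_key(job, significant_attrs)
        if key not in verdicts:
            verdicts[key] = requirements_ok(job, machine)
        if verdicts[key]:
            matched.append(job)
    return matched

def claim_time_check(matched, machine):
    """Model the later re-evaluation, done per job rather than per autocluster."""
    return [job for job in matched if requirements_ok(job, machine)]

machine = {"Name": "node1", "Pool": "A"}
jobs = [
    {"ClusterId": 1, "NikhefPool": "A", "requirements": lambda m: m["Pool"] == "A"},
    {"ClusterId": 2, "NikhefPool": "B", "requirements": lambda m: m["Pool"] == "B"},
]

# Deliberately omit the custom attribute, so both jobs share one autocluster.
matched = negotiator_match(jobs, machine, significant_attrs=[])
started = claim_time_check(matched, machine)

matched_ids = [job["ClusterId"] for job in matched]
started_ids = [job["ClusterId"] for job in started]
print(matched_ids, started_ids)  # [1, 2] [1]
```

In this model the bad grouping lets job 2 through the first stage, but the per-job check still rejects it, so only job 1 starts.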
I'd double-check the contents of the START expression in the machine ad and the Requirements for the job to look for typos or other logical mistakes.
(Here, I'm assuming that you are implementing "should not be scheduled" via requirements; if you're using some other scheduling mechanism, let us know!)
Hope this helps,
Brian
On Dec 11, 2024, at 4:30 AM, Jeff Templon <templon@xxxxxxxxx> wrote:
Hi,
With our multiple negotiator setup, we're seeing weird instances of jobs that should not be scheduled to particular nodes being scheduled there anyway. It looks to be the autoclustering - there are bunches of jobs being submitted, the only difference between
them being a different ClusterId and a couple of different values for Nikhef-added custom attributes. Although AutoCluster claims to be using these different attributes, either it is not REALLY using them, or else the algorithm looks for "close enough" instead
of "identical", and then batches some jobs together that should not be scheduled to the same node set.
If I append a different request_memory to the one set vs the other, then they are scheduled correctly to the right nodes, RequestMemory being one of the "original" AutoCluster attributes.
There are variables SIGNIFICANT_ATTRIBUTES, ADD_SIGNIFICANT_ATTRIBUTES, REMOVE_SIGNIFICANT_ATTRIBUTES that appear to do something but never manage to achieve the desired effect, namely to
not batch jobs together if they have different values for these custom attributes. There are also weird things happening, e.g. if I re-define one of those variables in the config and then do a condor_reconfig, there are remnants from the previous definition
still hanging around, and some values never get removed, even if they are listed in the REMOVE variable.
How is this supposed to work?
JT
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
The archives can be found at: https://www-auth.cs.wisc.edu/lists/htcondor-users/