
Re: [HTCondor-users] How is autoclustering supposed to work and how to influence it?



Yo Brian,

On 11 Dec 2024, at 19:16, Bockelman, Brian <BBockelman@xxxxxxxxxxxxx> wrote:

Hi Jeff,

I think the "good news" is that the particular symptom you describe cannot come from any autoclustering issues.  So, we can definitely go down the route of debugging your autoclustering setup (but it's not going to fix the described issue).  FWIW -- other than the most extreme scales (>500k), I can't think of any reason why you'd want to manually tweak autoclusters.
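(For anyone who does want to look at, or influence, the autoclusters: a minimal sketch, assuming schedd-side access; MaxWallTime is just the attribute from your setup used as an example:

condor_q -autocluster                        # summary of the queue grouped by autocluster
ADD_SIGNIFICANT_ATTRIBUTES = MaxWallTime     # schedd config knob: add an attribute to the autocluster signature
)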

The reason I make the above statement is that, after the negotiator provides the match to the schedd process, both the schedd and startd will re-evaluate the requirements expressions in the job.  Hence, if the negotiator "gets it wrong", that's not sufficient to get the job started on the node.

I'd double-check the contents of the START expression in the machine ad and the Requirements for the job to look for typos or other logical mistakes.

(Here, I'm assuming that you are implementing "should not be scheduled" via requirements; if you're using some other scheduling mechanism, let us know!)
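For instance, something along these lines (a sketch; the machine constraint and the job ID are placeholders):

condor_status -af:r Name Start -constraint 'regexp("wn-lot", Machine)'   # raw START expression per slot
condor_q -af:r Requirements 1234.0                                       # raw Requirements of the job
condor_q -better-analyze 1234.0                                          # schedd's view of why it does/doesn't match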

The distribution to different nodes is done by the negotiator - there is no separate fence via requirements.

One negotiator says:

02-central-manager.config:NEGOTIATOR_DEPTH_FIRST = false

20-negotiator-constraint.config:NEGOTIATOR_SLOT_CONSTRAINT = ! ( regexp("wn-sate-079", Machine) || regexp("wn-lot", Machine) || regexp("wn-pijl", Machine) )

20-negotiator-constraint.config:NEGOTIATOR_JOB_CONSTRAINT = MaxWallTime <= 24*3600

The other says:

03-negotiator.config:DAEMON_LIST = MASTER NEGOTIATOR
03-negotiator.config:COLLECTOR_HOST_FOR_NEGOTIATOR = stbc-019.nikhef.nl
03-negotiator.config:NEGOTIATOR_DEPTH_FIRST = false
03-negotiator.config:NEGOTIATOR_INTERVAL = 179
03-negotiator.config:NEGOTIATOR_MIN_INTERVAL = 67
20-negotiator-constraint.config:NEGOTIATOR_SLOT_CONSTRAINT = regexp("wn-lot", Machine) || regexp("wn-pijl", Machine)
20-negotiator-constraint.config:NEGOTIATOR_JOB_CONSTRAINT = (MaxWallTime > 24*3600) || (time()-QDate > 90)


(There's a third negotiator for wn-sate-079, but no jobs seem to get mis-scheduled there.)
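(To double-check what each negotiator is actually enforcing, something like this should work, assuming condor_config_val can reach the remote daemons; the host names are placeholders:

condor_config_val -name <first-negotiator-host> -negotiator NEGOTIATOR_SLOT_CONSTRAINT NEGOTIATOR_JOB_CONSTRAINT
condor_config_val -name <second-negotiator-host> -negotiator NEGOTIATOR_SLOT_CONSTRAINT NEGOTIATOR_JOB_CONSTRAINT
)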

So the first negotiator should only accept jobs of less than 24 hours, and those can be scheduled on any machine that is not wn-sate-079 and does not belong to the "lot" or "pijl" classes.

The second negotiator will accept longer jobs unconditionally, and shorter jobs as long as they've been queued for more than 90 seconds (to give the other negotiator a chance to schedule them onto the "short" nodes).  Those jobs can go to any "lot" or "pijl" nodes, which were excluded from the first negotiator.
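A quick way to see where the long jobs actually ended up (a sketch; MaxWallTime is the attribute from the constraints above):

condor_history -af ClusterId MaxWallTime LastRemoteHost -constraint 'MaxWallTime > 24*3600'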

I have test jobs that are identical except for one parameter; from that parameter I know which of them will take longer than 24 hours, so I feed those a different wall time via '-append' on the condor_submit command.  That is the only difference between the jobs.  If the long and short jobs are submitted within seconds of each other, the long jobs wind up on the short node classes; adding a different memory request to short vs. long jobs makes things work as intended.  Hence my suspecting the autoclustering.  Regarding your statement:

The reason I make the above statement is that, after the negotiator provides the match to the schedd process, both the schedd and startd will re-evaluate the requirements expressions in the job.  Hence, if the negotiator "gets it wrong", that's not sufficient to get the job started on the node.

So there are no requirements on the nodes or in the jobs that would prevent a "got it wrong at the negotiator" job from running.  If the negotiator gets it wrong, the job goes to the wrong place.
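For reference, the long/short test submissions and the autocluster check look roughly like this (a sketch: the submit file name is a placeholder, I'm assuming MaxWallTime is injected as a custom job attribute via a leading '+', and AutoClusterId is the schedd-assigned grouping attribute):

condor_submit -append '+MaxWallTime = 14400' test.sub      # short variant (4 hours)
condor_submit -append '+MaxWallTime = 172800' test.sub     # long variant (48 hours)
condor_q -af ClusterId AutoClusterId MaxWallTime           # do long and short share an autocluster?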

HTH,

JT