
Re: [HTCondor-users] What causes (almost simultaneous) slot re-use?



As far as I know, the startd keeps all of its runtime state in RAM; it's unlikely that the machine in question is actually running more than one job simultaneously, but I guess you should check.

It seems more likely that this is a reporting problem of some kind, probably caused by the startd being able to spawn the starter (if the disk is more-or-less functional but in read-only mode) but the starter dying in a way that leaves the shadow hoping it will be able to reconnect.

As far as who "knows", only the startd can say if a job can start on that startd, and as far as I know, nobody else makes any attempt at consistency checking. (Starters dying because they can't write to the execute directory may also cause left-overs in the collector; I don't know.)

You'd have to do a little digging -- look at the corresponding job and shadow log(s) to check this hypothesis.
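A rough sketch of where to start digging on the submit machine -- the job ID 1234.0 is just a placeholder, and the paths depend on your install, so ask condor_config_val rather than guessing:

    # where does the shadow actually log?
    condor_config_val SHADOW_LOG
    # look for the job in question in the shadow log
    grep '1234\.0' $(condor_config_val SHADOW_LOG)
    # the job's own (user) log, if one was requested in the submit file
    condor_q -af UserLog 1234.0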

> Is there anything I can do - in addition to adding an aggressive health checker?

If you can characterize your usual job load, you should be able to check from a machine other than the startd if it's starting jobs too quickly; something like RecentJobBusyTimeAvg --

https://htcondor.readthedocs.io/en/latest/classad-attributes/machine-classad-attributes.html#RecentJobBusyTimeAvg

-- might work. You could then prevent machines which aren't behaving from being matched by setting NEGOTIATOR_SLOT_CONSTRAINT --

https://htcondor.readthedocs.io/en/latest/admin-manual/configuration-macros.html#NEGOTIATOR_SLOT_CONSTRAINT

on the central manager.
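
For example -- and this is only a sketch, with a made-up threshold of 30 seconds that you'd have to tune to your usual job lengths -- you could watch for suspiciously short average busy times from any machine that can talk to the collector:

    condor_status -constraint 'RecentJobBusyTimeAvg < 30' \
                  -af Name State Activity RecentJobBusyTimeAvg

and then, on the central manager, a correspondingly hedged constraint (the UNDEFINED check is there so slots that haven't accumulated the statistic yet aren't excluded):

    NEGOTIATOR_SLOT_CONSTRAINT = (RecentJobBusyTimeAvg =?= UNDEFINED) || (RecentJobBusyTimeAvg > 30)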

-- ToddM