Re: [HTCondor-users] What causes (almost simultaneous) slot re-use?
- Date: Tue, 21 Jan 2025 11:11:08 -0600 (CST)
- From: Todd L Miller <tlmiller@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] What causes (almost simultaneous) slot re-use?
As far as I know, the startd keeps all of its runtime state in
RAM; it's unlikely that the machine in question is actually running more
than one job simultaneously, but I guess you should check.
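One quick way to check, run on the execute node itself (just a sketch
using standard tools, nothing HTCondor-specific beyond condor_who):
    condor_who                  # the startd's view of jobs running on this machine
    pgrep -a condor_starter     # one condor_starter process per running job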
It seems more likely that this is a reporting problem of some kind,
probably caused by the startd being able to spawn the starter (if the disk
is more-or-less functional but in read-only mode) and the starter then
dying in a way that leaves the shadow hoping it will be able to reconnect.
As far as who "knows": only the startd can say whether a job can
start on that startd, and as far as I know, nobody else makes any attempt
at consistency checking. (Starters dying because they can't write to the
execute directory may also leave leftovers in the collector; I don't
know.)
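One rough consistency check is to compare the collector's view of the
machine with the startd's own (replace exec-node.example.com with the
real hostname):
    condor_status exec-node.example.com            # slot ads as held by the collector
    condor_status -direct exec-node.example.com    # slot ads queried from the startd itself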
You'd have to do a little digging -- look at the corresponding job
and shadow log(s) to check this hypothesis.
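On the submit machine, something along these lines (assuming the default
log location; the grep pattern is only a guess at what to look for):
    condor_config_val SHADOW_LOG
    grep -i reconnect "$(condor_config_val SHADOW_LOG)"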
> Is there anything I can do - in addition to adding an aggressive health
> checker?
If you can characterize your usual job load, you should be able to
check from a machine other than the startd whether it's starting jobs too
quickly; something like RecentJobBusyTimeAvg --
https://htcondor.readthedocs.io/en/latest/classad-attributes/machine-classad-attributes.html#RecentJobBusyTimeAvg
-- might work. You could then prevent machines that aren't
behaving from being matched by setting NEGOTIATOR_SLOT_CONSTRAINT --
https://htcondor.readthedocs.io/en/latest/admin-manual/configuration-macros.html#NEGOTIATOR_SLOT_CONSTRAINT
-- on the central manager.
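For example (the 300-second threshold below is purely illustrative; pick
whatever makes sense for your job load):
    condor_status -constraint 'RecentJobBusyTimeAvg < 300' -af Machine RecentJobBusyTimeAvg
and, in the central manager's configuration, something like:
    NEGOTIATOR_SLOT_CONSTRAINT = (RecentJobBusyTimeAvg is undefined) || (RecentJobBusyTimeAvg > 300)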
-- ToddM