
Re: [HTCondor-users] What causes (almost simultaneous) slot re-use?



As far as I know, the startd keeps all of its runtime state in RAM; it's unlikely that the machine in question is actually running more than one job simultaneously, but I guess you should check.

It seems more likely that this is a reporting problem of some kind, probably caused by the startd being able to spawn the starter (if the disk is more-or-less functional but in read-only mode) but the starter dying in a way that leaves the shadow hoping it will be able to reconnect.

As far as who "knows", only the startd can say if a job can start on that startd, and as far as I know, nobody else makes any attempt at consistency checking. (Starters dying because they can't write to the execute directory may also cause left-overs in the collector; I don't know.)

You'd have to do a little digging -- look at the corresponding job and shadow log(s) to check this hypothesis.
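A rough sketch of where to start digging on the submit machine -- the job ID 1234.0 is just a placeholder, and the paths depend on your install, so ask condor_config_val rather than guessing:

    # where does the shadow actually log?
    condor_config_val SHADOW_LOG
    # look for the job in question in the shadow log
    grep '1234\.0' $(condor_config_val SHADOW_LOG)
    # the job's own (user) log, if one was requested in the submit file
    condor_q -af UserLog 1234.0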

> Is there anything I can do - in addition to adding an aggressive health checker?

If you can characterize your usual job load, you should be able to check from a machine other than the startd if it's starting jobs too quickly; something like RecentJobBusyTimeAvg --

https://htcondor.readthedocs.io/en/latest/classad-attributes/machine-classad-attributes.html#RecentJobBusyTimeAvg

-- might work. You could then prevent machines which aren't behaving from being matched by setting NEGOTIATOR_SLOT_CONSTRAINT --

https://htcondor.readthedocs.io/en/latest/admin-manual/configuration-macros.html#NEGOTIATOR_SLOT_CONSTRAINT

on the central manager.
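
For example -- and this is only a sketch, with a made-up threshold of 30 seconds that you'd have to tune to your usual job lengths -- you could watch for suspiciously short average busy times from any machine that can talk to the collector:

    condor_status -constraint 'RecentJobBusyTimeAvg < 30' \
                  -af Name State Activity RecentJobBusyTimeAvg

and then, on the central manager, a correspondingly hedged constraint (the UNDEFINED check is there so slots that haven't accumulated the statistic yet aren't excluded):

    NEGOTIATOR_SLOT_CONSTRAINT = (RecentJobBusyTimeAvg =?= UNDEFINED) || (RecentJobBusyTimeAvg > 30)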

-- ToddM