[HTCondor-users] What causes (almost simultaneous) slot re-use?
- Date: Tue, 21 Jan 2025 09:27:15 +0100
- From: Steffen Grunewald <steffen.grunewald@xxxxxxxxxx>
- Subject: [HTCondor-users] What causes (almost simultaneous) slot re-use?
Hi all,
in the context of exec nodes failing due to disk path problems, I noticed
things like this (result of `condor_q -run ... | grep h1210`):
36480.0 user1 1/20 14:21 0+00:27:56 slot1_1@xxxxxxxxxxxxxxxxxxx
36483.0 user1 1/20 17:24 0+00:41:40 slot1_1@xxxxxxxxxxxxxxxxxxx
187612.0 user2 1/20 21:11 0+00:49:35 slot1_1@xxxxxxxxxxxxxxxxxxx
187620.0 user2 1/20 21:18 0+00:50:40 slot1_1@xxxxxxxxxxxxxxxxxxx
187624.0 user2 1/20 21:21 0+00:43:57 slot1_1@xxxxxxxxxxxxxxxxxxx
187626.0 user2 1/20 23:25 0+00:45:49 slot1_1@xxxxxxxxxxxxxxxxxxx
110001.0 user3 1/20 15:30 0+00:33:33 slot1_1@xxxxxxxxxxxxxxxxxxx
110006.0 user3 1/20 15:30 0+00:33:56 slot1_1@xxxxxxxxxxxxxxxxxxx
110007.0 user3 1/20 15:30 0+00:37:38 slot1_1@xxxxxxxxxxxxxxxxxxx
110008.0 user3 1/20 15:30 0+00:24:38 slot1_1@xxxxxxxxxxxxxxxxxxx
110009.0 user3 1/20 15:30 0+00:23:32 slot1_1@xxxxxxxxxxxxxxxxxxx
110051.0 user3 1/20 18:18 0+00:48:51 slot1_1@xxxxxxxxxxxxxxxxxxx
For historical reasons, we're running dynamic partitioning, but this node type
is configured to accept only full-node (in terms of request_cpus) jobs, so there's
always a single "child" slot, slot1_1.
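For illustration, the setup is roughly along these lines (a sketch only; the knob values
are invented rather than copied from our actual config):

  SLOT_TYPE_1 = 100%
  SLOT_TYPE_1_PARTITIONABLE = True
  NUM_SLOTS_TYPE_1 = 1
  # only match jobs that request all cores of the machine
  START = ($(START)) && (TARGET.RequestCpus >= MY.TotalCpus)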
Due to the disk failure, I imagine that the STARTD doesn't keep track of having already
assigned the slot (the accumulated runtimes don't let me determine which job that would
have been), but shouldn't the MASTER (collector/negotiator) know?
As the job numbers suggest, we have three SCHEDDs involved (used by the three users
affected) - shouldn't they also know?
If this all boils down to a failure to update the machine classad: how would I make
sure that it is kept in memory (tmpfs, e.g. below /run) so it can be kept updated
no matter what happens to the disk? This wouldn't help the first job to be scheduled
there, as it would still hit a read-only disk, but it would prevent the next ones from
being sucked into the black hole.
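(If some piece of on-disk state is indeed involved, I'd imagine something like the
following to keep it on a RAM-backed filesystem - purely a sketch, mount point and
size made up, and the directories would still have to be created with the right
ownership at boot:)

  # /etc/fstab: RAM-backed filesystem for HTCondor's writable state
  tmpfs  /run/condor  tmpfs  size=1g,mode=0755  0  0

  # condor_config.local: point the daemons' writable directories there
  LOG  = /run/condor/log
  LOCK = /run/condor/lock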
Is there anything I can do - in addition to adding an aggressive health checker?
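(For the health checker I'm thinking of a startd cron hook roughly like this - again
only a sketch, with script path and attribute name invented:)

  # condor_config.local: periodic disk probe via startd cron
  STARTD_CRON_JOBLIST = $(STARTD_CRON_JOBLIST) DISKCHECK
  STARTD_CRON_DISKCHECK_EXECUTABLE = /usr/local/sbin/condor_disk_check.sh
  STARTD_CRON_DISKCHECK_PERIOD = 60s
  # refuse new matches unless the probe has reported a healthy disk
  START = ($(START)) && (DiskIsHealthy =?= True)

with the probe script simply testing whether the execute disk is still writable:

  #!/bin/sh
  # condor_disk_check.sh: report whether the local execute disk is writable
  probe=/var/lib/condor/execute/.health_probe
  if touch "$probe" 2>/dev/null && rm -f "$probe"; then
      echo "DiskIsHealthy = True"
  else
      echo "DiskIsHealthy = False"
  fi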
Thanks,
Steffen
--
Steffen Grunewald, Cluster Administrator
Max Planck Institute for Gravitational Physics (Albert Einstein Institute)
Am Mühlenberg 1 * D-14476 Potsdam-Golm * Germany
~~~
Phone: +49-331-567 7274
Mail: steffen.grunewald(at)aei.mpg.de
~~~