[HTCondor-users] What causes (almost simultaneous) slot re-use?
- Date: Tue, 21 Jan 2025 09:27:15 +0100
- From: Steffen Grunewald <steffen.grunewald@xxxxxxxxxx>
- Subject: [HTCondor-users] What causes (almost simultaneous) slot re-use?
Hi all,
in the context of exec nodes failing due to disk path problems, I noticed
things like this (result of `condor_q -run ... | grep h1210`):
36480.0 user1 1/20 14:21 0+00:27:56 slot1_1@xxxxxxxxxxxxxxxxxxx
36483.0 user1 1/20 17:24 0+00:41:40 slot1_1@xxxxxxxxxxxxxxxxxxx
187612.0 user2 1/20 21:11 0+00:49:35 slot1_1@xxxxxxxxxxxxxxxxxxx
187620.0 user2 1/20 21:18 0+00:50:40 slot1_1@xxxxxxxxxxxxxxxxxxx
187624.0 user2 1/20 21:21 0+00:43:57 slot1_1@xxxxxxxxxxxxxxxxxxx
187626.0 user2 1/20 23:25 0+00:45:49 slot1_1@xxxxxxxxxxxxxxxxxxx
110001.0 user3 1/20 15:30 0+00:33:33 slot1_1@xxxxxxxxxxxxxxxxxxx
110006.0 user3 1/20 15:30 0+00:33:56 slot1_1@xxxxxxxxxxxxxxxxxxx
110007.0 user3 1/20 15:30 0+00:37:38 slot1_1@xxxxxxxxxxxxxxxxxxx
110008.0 user3 1/20 15:30 0+00:24:38 slot1_1@xxxxxxxxxxxxxxxxxxx
110009.0 user3 1/20 15:30 0+00:23:32 slot1_1@xxxxxxxxxxxxxxxxxxx
110051.0 user3 1/20 18:18 0+00:48:51 slot1_1@xxxxxxxxxxxxxxxxxxx
For historical reasons, we're running dynamic partitioning, but this node type
is configured to accept only full-node (in terms of request_cpus) jobs, so there's
always a single "child" slot, slot1_1.
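For illustration, the setup is roughly along these lines (a sketch only; the knob values
are invented rather than copied from our actual config):

  SLOT_TYPE_1 = 100%
  SLOT_TYPE_1_PARTITIONABLE = True
  NUM_SLOTS_TYPE_1 = 1
  # only match jobs that request all cores of the machine
  START = ($(START)) && (TARGET.RequestCpus >= MY.TotalCpus)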
Due to the disk failure, I imagine that the STARTD doesn't keep track of having already
assigned the slot (the accumulated runtimes don't let me determine which job that would
have been), but shouldn't the MASTER (collector/negotiator) know?
As the job numbers suggest, we have three SCHEDDs involved (used by the three users
affected) - shouldn't they also know?
If this all boils down to a failure to update the machine classad: how would I make
sure that it is kept in memory (tmpfs, e.g. below /run) so it can be kept updated
no matter what happens to the disk? This wouldn't help the first job to be scheduled
there, as it would still hit a read-only disk, but it would prevent the next ones from
being sucked into the black hole.
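(If some piece of on-disk state is indeed involved, I'd imagine something like the
following to keep it on a RAM-backed filesystem - purely a sketch, mount point and
size made up, and the directories would still have to be created with the right
ownership at boot:)

  # /etc/fstab: RAM-backed filesystem for HTCondor's writable state
  tmpfs  /run/condor  tmpfs  size=1g,mode=0755  0  0

  # condor_config.local: point the daemons' writable directories there
  LOG  = /run/condor/log
  LOCK = /run/condor/lock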
Is there anything I can do - in addition to adding an aggressive health checker?
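(For the health checker I'm thinking of a startd cron hook roughly like this - again
only a sketch, with script path and attribute name invented:)

  # condor_config.local: periodic disk probe via startd cron
  STARTD_CRON_JOBLIST = $(STARTD_CRON_JOBLIST) DISKCHECK
  STARTD_CRON_DISKCHECK_EXECUTABLE = /usr/local/sbin/condor_disk_check.sh
  STARTD_CRON_DISKCHECK_PERIOD = 60s
  # refuse new matches unless the probe has reported a healthy disk
  START = ($(START)) && (DiskIsHealthy =?= True)

with the probe script simply testing whether the execute disk is still writable:

  #!/bin/sh
  # condor_disk_check.sh: report whether the local execute disk is writable
  probe=/var/lib/condor/execute/.health_probe
  if touch "$probe" 2>/dev/null && rm -f "$probe"; then
      echo "DiskIsHealthy = True"
  else
      echo "DiskIsHealthy = False"
  fi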
Thanks,
Steffen
--
Steffen Grunewald, Cluster Administrator
Max Planck Institute for Gravitational Physics (Albert Einstein Institute)
Am Mühlenberg 1 * D-14476 Potsdam-Golm * Germany
~~~
Phone: +49-331-567 7274
Mail: steffen.grunewald(at)aei.mpg.de
~~~