Re: [HTCondor-users] implement scheduled downtimes for one accountinggroup in the pool
- Date: Sun, 05 Apr 2020 22:17:22 -0500
- From: Gregory Thain <gthain@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] implement scheduled downtimes for one accountinggroup in the pool
On 4/3/20 3:06 AM, Beyer, Christoph wrote:
Hi all,
Our pool is used by different VOs, which are mapped to accounting groups to get the quotas right. Every now and then we have scheduled downtimes for fileserver maintenance, dCache upgrades, etc. for one or more of these VOs.
Since all our jobs have estimated runtimes, the most elegant approach would be to handle these temporary interruptions automatically. The start of a downtime would be noted in a config file, and the jobs of the matching VO would then be checked to see whether they fit into the remaining time window.
Hi Christoph:
I don't think we have a good way to do this at the negotiator level.
The best practice that we recommend for worker nodes that have shared
filesystems is to write a STARTD_CRON job for each filesystem that detects
whether the filesystem is healthy and advertises that in the startd ClassAd
as a boolean. Jobs that need those shared filesystems add this boolean
attribute to their job requirements, so they don't match machines with
bad filesystem mounts.
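A minimal sketch of such a probe, assuming Python is available on the
worker nodes. The attribute name HasSharedFS, the default mount path
/mnt/shared, and the job name SHAREDFS in the comments are placeholders
for illustration, not anything HTCondor defines. The script prints one
"Attribute = Value" line on stdout, which is the form startd cron output
is merged into the machine ad from.

```python
#!/usr/bin/env python3
"""Sketch of a startd cron probe: report whether a shared mount is usable.

Assumed wiring in the worker node config (job name is a placeholder):
    STARTD_CRON_JOBLIST = $(STARTD_CRON_JOBLIST) SHAREDFS
    STARTD_CRON_SHAREDFS_EXECUTABLE = /path/to/this/script
    STARTD_CRON_SHAREDFS_PERIOD = 5m
Jobs that need the mount would then add something like
    requirements = (TARGET.HasSharedFS == True)
to their submit file.
"""
import os
import sys


def mount_is_healthy(path):
    """Return True if `path` is a directory whose contents can be listed."""
    try:
        os.listdir(path)
        return os.path.isdir(path)
    except OSError:
        return False


def classad_line(attr, healthy):
    """Format a boolean ClassAd attribute assignment."""
    return "{} = {}".format(attr, "True" if healthy else "False")


if __name__ == "__main__":
    # Placeholder default path; pass the real mount point as argv[1].
    mount = sys.argv[1] if len(sys.argv) > 1 else "/mnt/shared"
    print(classad_line("HasSharedFS", mount_is_healthy(mount)))
```

Note that listing a directory on a hung network mount can block, so a
real probe would probably want a timeout around the check.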
One idea is to extend this, so you don't advertise a boolean, but rather
a time until which you expect the mount to stay good, and factor
this into the job's requirement expression. I realize this is not the
centralized solution that's best for you, but the startd cron could read
this data from a centralized place before advertising it. It would also
add in local knowledge, testing whether the filesystem mount is currently
working on that node.
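To make the idea concrete, here is a rough sketch of the logic such a
startd cron script might use. Everything named here is an assumption:
the attribute SharedFSGoodUntil, the job attribute EstimatedRuntime, and
the convention that the central source holds a single epoch timestamp
for the next downtime start (or is empty when none is planned).

```python
"""Sketch: advertise a 'good until' time instead of a boolean.

A job could then require something like (attribute names hypothetical):
    requirements = (TARGET.SharedFSGoodUntil - time() > MY.EstimatedRuntime)
"""

# Sentinel meaning "no downtime scheduled": a far-future epoch second.
NO_DOWNTIME = 2**31 - 1


def parse_downtime(text):
    """Parse the central downtime entry; empty text means none is planned."""
    text = text.strip()
    return int(text) if text else None


def good_until(downtime_start, mount_ok):
    """Epoch second until which the mount is expected to remain usable."""
    if not mount_ok:
        return 0  # local check failed: nothing should match right now
    if downtime_start is None:
        return NO_DOWNTIME
    return downtime_start


def job_fits(good_until_epoch, est_runtime_s, now):
    """Python mirror of the requirement expression, for reasoning only."""
    return good_until_epoch - now > est_runtime_s


def classad_line(value):
    """Format the integer attribute as it would be printed on stdout."""
    return "SharedFSGoodUntil = {}".format(value)
```

Returning 0 when the local check fails means the same requirement
expression covers both cases: a broken mount and an approaching
downtime each leave too little time for any job to fit.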
Does this sound like the kind of hack you can live with?
-greg