[Condor-users] Avoid failing nodes? (automatically?)
- Date: Fri, 30 Nov 2007 08:16:50 +0100
- From: Steffen Grunewald <steffen.grunewald@xxxxxxxxxx>
- Subject: [Condor-users] Avoid failing nodes? (automatically?)
Good morning,
Every now and then, in a pool that's quite old, I see disk problems
resulting in filesystems being remounted read-only.
Such a node will happily accept Condor jobs, fail to run them, and
promptly be matched with another one (from the same user, due to
still-active claims).
This is like a black hole, eating all jobs in no time.
Is there a way to avoid such a situation, other than monitoring all the
nodes continuously? That may be impossible locally (a monitoring script
may no longer be able to run once the disk has failed) and would impose
extra network load if done remotely. Could the rate at which jobs are
negotiated to an individual node be limited? Or could there be a
"learning" process on the negotiator side which "sees" that a node no
longer produces successful job terminations?
Cheers,
Steffen
--
Steffen Grunewald * MPI Grav.Phys.(AEI) * Am Mühlenberg 1, D-14476 Potsdam
Cluster Admin * http://pandora.aei.mpg.de/merlin/ * http://www.aei.mpg.de/
* e-mail: steffen.grunewald(*)aei.mpg.de * +49-331-567-{fon:7233,fax:7298}
No Word/PPT mails - http://www.gnu.org/philosophy/no-word-attachments.html