Re: [Condor-users] Avoid failing nodes? (automatically?)
- Date: Fri, 30 Nov 2007 09:57:16 +0100
- From: Jan Ploski <Jan.Ploski@xxxxxxxx>
- Subject: Re: [Condor-users] Avoid failing nodes? (automatically?)
condor-users-bounces@xxxxxxxxxxx wrote on 11/30/2007 08:16:50 AM:
> Good morning,
>
> every now and then, in a pool that's quite old, I see disk problems
> resulting in filesystems remounted read-only.
> Such a node will happily accept Condor jobs, fail running them, and
> be re-negotiated for another one (from the same user, due to still
> active claims).
> This is like a black hole, eating all jobs in no time.
> Is there a way to avoid such a situation (except monitoring all the nodes
> continuously, which may be impossible locally - when a monitor script
> cannot run anymore because of the disk failure - and would impose extra
> network load if done remotely)? Limit the rate of jobs being negotiated
> to an individual node? A "learning" process on the negotiator side which
> "sees" that this node doesn't produce successful job terminations
> anymore?
Maybe match_list_length in the submit file, combined with LastMatchName0
in the job requirements, is what you need (see the condor_submit
documentation). There is also an example in section 5.3.7.3 of the manual
(that section is about Grid matchmaking, but the same mechanism works for
normal jobs, if I understand correctly).
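
An untested sketch of how this might look in a submit description file.
The universe and the executable name my_job.sh are assumptions made up
for illustration; only match_list_length and the requirements line relate
to the suggestion above.

```
# Hypothetical job; the executable name is a placeholder.
universe          = vanilla
executable        = my_job.sh

# Keep the Name of the last 2 machines this job was matched to in the
# job ad, as LastMatchName0 (most recent) and LastMatchName1.
match_list_length = 2

# Refuse to match those machines again. =!= is used instead of != so the
# expression is still true while LastMatchName0/1 are undefined, i.e.
# before the job has been matched at all.
requirements      = (TARGET.Name =!= LastMatchName0) && (TARGET.Name =!= LastMatchName1)

queue
```

As far as I can tell, this only steers a job away from machines that the
same job was matched to before, so it helps when a failed job returns to
the queue and is matched again. It would not by itself stop other jobs
from landing on the broken node.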
Regards,
Jan Ploski