Re: [Condor-devel] somewhat evil problem with work fetch and schedd claims. :(
- Date: Wed, 6 Feb 2008 00:32:43 -0800
- From: Derek Wright <wright@xxxxxxxxxxx>
- Subject: Re: [Condor-devel] somewhat evil problem with work fetch and schedd claims. :(
On Feb 6, 2008, at 12:15 AM, Daniel Forrest wrote:
> I won't claim to understand exactly what you're describing here, but
> any problem which is exacerbated by short running jobs is a problem
> that needs to be fixed. The problem is that short running jobs also
> include ill-behaved jobs (i.e. jobs that fail almost immediately from
> some job related error), and if these jobs are set to retry on error
> then you have an unintentional DoS attack on your pool.
Right, good point. A few countervailing tendencies:
A) There's an expression you can define for how often the startd will
fetch work. This is a ClassAd expression evaluated in the context of
the slot ad, so you can do quite fancy things if you want.
B) This is only an issue if you have a startd that is both trying to
fetch work externally *and* you're expecting regular schedd claims to
land there, too. While it's still early to tell, it seems like the
only viable use-cases for any of this startd work-fetching stuff
involve fetching work exclusively, not a mix of both.
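To make point A concrete, here is a rough sketch in Python of what "an expression evaluated in the context of the slot ad" amounts to. It is only a model, not Condor code: the function name `fetch_work_delay`, the dict standing in for the slot ad, and the 0 and 300 second values are all assumptions for illustration. `State` and `Activity` are ordinary slot ad attributes. In the real config this would be a single ClassAd expression, roughly `ifThenElse(State == "Claimed" && Activity == "Idle", 0, 300)`.

```python
def fetch_work_delay(slot_ad: dict) -> int:
    """Seconds the startd would wait before its next fetch attempt.

    Illustrative policy only: the dict stands in for the slot ad and the
    delay values are made up for this example.
    """
    # A claimed but idle slot has nothing to run, so fetch again right away.
    if slot_ad.get("State") == "Claimed" and slot_ad.get("Activity") == "Idle":
        return 0
    # In every other state, back off so failing or very short jobs cannot
    # drive a tight fetch loop.
    return 300


if __name__ == "__main__":
    print(fetch_work_delay({"State": "Claimed", "Activity": "Idle"}))
    print(fetch_work_delay({"State": "Unclaimed", "Activity": "Idle"}))
```

Because the expression sees the whole slot ad, the delay could just as well depend on other slot attributes, which is what "quite fancy things" refers to.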
> This is a real concern. We have had several incidents of this type on
> GLOW and it is an incredible PITA to have to first identify what is
> going on and then disable the source of the bad jobs.
Agreed.
> Please do not take option A.
I'd rather not, but it's not entirely up to me to decide. We'll
discuss this tomorrow in a meeting anyway, so we should have an
answer soon enough.
Thanks for your input,
-Derek