HTCondor Project List Archives




Re: [Condor-devel] somewhat evil problem with work fetch and schedd claims. :(




On Feb 6, 2008, at 12:15 AM, Daniel Forrest wrote:

> I won't claim to understand exactly what you're describing here, but
> any problem which is exacerbated by short running jobs is a problem
> that needs to be fixed.  The problem is that short running jobs also
> include ill-behaved jobs (i.e. jobs that fail almost immediately from
> some job related error), and if these jobs are set to retry on error
> then you have an unintentional DoS attack on your pool.

Right, good point.  A few countervailing tendencies:

A) There's an expression you can define for how often the startd will fetch work. It's a ClassAd expression evaluated in the context of the slot ad, so you can do quite fancy things if you want.

B) This is only an issue if your startd is trying to fetch work externally *and* you're expecting regular schedd claims to land there, too. While it's still early to tell, it seems like the only viable use cases for any of this startd work-fetching stuff involve fetching work exclusively, not a mix of both.
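
To make (A) a bit more concrete, here is a rough Python model of the kind of policy such an expression could encode. It is only a sketch, not ClassAd syntax and not the startd's actual logic: the `SlotAd` class, the `recent_failures` counter, and the `base`/`cap` defaults are all made up for illustration. The idea is to fetch again immediately after a clean run, and back off when jobs keep failing right away.

```python
from dataclasses import dataclass


@dataclass
class SlotAd:
    """Hypothetical stand-in for the slot ad attributes a delay expression might read."""
    state: str                 # e.g. "Claimed" or "Unclaimed"
    activity: str              # e.g. "Idle" or "Busy"
    recent_failures: int = 0   # made-up counter of jobs that failed almost immediately


def fetch_delay(slot: SlotAd, base: int = 300, cap: int = 3600) -> int:
    """Return the number of seconds to wait before the next fetch attempt."""
    # A claimed, idle slot with no recent failures just finished a job cleanly,
    # so ask for more work right away.
    if slot.state == "Claimed" and slot.activity == "Idle" and slot.recent_failures == 0:
        return 0
    # Otherwise wait at least `base` seconds, doubling for every recent
    # immediate failure, but never longer than `cap`.
    return min(cap, base * 2 ** slot.recent_failures)


if __name__ == "__main__":
    for ad in (SlotAd("Claimed", "Idle"),
               SlotAd("Unclaimed", "Idle"),
               SlotAd("Claimed", "Idle", recent_failures=3)):
        print(ad, "->", fetch_delay(ad), "seconds")
```

With a policy along these lines, a stream of jobs that die immediately would slow the fetch rate down rather than hammer the pool, which is the failure mode described above.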


> This is a real concern.  We have had several incidents of this type on
> GLOW and it is an incredible PITA to have to first identify what is
> going on and then disable the source of the bad jobs.

Agreed.

> Please do not take option A.

I'd rather not, but it's not entirely up to me to decide. We'll discuss this tomorrow in a meeting anyway, so we should have an answer soon enough.

Thanks for your input,
-Derek