Re: [Condor-devel] somewhat evil problem with work fetch and schedd claims. :(
- Date: Wed, 6 Feb 2008 02:15:25 -0600
- From: Daniel Forrest <forrest@xxxxxxxxxxxxx>
- Subject: Re: [Condor-devel] somewhat evil problem with work fetch and schedd claims. :(
Derek,
> dang... i guess this is exactly the sort of thing you don't think
> about until you're debugging and see it happening live. :(
>
> there's a annoying bug in the interaction of fetched claims and
> schedd-pushed ones. not sure what to do about this -- todd: i'm
> guessing you're going to say "screw it, not important", which is why
> i'm writing this up before i spend any more time on it. ;)
<snip>
> A) won't fix. either have longer running jobs from your fetched
> work, or just be happy that your startd is so busy with your fetched
> work, and forget about schedd-based claims. ;)
I won't claim to understand exactly what you're describing here, but
any problem that is exacerbated by short-running jobs is a problem
that needs to be fixed. Short-running jobs also include ill-behaved
jobs (i.e. jobs that fail almost immediately from some job-related
error), and if these jobs are set to retry on error then you have an
unintentional DoS attack on your pool.
This is a real concern. We have had several incidents of this type on
GLOW and it is an incredible PITA to have to first identify what is
going on and then disable the source of the bad jobs.
Please do not take option A.
--
Daniel K. Forrest Laboratory for Molecular and
forrest@xxxxxxxxxxxxx Computational Genomics
(608) 262 - 9479 University of Wisconsin, Madison