
Re: [Condor-users] condor_negotiator/condor_collector scheduling problem



On Fri, May 05, 2006 at 04:36:53PM -0400, Armen Babikyan wrote:
> 
> >>Hi Condor Team,
> >>
> >>A few weeks ago I described a problem I was having with Condor not 
> >>scheduling jobs on available resources.  I've recreated the problem in a 
> >>simpler way, without the need for a DAG.  It seems like 
> >>condor_negotiator and/or condor_collector are somehow misbehaving and 
> >>not matching jobs when there are resources and jobs that match.
> >>
> >>    
> >>
> >
> >It'd be more useful to see the output of 
> >condor_status -l and 
> >condor_q -l 
> >
> >when the situation you're describing is happening, along with
> >the NegotiatorLog, and possibly the ScheddLog.
> >
> 
> In case the inline snippet isn't helpful enough, I've uploaded all the 
> log files to:
> 
> http://www.static.net/~armenb/condor-negotiator-problem/


You're getting burned by Autoclustering, which is turned on in 6.7.18 (and
a few 6.7s back). There is no problem with the negotiator.

Autoclustering allows Condor to "combine" jobs that "look" the same when
it comes time to matchmake them. If one job gets rejected, all the other
jobs that "look the same" would also be rejected, so there is no need to
negotiate for them.

The problem is that Condor believes RESOURCE_1 jobs are the same as RESOURCE_2
jobs. Only when your last RESOURCE_1 job starts running does the next
negotiation cycle that looks for new resources finally send a RESOURCE_2 job
to the negotiator, and then the RESOURCE_2 jobs start.
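
A toy sketch of the grouping idea, in Python, may help here. This is not
Condor's implementation: the autocluster() function, the "Wants" attribute
and the job dictionaries are all invented for illustration. Jobs are grouped
by the values of whichever attributes are treated as significant, so if the
attribute that tells the two kinds of job apart is left out, they all land in
one group.

```python
from collections import defaultdict

def autocluster(jobs, significant_attrs):
    """Group job ids by the values of the significant attributes (toy model)."""
    groups = defaultdict(list)
    for job in jobs:
        # Jobs with identical values for every significant attribute
        # "look the same" and share one group.
        key = tuple(job.get(attr) for attr in sorted(significant_attrs))
        groups[key].append(job["id"])
    return list(groups.values())

# Hypothetical jobs: "Wants" stands in for whatever distinguishes them.
jobs = [
    {"id": 1, "Wants": "RESOURCE_1", "ImageSize": 100},
    {"id": 2, "Wants": "RESOURCE_1", "ImageSize": 100},
    {"id": 3, "Wants": "RESOURCE_2", "ImageSize": 100},
]

# "Wants" is not significant: all three jobs collapse into one group.
collapsed = autocluster(jobs, {"ImageSize"})

# "Wants" is significant: the RESOURCE_2 job gets its own group.
separated = autocluster(jobs, {"ImageSize", "Wants"})

print(collapsed)
print(separated)
```

In the collapsed case, a rejection of job 1 would be taken to cover job 3 as
well, which is the behaviour you are seeing.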

I think the problem is the syntax you're using for MY_RESOURCE_1 and _2. 
Try putting 
Requirements = MY_RESOURCE_1 == TRUE
instead of just
Requirements = MY_RESOURCE_1

in your submit file. I think that will show Condor that it needs to treat
MY_RESOURCE_1 as an attribute to autocluster on. (I also think we're still
working on autoclustering to fix problems like this, but I don't know the
exact details.)

If that doesn't work, try putting 
NEGOTIATE_ALL_JOBS_IN_CLUSTER = true 

in the config file for your submit machine, and retry your experiments. (I'm
not sure what that does with autoclustering, but it might fix it)

If that doesn't work, this will work for sure. Put:

START = TARGET.ClusterId > 0

into the config file of one of your execute machines, and reconfig the
execute machine. That will effectively turn off autoclustering in
your pool (because it will force Condor to autocluster on ClusterId, which
it usually can ignore).
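
As a rough illustration of why that works, here is a toy Python model (not
Condor code; group_by(), the job names and the attribute values are made up,
and it assumes your RESOURCE_1 and RESOURCE_2 jobs were submitted as separate
clusters). Once ClusterId is part of what gets compared, jobs from different
clusters can no longer be lumped together.

```python
from collections import defaultdict

def group_by(jobs, attrs):
    """Group job names by the values of the given attributes (toy model)."""
    groups = defaultdict(list)
    for job in jobs:
        key = tuple(job[attr] for attr in sorted(attrs))
        groups[key].append(job["Name"])
    return list(groups.values())

# Hypothetical jobs from two separate submissions (two ClusterIds).
jobs = [
    {"Name": "A", "ClusterId": 10, "ImageSize": 100},
    {"Name": "B", "ClusterId": 10, "ImageSize": 100},
    {"Name": "C", "ClusterId": 11, "ImageSize": 100},
]

# ClusterId ignored: everything looks the same.
without_id = group_by(jobs, {"ImageSize"})

# ClusterId compared: each submitted cluster is negotiated separately.
with_id = group_by(jobs, {"ImageSize", "ClusterId"})

print(without_id)
print(with_id)
```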

-Erik