
Re: [Condor-users] Queue problems



At 10:15 AM 4/18/2006, Andy Wettstein wrote:
Hi

We're running condor 6.7.18 and have noticed a problem when we add a
machine requirement to the submit file.  We have a submit file like
this:

Executable     = hello.sh
Universe       = vanilla
Output         = hello.out
Log            = hello.log
Requirements   = (machine == "xxx1")
Queue 10

hello.sh just echoes hello and sleeps for 30 seconds.  If we submit this
job and then change to machine xxx2 and submit again, we don't get any
jobs run on xxx2 until all the jobs on xxx1 have completed.  From what I
can tell, when we submit jobs this way condor stops trying to match
jobs in the queue after it rejects a job.  So since xxx1 has 4 vm's,
condor will start 4 jobs on it, then see it can't run the next job, and
then just skip the rest of the queue instead of trying to match the jobs
that should be able to run on xxx2.  If we take out the machine
requirement condor does run jobs simultaneously on xxx1 and xxx2 as
expected.

Could this be a configuration error of some sort or is this a bug with
condor?

This is an unfortunate bug that has been recently fixed for the next Condor release. So with v6.7.19+ you should not have to worry about it.

In v6.7.18, however, there is a bug in the code that automatically sets SIGNIFICANT_ATTRIBUTES.
There are a couple of ways you can work around it.

v6.7.18 workaround idea #1
Use a submit file that adds one level of indirection to the Requirements, like so:
   executable = hello.sh
   requirements = wanted
   +wanted = (machine == "xxx1")
   queue 10
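
For reference, here is roughly how idea #1 might look when applied to the full submit file from the original question. This is only a sketch: the attribute name "wanted" is arbitrary, the file name in the first comment is hypothetical, and the other lines are copied from the posted submit file.

```
# hello.submit (hypothetical file name)
Executable     = hello.sh
Universe       = vanilla
Output         = hello.out
Log            = hello.log
# Requirements refers to a custom job attribute instead of naming the machine directly
Requirements   = wanted
# The leading "+" puts the attribute into the job ad
+wanted        = (machine == "xxx1")
Queue 10
```

For the second batch, only the +wanted line should need to change, to (machine == "xxx2").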

v6.7.18 workaround idea #2
In your condor_config file, add
   SIGNIFICANT_ATTRIBUTES = ClusterId
and then *restart* the schedd (condor_restart -schedd).
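
As a sketch, the entry for idea #2 could be annotated like this so it is easy to find later. The comment lines are only a suggestion; the SIGNIFICANT_ATTRIBUTES line is the part that matters.

```
# Workaround for the v6.7.18 SIGNIFICANT_ATTRIBUTES bug.
# Should no longer be needed with v6.7.19 or later.
SIGNIFICANT_ATTRIBUTES = ClusterId
```

Because this setting makes negotiation behave like it did in v6.6.x, you will probably want to take it out again once you upgrade.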

Workaround #1 will result in better negotiation, but requires changes to all submit files. Workaround #2 requires no changes to submit files, but negotiation will perform about the same, for better or worse, as in Condor v6.6.x.

Again, this has already been fixed in the code for v6.7.19, which would normally appear on the web within a week or so (but this may be delayed by a few days because of the Condor Week conference in Madison, WI next week). Note that v6.7.19 of Condor is the *last* developer release before the next v6.8.0 stable release.

Sorry for this hassle w/ v6.7.18,
regards,
Todd



-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Todd Tannenbaum                       University of Wisconsin-Madison
Condor Project Research               Department of Computer Sciences
tannenba@xxxxxxxxxxx                  1210 W. Dayton St. Rm #4257
http://www.cs.wisc.edu/~tannenba      Madison, WI 53706-1685
Phone: (608) 263-7132  FAX: (608) 262-9777