Re: [Condor-users] Problems when submitting very large numbers of jobs to the queue



On Tue, 30 Nov 2004 10:50:43 +1100, Christopher Mellen
<chris.mellen@xxxxxxxxxx> wrote:
> We're using Condor 6.6.7 on a Cluster of ~20 3Ghz Windows XP machines.
> 
> One of our users has encountered the situation that if he simultaneously
> submits say 500+ jobs to the queue then the scheduling/matching process
> appears to fail. Unclaimed machines will enter the 'Matched' state, but the
> match will nearly always time out (according to the startd logs on the
> respective machines) before the job can be started. The machine then returns
> to the 'Unclaimed' state.
> 
> If he submits say, only 50 jobs at a time, the scheduling/matching process
> works without a hitch.
> 
> Is there a limit to the number of jobs that can be reliably queued ??
> 
> So far I've not been able to gain any insight from the manual. Any
> suggestions/hints much appreciated ...

The schedd is responsible for activating each match (either itself or
via the shadow it launches).
Your match to a machine will (sensibly) not last forever, though you can
tune it to last longer in the config files.

The problem arises if some of the following conditions apply:

a) your submitter machine is too slow
b) your submitter is overloaded
c) your submitter is also running a startd (see b)
d) your jobs involve the transfer of significant amounts of
input/executable data

This is compounded by the schedd being able to service only one claim
activation at a time while (as is likely if you submit to an unused
farm) all the startds are requesting activation at once.
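To make the arithmetic concrete, here is a toy model of that bottleneck in Python. It is only a sketch under loud assumptions: every match is made at the same instant, the schedd activates claims strictly one after another, and each activation takes the same time. The function name and the numbers in the usage note are hypothetical, not measurements of a real pool.

```python
def matches_activated_in_time(num_matches, seconds_per_activation, match_timeout):
    """Count the matches a strictly serial schedd activates before they expire.

    Toy model (assumption, not Condor's real behaviour in detail):
    all matches are made at t=0, the k-th activation (1-based) completes
    at k * seconds_per_activation, and a match is lost if its activation
    has not completed within match_timeout seconds.
    """
    if seconds_per_activation <= 0:
        # Instant activation: nothing can time out in this model.
        return num_matches
    in_time = int(match_timeout // seconds_per_activation)
    return min(num_matches, in_time)
```

With hypothetical values of 20 matched machines, 10 seconds per activation (say, because of a large input transfer) and a 120 second timeout, `matches_activated_in_time(20, 10, 120)` gives 12, so the other 8 machines would drop back to Unclaimed. Making each activation cheaper or the timeout longer both raise that number, which is what the suggestions below aim at.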

Some solutions:

tuning - remove or disable the startd on your submitter machine,
especially if the submitter is under high load

manual - submit all your jobs on hold and either use periodic release
(complex, but then out of your hands) or just run condor_release
yourself with a constraint that releases a few jobs at a time

throttling - set your max jobs running lower (if your submitter
machine just isn't capable of handling things); the obvious negative
is that fewer of your jobs run at once

up the timeout - set MATCH_TIMEOUT to be bigger

Manual releasing is the least error-prone and most configurable option,
since you have total control over the process.
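A minimal sketch of the manual approach in Python, assuming the jobs were all submitted on hold in a single cluster (for example with `hold = True` in the submit file) and that your version of condor_release accepts `-constraint`. The function names, batch size and pause are made up for illustration, and the fixed pause is a simplification: it does not check how many jobs have actually started.

```python
import subprocess
import time


def batch_constraints(cluster_id, total_jobs, batch_size):
    """Yield one constraint expression per batch of consecutive ProcIds."""
    for start in range(0, total_jobs, batch_size):
        end = min(start + batch_size, total_jobs)
        yield (f"ClusterId == {cluster_id} && "
               f"ProcId >= {start} && ProcId < {end}")


def release_in_batches(cluster_id, total_jobs, batch_size=50,
                       pause_seconds=300, run=subprocess.run):
    """Release held jobs a batch at a time, pausing between batches.

    `run` is injectable so the command list can be inspected without
    a Condor installation; by default it calls condor_release for real.
    """
    for index, constraint in enumerate(
            batch_constraints(cluster_id, total_jobs, batch_size)):
        if index > 0:
            # Give the schedd time to activate the previous batch.
            time.sleep(pause_seconds)
        run(["condor_release", "-constraint", constraint], check=True)
```

For a hypothetical cluster 42 holding 500 jobs, `release_in_batches(42, 500)` would run condor_release ten times, 50 jobs per call. Pick a batch size that your submitter is known to handle (50 at a time worked for the original poster).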

Matt