Re: [Condor-users] Condor 6.8.n: job scheduling process delays
- Date: Wed, 14 Oct 2009 08:47:11 -0500
- From: Daniel Forrest <dan.forrest@xxxxxxxxxxxxx>
- Subject: Re: [Condor-users] Condor 6.8.n: job scheduling process delays
Kevin.Buckley@xxxxxxxxxxxxx wrote:
>
> > You may want to check the StartLog on the machine in question. It
> > appears that there may be some network issues between the shadow and
> > starter.
> >
> > When the shadow returned 100, I believe that is the OS errno.
> >
> > For Linux that is:
> > #define ENETDOWN 100 /* Network is down */
>
> OK, will do.
No, the shadow exit code is not a Unix errno value.
See: http://condor-wiki.cs.wisc.edu/index.cgi/wiki?p=MagicNumbers
100 - JOB_EXITED - The job exited (not killed)
This is the normal status value for a completed job.
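To illustrate why the number 100 is easy to misread, here is a minimal Python sketch. The table holds only the one value quoted above; the helper names (`describe_shadow_exit`, `errno_name`) are hypothetical and not part of Condor. The errno lookup shows what the local OS happens to call the same number, which is unrelated to the shadow's meaning.

```python
import errno

# Only the value quoted above is listed here; the MagicNumbers wiki page
# is the place to look up the remaining shadow exit codes.
SHADOW_EXIT_CODES = {
    100: "JOB_EXITED - The job exited (not killed)",
}


def describe_shadow_exit(code):
    """Return the Condor meaning of a shadow exit code, if listed above."""
    return SHADOW_EXIT_CODES.get(code, "unknown (check the MagicNumbers page)")


def errno_name(code):
    """Return the local OS errno name for a number, for comparison only."""
    return errno.errorcode.get(code, "no errno with this value")


if __name__ == "__main__":
    # The same integer, read two different ways.
    print("shadow exit 100:", describe_shadow_exit(100))
    print("errno 100 on this OS:", errno_name(100))
```

The errno name printed depends on the platform, which is another reason not to interpret a shadow exit code through errno.h.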
> > Also, if "<IP.AD.DR.ESS:port>" is the actual value in the logs then you
> > likely have some condor_config issues (again check your execute node),
> > which I believe could lead to the aforementioned error.
>
> Nope, that was just me anonymising things.
Back to your original question, this is entirely a scalability issue.
Prior to the 6.9.3 release the schedd simply couldn't handle more than
a few thousand jobs in the job queue without a severe degradation in
performance. I believe your previous message stated you had around
17,500 jobs in the queue - this simply won't work with Condor 6.8.
The easiest temporary solution is to keep only a few thousand jobs in
the queue at a time. Since it was a single user with that many jobs,
maybe they can submit them manually in chunks or convert to using
DAGMan. There is a current thread, "restricting the number of jobs",
that discusses this:
https://lists.cs.wisc.edu/archive/condor-users/2009-October/msg00068.shtml
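As a rough sketch of the "submit in chunks" idea, the Python below splits a large job list into batches. The batch size of 2000 is an assumption standing in for "a few thousand", and `chunked` is a hypothetical helper; the sketch makes no Condor calls and leaves the actual submission, and waiting for each batch to drain, to the user.

```python
def chunked(items, size):
    """Yield consecutive slices of `items`, each with at most `size` entries."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


if __name__ == "__main__":
    # Stand-in for the roughly 17,500 jobs mentioned above.
    jobs = list(range(17500))

    # Assumed batch size; pick whatever the schedd handles comfortably.
    batch_size = 2000

    for number, batch in enumerate(chunked(jobs, batch_size), start=1):
        # A real script would submit this batch here and wait for the
        # queue to shrink before submitting the next one.
        print("batch", number, "holds", len(batch), "jobs")
```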
--
Daniel K. Forrest
dan.forrest@xxxxxxxxxxxxx
(608) 890 - 0558
Space Science and Engineering Center
University of Wisconsin, Madison