Mailing List Archives
Re: [Condor-users] Lazy jobs that never really start running
- Date: Wed, 6 Jul 2005 12:58:21 +0100
- From: Matt Hope <matthew.hope@xxxxxxxxx>
- Subject: Re: [Condor-users] Lazy jobs that never really start running
On 7/6/05, Horvatth Szabolcs <szabolcs@xxxxxxxxxxxxx> wrote:
> >Have you done a condor_store_cred ?
> >have you changed your password since you last did...
>
> No, nothing like that.
> The situation is quite simple:
> I submit a few dagman jobs to the queue that spawn about 2000 jobs.
> Hours later I take a look at the queue and find some tasks that never got matched,
> just sit there being idle or pretend to be running (without any shadow).
>
> There were no configuration changes in the meantime, all jobs had the same "chance"
> and priority to run.
>
> If I restart the computer that submitted the tasks then things seem to catch up and
> continue computing, but this is a very annoying and brute force solution.
>
> After trying fulldebug for the scheduler I found this:
> 7/6 12:46:53 Reached MAX_JOBS_RUNNING: no more can run, 0 jobs matched, 41 jobs idle
>
> Which is funny because only four jobs were running in reality and 8 were thought to be running.
How many computing nodes do you have? How much effort is it to
start/finish* the jobs?
Your schedd machine may be dying under the load... 200 concurrent jobs
is optimistic on Windows without some special tweaks in the registry.
I would suggest 100 as a more realistic maximum to try initially.
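
As a rough illustration of the suggestion above, the cap can be lowered on the submit machine through the MAX_JOBS_RUNNING setting, the same limit the schedd reports in the log line quoted earlier. The file name in the comment is an assumption; use whichever local config file your install actually reads.

```
# Local config file on the submit (schedd) machine, e.g. condor_config.local
# (file name is an assumption; it depends on your install).
# Upper bound on the number of jobs this schedd will run at once.
MAX_JOBS_RUNNING = 100
```

Running condor_reconfig on the submit machine afterwards should make the schedd re-read its configuration.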
Hard to say without more logs. Was there anything in the ShadowLog? Is
your disk filling up?
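
If it helps to see how often the schedd hits the limit, here is a minimal Python sketch that picks those messages out of a schedd log. The pattern is modelled only on the single line quoted above, so other Condor versions may word it differently, and the function name find_limit_hits is purely illustrative.

```python
import re

# Pattern modelled on the one schedd log line quoted in this thread;
# the exact wording may differ between Condor versions.
LIMIT_RE = re.compile(
    r"(?P<stamp>\d+/\d+ \d+:\d+:\d+) Reached MAX_JOBS_RUNNING: "
    r"no more can run, (?P<matched>\d+) jobs matched, (?P<idle>\d+) jobs idle"
)


def find_limit_hits(lines):
    """Return (timestamp, matched, idle) for each line reporting the limit."""
    hits = []
    for line in lines:
        m = LIMIT_RE.search(line)
        if m:
            hits.append(
                (m.group("stamp"), int(m.group("matched")), int(m.group("idle")))
            )
    return hits
```

Passing it an open file handle for the schedd log (or any list of lines) returns one tuple per hit, which makes it easy to see whether the limit is reached constantly or only in bursts.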
Matt
* size of files staged to the machine, size of files returned on completion