Re: [Condor-users] Lazy jobs that never really start running
- Date: Wed, 6 Jul 2005 13:39:23 +0100
- From: Matt Hope <matthew.hope@xxxxxxxxx>
- Subject: Re: [Condor-users] Lazy jobs that never really start running
On 7/6/05, Horvatth Szabolcs <szabolcs@xxxxxxxxxxxxx> wrote:
> >How many computing nodes do you have.
>
> 28 computers, 56 processors at the moment, dedicated to computing.
Right, that should be fine.
> > How much effort is it to start/finish* the jobs.
>
> All files are read from a mapped server, the only data that is transferred is
> the dagman application (by default) and a small batch file that launches the
> computation for the jobs. (~500byte)
You're transferring dagman itself? Why?
> The max jobs limit of 200 is optimistic indeed but since I only have 56 processors
> there is no way of launching more than that amount at the same time.
> (Shadows are only started when the process is matched and launched on a machine, am I right?)
That's correct. The exact mechanics are in a previous post in the
archives if you want them.
> The strange thing was that Condor wrote about this maxjob limit in the log file while condor_status
> only reported 4 running processes.
condor_status reports what the *collector* says. This is always
delayed (or plain inaccurate if there are problems with a machine, as
it tends to fail to report the right thing).
The machine may be losing track of the shadows.
> >Hard to say without more logs. Was there anything in the ShadowLog?
>
> No, there were no problems in the shadow log. Since shadows were not launched at all
> the log remained empty. The problem might be somewhere in the matchmaking department.
How about the MasterLog (reports of processes dying and the like)?
> The strangest thing is that restarting the scheduling machine did fix the problem, so both
> configuration issues, job/machine requirements, computer limitations are out of the question.
Does a condor_reconfig do the same?
How about net stop condor/net start condor?
This would isolate some machine-specific issue (dodgy memory, overheating, etc.).
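
If it helps, here is a rough Python sketch of trying those two steps one at a time, least disruptive first. The commands are the ones mentioned above; the step names ("reconfig", "restart"), the `run_step` helper and the injectable `runner` are only illustrative. It assumes condor_reconfig is on the PATH and that the Windows service is named "condor", as in the net stop/net start commands above.

```python
import subprocess
from typing import Callable, Dict, List, Sequence, Tuple

# Commands from the message above, grouped into two hypothetical step names.
# Assumption: the Windows service is called "condor".
STEPS: Dict[str, List[List[str]]] = {
    "reconfig": [["condor_reconfig"]],
    "restart": [["net", "stop", "condor"], ["net", "start", "condor"]],
}


def run_step(
    name: str,
    runner: Callable[[Sequence[str]], int] = subprocess.call,
) -> List[Tuple[str, int]]:
    """Run the commands for one step and return (command, exit code) pairs."""
    results = []
    for cmd in STEPS[name]:
        code = runner(cmd)
        results.append((" ".join(cmd), code))
        if code != 0:
            # Stop here, e.g. do not start the service if stopping it failed.
            break
    return results


if __name__ == "__main__":
    # Try the reconfig first, then check by hand whether jobs start running
    # before moving on to run_step("restart").
    print(run_step("reconfig"))
```

Check whether jobs start after each step. If the reconfig is enough, or the service restart is, the reboot itself was probably not what fixed things; if only a full reboot helps, the machine is the more likely suspect.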
Matt