
Re: [Condor-users] Lazy jobs that never really start running



On 7/6/05, Horvatth Szabolcs <szabolcs@xxxxxxxxxxxxx> wrote:
> > How many computing nodes do you have?
> 
> 28 computers, 56 processors at the moment, dedicated to computing.

right - should be fine
 
> > How much effort is it to start/finish* the jobs.
> 
> All files are read from a mapped server; the only data that is transferred is
> the dagman application (by default) and a small batch file that launches the
> computation for the jobs. (~500 bytes)

You're transferring dagman itself? Why?
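
For what it's worth, you can see exactly what DAGMan's own submit file will
run and transfer by generating it without submitting (the .dag file name
below is just a placeholder):

    condor_submit_dag -no_submit jobs.dag
    type jobs.dag.condor.sub

and then check the executable and transfer lines in there, plus
should_transfer_files / transfer_input_files in the node submit files.
DAGMan itself runs as a scheduler universe job on the submit machine, so it
shouldn't need to travel to the execute nodes at all.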
 
> The max jobs limit of 200 is optimistic indeed but since I only have 56 processors
> there is no way of launching more than that amount at the same time.
> (Shadows are only started when the process is matched and launched on a machine, am I right?)

That's correct; the exact mechanics are covered in a previous post in the
archives if you want the details.
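
If it helps, the shadow side is easy to sanity-check on the submit machine:
each running job gets one condor_shadow process, and the schedd has its own
cap on those, separate from the -maxjobs you give DAGMan:

    condor_config_val MAX_JOBS_RUNNING
    tasklist | findstr condor_shadow

(tasklist assumes a reasonably recent Windows; counting condor_shadow.exe in
Task Manager tells you the same thing.)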

> The strange thing was that Condor wrote about this maxjob limit in the log file while condor_status
> only reported 4 running processes.

condor_status reports what the *collector* says. This is always delayed
(or plain inaccurate if there are problems with a machine, as it tends to
fail to report the right thing).
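
One way to put the two views side by side:

    condor_q -run
    condor_status -claimed

condor_q asks the schedd directly, so it shows which jobs actually have
shadows and where they matched; condor_status -claimed shows what the
collector currently believes is claimed. If those two stay out of step for
more than a few update cycles, the problem is in the daemons rather than in
your jobs.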

The machine may be losing track of the shadows.
 
> >Hard to say without more logs. Was there anything in the ShadowLog?
> 
> No, there were no problems in the shadow log. Since shadows were not launched at all
> the log remained empty. The problem might be somewhere in the matchmaking department.

How about the MasterLog (reports of processes dying and the like)?
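
If you're not sure where those end up, the log directory is in the config,
and something like this (run from that directory) is a quick first pass:

    condor_config_val LOG
    findstr /i "ERROR died exited" MasterLog SchedLog

ShadowLog, MasterLog and SchedLog all live there. Since no shadows were even
started, SchedLog on the submit machine is the other one worth a look -
that's where matches and shadow spawning get logged.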
 
> The strangest thing is that restarting the scheduling machine did fix the problem, so
> configuration issues, job/machine requirements and computer limitations are out of the question.

Does a condor_reconfig do the same?

How about net stop condor/net start condor?

This would isolate a machine-specific issue (dodgy memory, overheating, etc.)
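
Roughly in increasing order of bluntness (the last two assume the usual
Windows service install):

    condor_reconfig
    condor_restart
    net stop condor
    net start condor

condor_reconfig just re-reads the config, condor_restart has the master
restart its daemons, and stopping/starting the service takes everything down
including the master. If only the full restart clears it, that does smell
like a daemon or machine problem rather than configuration.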

Matt