Re: [Condor-users] jobs wait in idle mode unecessarily
- Date: Tue, 22 Jun 2004 13:44:20 -0500
- From: Erik Paulson <epaulson@xxxxxxxxxxx>
- Subject: Re: [Condor-users] jobs wait in idle mode unecessarily
On Mon, Jun 21, 2004 at 12:38:15PM +0100, Dr Ian C. Smith wrote:
> It's a vanilla job and the file permissions are OK (it's
> under win 2k). Also there are no nice user options
> specified. Unfortunately I can't seem to reproduce it at
> the moment but I'm getting a similar possibly related
> problem that killed jobs hang around in the idle state.
>
What do you mean "killed jobs hang around in the idle state"?
> C:\Condor\ics>condor_q -analyze
> -- Submitter: 102153-71130c.liv.ac.uk : <138.253.102.153:1042> : 102153-71130c.liv.ac.uk
> ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
> ---
> 186.000: Run analysis summary. Of 2 machines,
> 1 are rejected by your job's requirements
> 0 reject your job because of their own requirements
> 0 match, but are serving users with a better priority in the pool
> 1 match, but prefer another specific job despite its worse user-priority
> 0 match, but will not currently preempt their existing job
> 0 are available to run your job
> Last successful match: Mon Jun 21 12:31:39 2004
>
> 1 jobs; 1 idle, 0 running, 0 held
>
> This from SchedLog looks pertinent:
>
> 6/21 12:22:09 DaemonCore: Command received via TCP from host <138.253.102.153:2309>
> 6/21 12:22:09 DaemonCore: received command 443 (VACATE_SERVICE), calling handler (vacate_service)
> 6/21 12:22:09 Got VACATE_SERVICE from <138.253.102.153:2309>
> 6/21 12:22:09 Sent RELEASE_CLAIM to startd on <138.253.102.153:1041>
> 6/21 12:22:09 Match record (<138.253.102.153:1041>, 183, 0) deleted
> 6/21 12:22:09 DaemonCore: Command received via UDP from host <138.253.102.153:2311>
> 6/21 12:22:09 DaemonCore: received command 60001 (DC_PROCESSEXIT), calling handler (HandleProcessExitCommand())
> 6/21 12:22:09 Scheduler::Relinquish - mrec is NULL, can't relinquish
> 6/21 12:22:09 Null parameter --- match not deleted
>
That is only a snippet, and not enough to tell us anything.
To debug this, the first question to ask is "does this job ever match?", i.e.
does Condor ever even try to start the job? It seems from the above that
it does, so condor_q -analyze isn't going to tell us anything more.
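As a rough sketch of that first check, the "Last successful match" line can be pulled out of saved condor_q -analyze output. The helper below is hypothetical and not part of Condor; it assumes the timestamp is formatted as in the output quoted above.

```python
import re
from datetime import datetime

# Hypothetical helper, not part of Condor. It looks for the
# "Last successful match: ..." line in saved `condor_q -analyze` output.
# Searching (rather than anchoring at line start) tolerates "> " quote prefixes.
MATCH_RE = re.compile(r"Last successful match:\s*(.+?)\s*$", re.MULTILINE)

def last_successful_match(analyze_output):
    """Return the last match time as a datetime, or None if no such line exists."""
    m = MATCH_RE.search(analyze_output)
    if m is None:
        return None
    # Assumed timestamp layout, e.g. "Mon Jun 21 12:31:39 2004".
    return datetime.strptime(m.group(1), "%a %b %d %H:%M:%S %Y")
```

If this returns a time, the job has matched at least once, and the daemon logs are the next place to look.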
What would help the most would be:
1. The full schedd log
2. The shadow log
3. The job log file (i.e. the file that you set with 'log = somelogfile.log' in
your submit file)
4. The StarterLog from the execute machine.
It would also be handy to have the full output of 'condor_q -l' and
'condor_status -l'.
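A minimal sketch for checking which of those daemon logs are present before sending them. It assumes the default file names SchedLog, ShadowLog and StarterLog (a site's configuration can rename them) and that you pass in the log directory yourself, for example the one reported by 'condor_config_val LOG'. The function name is made up for illustration. Remember that the StarterLog has to come from the execute machine.

```python
import os

# Default daemon log file names. These are assumptions: a site can
# rename them in its Condor configuration.
WANTED_LOGS = ("SchedLog", "ShadowLog", "StarterLog")

def find_logs(log_dir, names=WANTED_LOGS):
    """Return (found, missing): full paths of logs present in log_dir,
    and the names of those that are absent."""
    found, missing = [], []
    for name in names:
        path = os.path.join(log_dir, name)
        if os.path.isfile(path):
            found.append(path)
        else:
            missing.append(name)
    return found, missing
```

The job log from the submit file and the output of 'condor_q -l' and 'condor_status -l' still need to be collected separately.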
<...>
>
> >> 1 match, but prefer another specific job despite its worse user-priority
> >
> >I think there are quite a number of things that cause this.
> >
Indeed. In 6.6.6 we've changed this message to be more (less?) helpful:
it will now say "1 match, but reject the job for unknown reasons".
Now at least it won't send you off on a wild goose chase.
-Erik