Re: [Condor-users] jobs stuck in queue
- Date: Mon, 22 Aug 2011 14:58:02 -0300
- From: Fabricio Cannini <fcannini@xxxxxxxxx>
- Subject: Re: [Condor-users] jobs stuck in queue
On Friday, 19 August 2011, at 19:09:54, Koller, Garrett wrote:
> Mr. Cannini,
>
> I'm not yet familiar with running MPI jobs on Condor, but I think I've come
> across a similar situation. First of all, run 'condor_q -better-analyze'
> to figure out if the job's requirements are causing it to not be scheduled
> in the first place. If it says "not yet considered by matchmaker" or
> something, it usually means that it is being run but encounters an error
> shortly thereafter and so is continuously put back on the queue. Check
> the MatchLog. If it keeps saying that the same job is "Matched", it means
> that the job was scheduled successfully but something is going wrong on the
> execute machine. Check which slot and which machine the job is assigned
> to. Go to the log files of that machine and look for the StarterLog for
> that slot. The bottom of that log should tell you what error your program
> encountered that caused it to exit. Let me/us know if this doesn't help
> you diagnose and solve the problem.
>
> Best Regards,
> - Garrett
Hi.
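If I understand the suggestion correctly, the checks translate to roughly
these commands (just a sketch; the log paths assume the packaged default of
/var/log/condor, and "slot3" / job id 35 are only examples taken from the
output below):

  # ask the schedd why the job is not being matched/run
  condor_q -better-analyze 35

  # on the central manager: is the negotiator repeatedly matching the same job?
  grep Matched /var/log/condor/MatchLog

  # on the execute node: what did the starter for that slot report last?
  tail -n 50 /var/log/condor/StarterLog.slot3
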
'condor_q -better-analyze 35' says this:
-- Submitter: master.internal.domain : <172.17.8.121:42584> :
master.internal.domain
===============================
---
035.000: Run analysis summary. Of 0 machines,
0 are rejected by your job's requirements
0 reject your job because of their own requirements
0 match but are serving users with a better priority in the pool
0 match but reject the job for unknown reasons
0 match but will not currently preempt their existing job
0 match but are currently offline
0 are available to run your job
WARNING: Be advised:
No resources matched request's constraints
WARNING: Be advised: Request 35.0 did not match any resource's constraints
===============================
The StartLog of both nodes has messages like this:
+++++++++++++++++++++++++++++++
08/19/11 17:21:30 slot3: match_info called
08/19/11 17:21:30 slot3: Received match <172.17.8.51:56215>#1313779372#3#...
08/19/11 17:21:30 slot3: State change: match notification protocol successful
08/19/11 17:21:30 slot3: Changing state: Unclaimed -> Matched
08/19/11 17:21:30 PERMISSION DENIED to unauthenticated@unmapped from host
172.17.8.121 for command 442 (REQUEST_CLAIM), access level DAEMON: reason:
DAEMON authorization policy contains no matching ALLOW entry for this request;
identifiers used for this host: 172.17.8.121,master,master.internal.domain,internal.domain
08/19/11 17:21:51 slot4: match_info called
08/19/11 17:21:51 slot4: Received match <172.17.8.51:56215>#1313779372#4#...
08/19/11 17:21:51 slot4: State change: match notification protocol successful
08/19/11 17:21:51 slot4: Changing state: Unclaimed -> Matched
08/19/11 17:21:51 PERMISSION DENIED to unauthenticated@unmapped from host
172.17.8.121 for command 442 (REQUEST_CLAIM), access level DAEMON: reason:
cached result for DAEMON; see first case for the full reason
+++++++++++++++++++++++++++++++
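If I read that right, the startd on the nodes refuses the schedd's
REQUEST_CLAIM at the DAEMON authorization level, so the match never turns
into a claim. My guess (host names taken from the log above; I may well be
missing something in the security setup) is that the execute nodes'
condor_config needs the submit host added to the DAEMON allow list,
something like:

  # on each execute node (or a config.d snippet), then run condor_reconfig:
  ALLOW_DAEMON = $(ALLOW_DAEMON), master.internal.domain, 172.17.8.121

Does that sound like the right direction?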
TIA