Re: [Condor-users] Tracing why nodes reject jobs?
- Date: Fri, 16 Jul 2010 10:03:42 -0400
- From: "Jonathan D. Proulx" <jon@xxxxxxxxxxxxx>
- Subject: Re: [Condor-users] Tracing why nodes reject jobs?
On Fri, Jul 16, 2010 at 11:10:03AM +0100, Ian Cottam wrote:
:Is it memory? Try adding the Requirement Memory > 0
:and also
:Rank = Memory
Nope, memory was my first guess too, but the jobs are small enough
to fit on any of our nodes, and some nodes have 8x the memory they need...
-Jon
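
As a rough cross-check of the analysis summary quoted below, a minimal Python sketch can tally the categories and confirm that they account for every machine in the pool. The function name `parse_summary` and the pasted sample text are illustrative assumptions, not part of Condor's tooling; the final "available" line is left out for brevity.

```python
import re

# Sample text copied from the analysis summary quoted in this thread.
# The wrapped "in the pool" continuation is joined onto its line.
SUMMARY = """\
78745.005: Run analysis summary. Of 429 machines,
19 are rejected by your job's requirements
410 reject your job because of their own requirements
0 match but are serving users with a better priority in the pool
0 match but reject the job for unknown reasons
0 match but will not currently preempt their existing job
0 match but are currently offline
"""


def parse_summary(text):
    """Return (total_machines, {category: count}) from a summary block."""
    total = None
    counts = {}
    for line in text.splitlines():
        line = line.strip()
        # Header line carries the pool size: "Of 429 machines,"
        header = re.search(r"Of (\d+) machines", line)
        if header:
            total = int(header.group(1))
            continue
        # Category lines start with a count followed by a description.
        entry = re.match(r"(\d+) (.+)", line)
        if entry:
            counts[entry.group(2)] = int(entry.group(1))
    return total, counts


total, counts = parse_summary(SUMMARY)
accounted = sum(counts.values())        # machines covered by the categories
top = max(counts, key=counts.get)       # category with the most machines

print(f"{accounted} of {total} machines accounted for")
print(f"largest category: {top}")
```

With the figures from this thread, the two rejection categories alone cover the whole pool, which is consistent with no machine being reported under any of the "match but ..." lines.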
:
:-Ian
:
:[currently out of office]
:
:On 15 Jul 2010, at 16:52, "Jonathan D. Proulx" <jon@xxxxxxxxxxxxx> wrote:
:
:> Hi All,
:>
:> I have a user who queued a couple hundred identical Standard Universe
:> jobs (well the parameters were a little different but the class ads
:> were the same), most completed but 15 are hanging around in idle state
:> after having accumulated some runtime, but will no longer match any
:> execute nodes:
:>
:> ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
:> ---
:> 78745.005: Run analysis summary. Of 429 machines,
:> 19 are rejected by your job's requirements
:> 410 reject your job because of their own requirements
:> 0 match but are serving users with a better priority
:> in the pool
:> 0 match but reject the job for unknown reasons
:> 0 match but will not currently preempt their existing job
:> 0 match but are currently offline
:> 0 are available to run your job
:> Last successful match: Tue Jul 6 17:33:30 2010
:> Last failed match: Thu Jul 15 11:46:30 2010
:> Reason for last match failure: no match found
:>
:> The 19 rejected for job requirements are clear (wrong ARCH), but the 410
:> for node requirements are odd in several ways:
:>
:> 1) there are 410 total systems available and 344 are currently claimed
:> so I'd expect those to be either "match but are serving users with a
:> better priority in the pool" or "match but will not currently preempt
:> their existing job"
:>
:> 2) clearly some nodes used to match, or the job wouldn't have had runtime
:>
:> Where/how can I see why a specific node rejects a given job?
:>
:> -Jon
:> _______________________________________________
:> Condor-users mailing list
:> To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
:> subject: Unsubscribe
:> You can also unsubscribe by visiting
:> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
:>
:> The archives can be found at:
:> https://lists.cs.wisc.edu/archive/condor-users/