Re: [Condor-users] negotiation weirdness
- Date: Wed, 31 Oct 2007 12:24:09 -0400
 
- From: "Ian Chesal" <ICHESAL@xxxxxxxxxx>
 
- Subject: Re: [Condor-users] negotiation weirdness
 
Speaking of CLAIM_WORKLIFE and a schedd's claim on a slot: when a job
finishes running in a slot and the claim is held by the schedd, what's the
algorithm for picking the next job that should run in that slot from the
list of jobs in the schedd?
 
We're using auto-retirement on our jobs, but it means we're hit with about
an 8% efficiency penalty due to the negotiating overhead. That is, we can
never fill our bigger pools and push them to 100% utilization; we always see
~8% of our pool unutilized as jobs finish and machines get renegotiated. We
automatically put machines into the retirement state after 20 minutes.
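
As a rough sanity check on that figure, here is a minimal sketch of the arithmetic. It assumes a deliberately simplified model that is not stated above: each slot is busy for about 20 minutes per claim and then sits idle for a fixed gap while it waits to be matched again. The function names are hypothetical, and the model ignores the time a running job needs to finish after retirement starts.

```python
def utilization(busy_minutes, idle_gap_minutes):
    """Fraction of time a slot does useful work in one busy/idle cycle."""
    return busy_minutes / (busy_minutes + idle_gap_minutes)


def idle_gap_for(busy_minutes, target_utilization):
    """Idle gap per cycle that would produce the given utilization."""
    return busy_minutes * (1.0 - target_utilization) / target_utilization


# Assumed inputs: 20 busy minutes per claim, 92% observed utilization.
gap = idle_gap_for(20.0, 0.92)
print(f"implied idle gap per claim: {gap:.2f} minutes")
print(f"utilization check: {utilization(20.0, gap):.2%}")
```

Under these assumptions, an 8% loss corresponds to a bit under two idle minutes per 20-minute claim, so a longer claim lifetime or a shorter rematch gap would both shrink the penalty.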
 
- Ian
  
Grant,

The negotiation process can be a subtle process to debug. The condor
negotiator creates matches between schedulers and machines. These matches
mean that a slot will be claimed by a scheduler. This claim will span
multiple job executions, so for efficiency reasons, when the slot is done
with a job, it requests another job from the scheduler with the same
significant attributes.

In this case, I suspect that the other Sergey jobs are finishing, and the
new ones are started because the machine is still "Claimed" by the sergey
scheduler. You can specify how long this claim will stay in effect using the
"CLAIM_WORKLIFE" setting in 6.8.*. The default is -1, and will thus cause
*all* the sergey jobs to finish executing. If you set it to, say, 1 second,
then the first job to execute should finish executing (presumably taking
longer than 1 second) and the claim will be released for a new match-making
cycle.
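
The behaviour described above can be sketched as a toy model. This is only an illustration under stated assumptions, not Condor's actual implementation: the function `pick_next_submitter` and its arguments are hypothetical, a negative worklife is treated as "the claim never expires", and the negotiation cycle is reduced to "pick the waiting group with the best (lowest) effective priority", ignoring requirements and rank. The queue sizes and priorities are the 10:40 figures from Grant's message.

```python
def pick_next_submitter(claim_owner, claim_age_s, claim_worklife_s,
                        waiting, priority):
    """Return (submitter, reason) for the next job on a slot that just went idle.

    waiting:  dict mapping submitter to number of idle jobs in the queue
    priority: dict mapping submitter to effective priority (lower is better)
    """
    # Assumption: a negative worklife means the claim never expires.
    claim_alive = claim_worklife_s < 0 or claim_age_s < claim_worklife_s
    if claim_alive and waiting.get(claim_owner, 0) > 0:
        # The claim holder hands the slot another of its own jobs.
        return claim_owner, "claim reused, no negotiation"

    # Claim released: simplified stand-in for a new negotiation cycle.
    candidates = [s for s, n in waiting.items() if n > 0]
    if not candidates:
        return None, "slot idle"
    return min(candidates, key=priority.get), "matched by negotiator"


# Waiting jobs and effective priorities at 10:40 (from the original post).
waiting = {"sergey": 604, "jgalford": 16, "ljacobson": 0}
priority = {"sergey": 9.57, "jgalford": 3.66, "ljacobson": 0.51}

# A sergey-claimed slot frees up 600 seconds into the claim.
print(pick_next_submitter("sergey", 600, -1, waiting, priority))
print(pick_next_submitter("sergey", 600, 1, waiting, priority))
```

With a worklife of -1 the slot goes straight back to sergey; with a worklife of 1 second the claim has expired, so the slot is rematched and the jgalford jobs get their turn.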

Good luck, I believe this is the issue, and let me know how this works out
for you.

Best,
Jason
-- 
===================================
Jason A. Stowe
Phone: 607.227.9686
jstowe@xxxxxxxxxxxxxxxxxx
Cycle Computing, LLC
Enterprise Condor Support
http://www.cyclecomputing.com

On 10/31/07, Grant Goodyear <grant@xxxxxxxxxxxxxxxxx> wrote:
> I'm seeing somewhat strange results in job negotiation/scheduling.
>
> We're running a small (~60-node) condor cluster on a dozen or so windows
> boxes.  One box (crossroads) is the central manager (submit,manage), and
> the rest are all dedicated submit,execute machines with preemption
> turned off.  (The node config can be seen in
> http://www.grantgoodyear.org/~grant/condorlogs/condor_config.txt )
> When one user submits a large number of jobs, we're seeing his jobs get
> scheduled despite the fact that other users have better priorities.
>
> Here's a 10-minute view of what's running and the user priorities:
>
> Oct. 30, 10:40am
> http://www.grantgoodyear.org/~grant/condorlogs/running_200710301040.txt
> http://www.grantgoodyear.org/~grant/condorlogs/priorities_200710301040.txt
>
> Oct. 30, 10:50am
> http://www.grantgoodyear.org/~grant/condorlogs/running_200710301050.txt
> http://www.grantgoodyear.org/~grant/condorlogs/priorities_200710301050.txt
>
> We script the submission files, and use group accounting, so even though
> all jobs have the same owner, all of the jobs run from c:\sergey have
> +AccountingGroup = "sergey" set, the c:\jgalford jobs are in the
> "jgalford" group, and the c:\ljacobson job is in the "ljacobson" group.
>
> At 10:40, sergey has an effective priority of 9.57, jobs 52800-52877
> (submitted on crossroads) are running, and jobs 52878-53481 (crossroads)
> are waiting.  Group ljacobson has job 270 (submitted from littleboy)
> running, and nothing waiting in the queue.  His priority is 0.51, but
> since he has nothing waiting it doesn't matter.  Group jgalford has job 498
> (submitted from fatman) running, jobs 483-487 (submitted from
> greenhouse) running, and jobs 499-514 (submitted from fatman) waiting.
> The jgalford effective priority is 3.66.
>
> So, if I understand the way the negotiation process works, the waiting
> jobs should be sorted so that the jgalford job 499 (fatman) should be
> the next job chosen when a resource frees up, and that would be followed
> by 500 (fatman), ....
>
> At 10:50, sergey jobs 52800-52808 (crossroads) have finished, and now
> sergey jobs 52809-52904 (crossroads) are running.  No new jgalford
> jobs have started, despite the lower effective priority.
>
> I've included the crossroads log files
> ( http://www.grantgoodyear.org/~grant/condorlogs/) for this time
> period.  I'm not seeing anything in the logs that explains this
> behavior, but I'm hoping somebody else has better insight.
>
> I'm thoroughly confused.
>
> Help?
>
> Thanks,
> Grant Goodyear
> --
> Grant Goodyear
> web: http://www.grantgoodyear.org
> e-mail: grant@xxxxxxxxxxxxxxxxx