[HTCondor-users] Unnecessary flocking
- Date: Fri, 14 Dec 2012 16:14:33 +0100
- From: Alexey Smirnov <smirnalex@xxxxxxxxx>
- Subject: [HTCondor-users] Unnecessary flocking
Hi all,
We have two condor pools in our institute. Pool A consists of machines dedicated to calculations only; Pool B consists of staff desktops. All machines are configured with partitionable slots.
After we configured flocking from Pool A to Pool B, I see some strange (or at least undesired) behavior from condor.
Pool A has no tasks running. A user submits 40 jobs. The pool has 10
8-core computers with dynamic slots. Each job requests 2 cores, so all
of these jobs fit exactly into the pool. However, only the first 10 jobs
start running on Pool A; a portion of the jobs then immediately flock to
Pool B, and a minute later the rest of the jobs also start on Pool A.
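The "exactly fit" claim can be checked with a few lines of arithmetic. This is only a sketch of the numbers given above; the variable names are illustrative and are not HTCondor configuration settings.

```python
# Figures taken from the description above; names are illustrative only.
machines = 10
cores_per_machine = 8
jobs = 40
cores_per_job = 2

total_cores = machines * cores_per_machine          # capacity of Pool A
requested_cores = jobs * cores_per_job              # demand of the submission
jobs_per_machine = cores_per_machine // cores_per_job

print(total_cores, requested_cores, jobs_per_machine)  # prints: 80 80 4
```

So the demand equals the capacity, at 4 jobs per machine.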
The question is: why do jobs start to flock when the local pool still has plenty of available resources?
I would expect flocking to start only after all local resources are exhausted (especially if they are dedicated!).
Thanks a lot for any help on how to prevent such early flocking!
Alexey
=====
The negotiator log for Pool A is below:
12/13/12 09:30:54 ---------- Started Negotiation Cycle ----------
12/13/12 09:30:54 Phase 1: Obtaining ads from collector ...
12/13/12 09:30:54 Getting Scheduler, Submitter and Machine ads ...
12/13/12 09:30:54 Trying to query collector
[...]
12/13/12 09:30:54 Sorting 13 ads ...
12/13/12 09:30:54 Getting startd private ads ...
12/13/12 09:30:54 Got ads: 13 public and 10 private
12/13/12 09:30:54 Public ads include 1 submitter, 10 startd
12/13/12 09:30:54 Phase 2: Performing accounting ...
12/13/12 09:30:54 Phase 3: Sorting submitter ads by priority ...
12/13/12 09:30:54 Phase 4.1: Negotiating with schedds ...
12/13/12 09:30:54 numSlots = 10
12/13/12 09:30:54 slotWeightTotal = 80.000000
12/13/12 09:30:54 pieLeft = 80.000
12/13/12 09:30:54 NormalFactor = 1.000000
12/13/12 09:30:54 MaxPrioValue = 1.782045
12/13/12 09:30:54 NumSubmitterAds = 1
12/13/12 09:30:54 Negotiating with [...]
12/13/12 09:30:54 Calculating submitter limit with the following parameters
12/13/12 09:30:54 SubmitterPrio = 1.782045
12/13/12 09:30:54 SubmitterPrioFactor = 1.000000
12/13/12 09:30:54 submitterShare = 1.000000
12/13/12 09:30:54 submitterAbsShare = 1.000000
12/13/12 09:30:54 submitterLimit = 80.000000
12/13/12 09:30:54 submitterUsage = 0.000000
12/13/12 09:30:54 Socket to [...] already in cache, reusing
12/13/12 09:30:54 Got JOB_INFO command; getting classad/eom
12/13/12 09:30:54 Request 00895.00000:
12/13/12 09:30:54 matchmakingAlgorithm: limit 80.000000 used 0.000000 pieLeft 80.000000
12/13/12 09:30:54 Start of sorting MatchList (len=10)
12/13/12 09:30:54 Finished sorting MatchList
12/13/12 09:30:54 Matched 895.0 [...] preempting none slot1@v160.[...]
12/13/12 09:30:54 Successfully matched with slot1@v160.[...]
12/13/12 09:30:54 Sending SEND_JOB_INFO/eom
[ 8 more jobs matched here]
12/13/12 09:30:54 Got JOB_INFO command; getting classad/eom
12/13/12 09:30:54 Request 00895.00009:
12/13/12 09:30:54 matchmakingAlgorithm: limit 80.000000 used 72.000000 pieLeft 8.000000
12/13/12 09:30:54 Attempting to use cached MatchList: Succeeded. [...]
12/13/12 09:30:54 Matched 895.9 [...] preempting none slot1@v163.[...]
12/13/12 09:30:54 Notifying the accountant
12/13/12 09:30:54 Successfully matched with slot1@v163.[...]
12/13/12 09:30:54 Over submitter resource limit (80.000000, used 80.000000) ... only consider startd ranks
12/13/12 09:30:54 Sending SEND_JOB_INFO/eom
[so condor believes that we don't have any resources left...]
12/13/12 09:30:54 Request 00895.00010:
12/13/12 09:30:54 matchmakingAlgorithm: limit 80.000000 used 80.000000 pieLeft 0.000000
12/13/12 09:30:54 Rejected 895.10 [...]: no match found
12/13/12 09:30:54 Sending SEND_JOB_INFO/eom
12/13/12 09:30:54 Getting reply from schedd ...
12/13/12 09:30:54 Got NO_MORE_JOBS; done negotiating
12/13/12 09:30:54 This submitter hit its submitterLimit.
12/13/12 09:30:54 resources used scheddUsed= 80.000000
12/13/12 09:30:54 negotiateWithGroup resources used scheddAds length 1
12/13/12 09:30:54 ---------- Finished Negotiation Cycle ----------
[at this time 39 jobs flock into a different pool]
12/13/12 09:31:54 ---------- Started Negotiation Cycle ----------
12/13/12 09:31:54 Phase 1: Obtaining ads from collector ...
12/13/12 09:31:54 Getting Scheduler, Submitter and Machine ads ...
12/13/12 09:31:55 Sorting 23 ads ...
12/13/12 09:31:55 Getting startd private ads ...
12/13/12 09:31:55 Got ads: 23 public and 20 private
12/13/12 09:31:55 Public ads include 1 submitter, 20 startd
12/13/12 09:31:55 Phase 2: Performing accounting ...
12/13/12 09:31:55 Phase 3: Sorting submitter ads by priority ...
12/13/12 09:31:55 Phase 4.1: Negotiating with schedds ...
12/13/12 09:31:55 numSlots = 20
12/13/12 09:31:55 slotWeightTotal = 80.000000
12/13/12 09:31:55 pieLeft = 60.000
12/13/12 09:31:55 NormalFactor = 1.000000
12/13/12 09:31:55 MaxPrioValue = 1.820312
12/13/12 09:31:55 NumSubmitterAds = 1
12/13/12 09:31:55 Negotiating with [...]
12/13/12 09:31:55 0 seconds so far
12/13/12 09:31:55 Calculating submitter limit with the following parameters
12/13/12 09:31:55 SubmitterPrio = 1.820312
12/13/12 09:31:55 SubmitterPrioFactor = 1.000000
12/13/12 09:31:55 submitterShare = 1.000000
12/13/12 09:31:55 submitterAbsShare = 1.000000
12/13/12 09:31:55 submitterLimit = 60.000000
12/13/12 09:31:55 submitterUsage = 20.000000
12/13/12 09:31:55 Socket to [...] already in cache, reusing
[Well, I don't understand condor! A minute ago there were no resources left, but now there are!]
[Is it because of the partitionable slots?]
12/13/12 09:31:55 Got JOB_INFO command; getting classad/eom
12/13/12 09:31:55 Request 00895.00039:
12/13/12 09:31:55 matchmakingAlgorithm: limit 60.000000 used 0.000000 pieLeft 60.000000
12/13/12 09:31:55 Start of sorting MatchList (len=10)
12/13/12 09:31:55 Finished sorting MatchList
12/13/12 09:31:55 Matched 895.39 [...] preempting none slot1@v160.[...]
12/13/12 09:31:55 Notifying the accountant
12/13/12 09:31:55 Successfully matched with slot1@v160.[...]
12/13/12 09:31:55 Sending SEND_JOB_INFO/eom
12/13/12 09:31:55 Got NO_MORE_JOBS; done negotiating
12/13/12 09:31:55 Submitter [...] got all it wants; removing it.
12/13/12 09:31:55 resources used by [...] are 26.000000
12/13/12 09:31:55 resources used scheddUsed= 26.000000
12/13/12 09:31:55 negotiateWithGroup resources used scheddAds length 0
12/13/12 09:31:55 ---------- Finished Negotiation Cycle ----------
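
One possible reading of the first cycle, inferred only from the numbers in the log and not from HTCondor's source: each match appears to be charged 8.0, the weight of a whole 8-core machine, rather than the 2 cores the job requests. The sketch below parses two `matchmakingAlgorithm` lines copied from the log and works out that charge. The count of 9 matches between them is taken from the log annotations (895.0 plus "8 more jobs"), and the regular expression and variable names are my own.

```python
import re

# Two lines copied from the first negotiation cycle in the log above.
log = """\
12/13/12 09:30:54 matchmakingAlgorithm: limit 80.000000 used 0.000000 pieLeft 80.000000
12/13/12 09:30:54 matchmakingAlgorithm: limit 80.000000 used 72.000000 pieLeft 8.000000
"""

pattern = re.compile(r"limit (\d+\.\d+) used (\d+\.\d+) pieLeft (\d+\.\d+)")
rows = [tuple(float(x) for x in m.groups()) for m in pattern.finditer(log)]

limit = rows[0][0]
used_before_first = rows[0][1]  # usage before job 895.0 is matched
used_before_tenth = rows[1][1]  # usage before job 895.9, i.e. after 9 matches

# Assumption: the charge is the same for every match in this cycle.
charge_per_match = (used_before_tenth - used_before_first) / 9
matches_until_limit = int(limit // charge_per_match)

# Hypothetical comparison: what if each match were charged only the
# 2 cores that the job requests?
matches_if_charged_by_request = int(limit // 2)

print(charge_per_match, matches_until_limit, matches_if_charged_by_request)
# prints: 8.0 10 40
```

Under that reading, the submitter limit of 80 is reached after 10 matches, which is consistent with the "Over submitter resource limit (80.000000, used 80.000000)" line, while all 40 jobs would fit if usage were counted in requested cores.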