[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[Condor-users] Job remains in idle (worked until I increased pool size)
- Date: Wed, 9 Feb 2011 12:15:26 -0700
- From: "Michael O'Donnell" <odonnellm@xxxxxxxx>
- Subject: [Condor-users] Job remains in idle (worked until I increased pool size)
I have a pool of 200 cores with various
OS for Windows machines. Yesterday, I expanded the pool from 100 cores
to 200 cores and since than any jobs that I submit remain in the idle state.
I have evaluated the SchedLod and NegotiatorLog and there is no indication
that a problem exists. I have tried to remove machines from the pool, but
the problem persists. Does anyone have any ideas as to what the problem
might be. I am testing a job that I know will work (ran on friday)?
thank you,
Mike
condor_q -better-analyze:
005.000: Run analysis summary.
Of 203 machines,
203 are rejected by your
job's requirements
0 reject your job
because of their own requirements
0 match but are
serving users with a better priority in the pool
0 match but reject
the job for unknown reasons
0 match but will
not currently preempt their existing job
0 match but are
currently offline
0 are available
to run your job
No successful
match recorded.
Last failed
match: Wed Feb 09 11:58:49 2011
Reason for
last match failure: no match found
WARNING: Be advised:
No resources matched request's
constraints
The Requirements _expression_ for your
job is:
( ( target.OpSys == "WINNT51"
|| target.OpSys == "WINNT61" || target.OpSys == "WINNT60"
) ) &&
( target.Arch == "INTEL" )
&& ( target.Disk >= DiskUsage ) &&
( ( ( target.Memory * 1024 ) >= ImageSize
) &&
( ( RequestMemory * 1024 ) >= ImageSize
) ) && ( target.HasFileTransfer ) &&
( target.HasWindowsRunAsOwner &&
( target.LocalCredd is "CM"
) )
Condition
Machines
Matched Suggestion
---------
----------------
----------
1 ( ( ( 1024 * target.Memory
) >= 4500 ) && ( ( 1024 * ceiling(ifThenElse(JobVMMemory isnt
undefined,JobVMMemory,4.394
531250000000E+000)) ) >= 4500 ) )
0
REMOVE
2 ( target.HasWindowsRunAsOwner
&& ( target.LocalCredd is "CM" ) )
0
REMOVE
3 ( ( target.OpSys == "WINNT51"
|| target.OpSys == "WINNT61" || target.OpSys == "WINNT60"
) )
203
4 ( target.Arch == "INTEL"
) 203
5 ( target.Disk >= 4500 )
203
6 ( target.HasFileTransfer )
203
An excerpt from my SchedLog:
02/09 10:56:18 (pid:2844) SECMAN: startCommand
succeeded.
02/09 10:56:18 (pid:2844) Authorizing
server '*/IP'.
02/09 10:56:18 (pid:2844) SEND [1000]
<IP> <IP>
02/09 10:56:18 (pid:2844) SEND [1000]
<IP> <IP>
02/09 10:56:18 (pid:2844) SEND [199]
<IP> <IP>
02/09 10:56:18 (pid:2844) Sent ad to
1 collectors for odonnellm@xxxxxxxxxx
02/09 10:56:18 (pid:2844) ============
Begin clean_shadow_recs =============
02/09 10:56:18 (pid:2844) ============
End clean_shadow_recs =============
02/09 10:56:18 (pid:2844) Return from
Timer handler 7 (timeout)
02/09 10:56:25 (pid:2844) Calling Handler
<<Negotiator Command>> (4)
02/09 10:56:25 (pid:2844) Activity on
stashed negotiator socket
02/09 10:56:25 (pid:2844) condor_read(fd=1228
<IP>,,size=21,timeout=20,flags=0)
02/09 10:56:25 (pid:2844) condor_read():
fd=1228
02/09 10:56:25 (pid:2844) condor_read():
select returned 1
02/09 10:56:25 (pid:2844) condor_read(fd=1228
<IP>,,size=89,timeout=20,flags=0)
02/09 10:56:25 (pid:2844) condor_read():
fd=1228
02/09 10:56:25 (pid:2844) condor_read():
select returned 1
02/09 10:56:25 (pid:2844)
02/09 10:56:25 (pid:2844) Entered negotiate
02/09 10:56:25 (pid:2844) ## 2. Negotiating
with CM
02/09 10:56:25 (pid:2844) *** SwapSpace
= 2147483647
02/09 10:56:25 (pid:2844) *** ReservedSwap
= 0
02/09 10:56:25 (pid:2844) *** Shadow
Size Estimate = 800
02/09 10:56:25 (pid:2844) *** Start
Limit For Swap = 2684354
02/09 10:56:25 (pid:2844) *** Current
num of active shadows = 0
02/09 10:56:25 (pid:2844) Negotiating
for owner: odonnellm@xxxxxxxxxx
02/09 10:56:25 (pid:2844) AutoCluster:config(JobUniverse,LastCheckpointPlatform,NumCkpts)
invoked
02/09 10:56:25 (pid:2844) Reusing prioritized
runnable job list because nothing has changed.
02/09 10:56:25 (pid:2844) Job 4.0: is
runnable
02/09 10:56:25 (pid:2844) condor_read(fd=1228
<IP>,,size=21,timeout=20,flags=0)
02/09 10:56:25 (pid:2844) condor_read():
fd=1228
02/09 10:56:25 (pid:2844) condor_read():
select returned 1
02/09 10:56:25 (pid:2844) condor_read(fd=1228
<IP>,,size=8,timeout=20,flags=0)
02/09 10:56:25 (pid:2844) condor_read():
fd=1228
02/09 10:56:25 (pid:2844) condor_read():
select returned 1
02/09 10:56:25 (pid:2844) condor_write(fd=1228
<IP2>,,size=4096,timeout=20,flags=0)
02/09 10:56:25 (pid:2844) condor_write(fd=1228
<IP>,,size=2077,timeout=20,flags=0)
02/09 10:56:25 (pid:2844) Sent job 4.0
(autocluster=0)
02/09 10:56:25 (pid:2844) condor_read(fd=1228
<IP>,,size=21,timeout=20,flags=0)
02/09 10:56:25 (pid:2844) condor_read():
fd=1228
02/09 10:56:25 (pid:2844) condor_read():
select returned 1
02/09 10:56:25 (pid:2844) condor_read(fd=1228
<IP>,,size=31,timeout=20,flags=0)
02/09 10:56:25 (pid:2844) condor_read():
fd=1228
02/09 10:56:25 (pid:2844) condor_read():
select returned 1
02/09 10:56:25 (pid:2844) Job 4.0 rejected:
no match found
02/09 10:56:25 (pid:2844) condor_read(fd=1228
<IP>,,size=21,timeout=20,flags=0)
02/09 10:56:25 (pid:2844) condor_read():
fd=1228
02/09 10:56:25 (pid:2844) condor_read():
select returned 1
02/09 10:56:25 (pid:2844) condor_read(fd=1228
<IP>,,size=8,timeout=20,flags=0)
02/09 10:56:25 (pid:2844) condor_read():
fd=1228
02/09 10:56:25 (pid:2844) condor_read():
select returned 1
02/09 10:56:25 (pid:2844) condor_write(fd=1228
<IP>,,size=29,timeout=20,flags=0)
02/09 10:56:25 (pid:2844) Out of servers
- 0 jobs matched, 1 jobs idle, 1 jobs rejected
02/09 10:56:25 (pid:2844) Return from
Handler <<Negotiator Command>>
02/09 10:56:30 (pid:2844) Calling Timer
handler 316 (HandleFlushJobQueueLogTimer)
02/09 10:56:30 (pid:2844) Return from
Timer handler 316 (HandleFlushJobQueueLogTimer)
02/09 10:56:31 (pid:2844) ACCEPT from=<IP>
newfd=1276 to=<IP>
02/09 10:56:31 (pid:2844) condor_read(fd=1276
<IP>,,size=4,timeout=1,flags=2)
02/09 10:56:31 (pid:2844) condor_read():
fd=1276
02/09 10:56:31 (pid:2844) condor_read():
select returned 1
02/09 10:56:31 (pid:2844) condor_read(fd=1276
<IP>,,size=5,timeout=1,flags=0)
02/09 10:56:31 (pid:2844) condor_read():
fd=1276
02/09 10:56:31 (pid:2844) condor_read():
select returned 1
02/09 10:56:31 (pid:2844) condor_read(fd=1276
<IP>,,size=391,timeout=1,flags=0)
02/09 10:56:31 (pid:2844) condor_read():
fd=1276
02/09 10:56:31 (pid:2844) condor_read():
select returned 1
02/09 10:56:31 (pid:2844) DC_AUTHENTICATE:
received DC_AUTHENTICATE from <159.189.162.84:2749>
02/09 10:56:31 (pid:2844) DC_AUTHENTICATE:
received following ClassAd:
An excerpt from my NegotiatorLog:
02/09 12:08:43 condor_read(fd=596 collector
at <IP>,,size=3012,timeout=60,flags=0)
02/09 12:08:43 condor_read(): fd=596
02/09 12:08:43 condor_read(): select
returned 1
02/09 12:08:43 CLOSE <159.189.160.37:62623>
fd=596
02/09 12:08:43 Got ads: 292 public and
202 private
02/09 12:08:43 Public ads include 1
submitter, 202 startd
02/09 12:08:43 Entering compute_significant_attrs()
02/09 12:08:43 Leaving compute_significant_attrs()
- result=JobUniverse,LastCheckpointPlatform,NumCkpts
02/09 12:08:43 Phase 2: Performing
accounting ...
02/09 12:08:43 Phase 3: Sorting
submitter ads by priority ...
02/09 12:08:43 Phase 4.1: Negotiating
with schedds ...
02/09 12:08:43 numSlots
= 202
02/09 12:08:43 slotWeightTotal
= 202.000000
02/09 12:08:43 pieLeft
= 202.000
02/09 12:08:43 NormalFactor
= 1.000000
02/09 12:08:43 MaxPrioValue
= 0.500000
02/09 12:08:43 NumSubmitterAds
= 1
02/09 12:08:43 Negotiating with
odonnellm@xxxxxxxxxx at <IP>
02/09 12:08:43 0 seconds so far
02/09 12:08:43 Calculating submitter
limit with the following parameters
02/09 12:08:43 SubmitterPrio
= 0.500000
02/09 12:08:43 SubmitterPrioFactor
= 1.000000
02/09 12:08:43 submitterShare
= 1.000000
02/09 12:08:43 submitterAbsShare
= 1.000000
02/09 12:08:43 submitterLimit
= 202.000000
02/09 12:08:43 submitterUsage
= 0.000000
02/09 12:08:43 Socket to odonnellm@xxxxxxxxxx
(<IP>) already in cache, reusing
02/09 12:08:43 condor_write(fd=604 schedd
odonnellm@xxxxxxxxxx,,size=110,timeout=30,flags=0)
02/09 12:08:43 Sending
SEND_JOB_INFO/eom
02/09 12:08:43 condor_write(fd=604 schedd
odonnellm@xxxxxxxxxx,,size=29,timeout=30,flags=0)
02/09 12:08:43 Getting
reply from schedd ...
02/09 12:08:43 condor_read(fd=604 schedd
odonnellm@xxxxxxxxxx,,size=21,timeout=30,flags=0)
02/09 12:08:43 condor_read(): fd=604
02/09 12:08:43 condor_read(): select
returned 1
02/09 12:08:43 condor_read(fd=604 schedd
odonnellm@xxxxxxxxxx,,size=4075,timeout=30,flags=0)
02/09 12:08:43 condor_read(): fd=604
02/09 12:08:43 condor_read(): select
returned 1
02/09 12:08:43 condor_read(fd=604 schedd
odonnellm@xxxxxxxxxx,,size=21,timeout=30,flags=0)
02/09 12:08:43 condor_read(): fd=604
02/09 12:08:43 condor_read(): select
returned 1
02/09 12:08:43 condor_read(fd=604 schedd
odonnellm@xxxxxxxxxx,,size=2056,timeout=30,flags=0)
02/09 12:08:43 condor_read(): fd=604
02/09 12:08:43 condor_read(): select
returned 1
02/09 12:08:43 Got JOB_INFO
command; getting classad/eom
02/09 12:08:43 Request
00005.00000:
02/09 12:08:43 matchmakingAlgorithm:
limit 202.000000 used 0.000000 pieLeft 202.000000
02/09 12:08:43 FileLock::obtain(1) -
@1297278523.987000 lock on C:\Condor\log\negotiator_sql.log now WRITE
02/09 12:08:44 FileLock::obtain(2) -
@1297278524.003000 lock on C:\Condor\log\negotiator_sql.log now UNLOCKED
02/09 12:08:44
Rejected 5.0 odonnellm@xxxxxxxxxx <IP>: no match found
02/09 12:08:44 condor_write(fd=604 schedd
odonnellm@xxxxxxxxxx,,size=52,timeout=30,flags=0)
02/09 12:08:44 Sending
SEND_JOB_INFO/eom
02/09 12:08:44 condor_write(fd=604 schedd
odonnellm@xxxxxxxxxx,,size=29,timeout=30,flags=0)
02/09 12:08:44 Getting
reply from schedd ...
02/09 12:08:44 condor_read(fd=604 schedd
odonnellm@xxxxxxxxxx,,size=21,timeout=30,flags=0)
02/09 12:08:44 condor_read(): fd=604
02/09 12:08:44 condor_read(): select
returned 1
02/09 12:08:44 condor_read(fd=604 schedd
odonnellm@xxxxxxxxxx,,size=8,timeout=30,flags=0)
02/09 12:08:44 condor_read(): fd=604
02/09 12:08:44 condor_read(): select
returned 1
02/09 12:08:44 Got NO_MORE_JOBS;
done negotiating
02/09 12:08:44 Submitter odonnellm@xxxxxxxxxx
got all it wants; removing it.
02/09 12:08:44 ---------- Finished Negotiation
Cycle ----------