[Condor-users] negotiating with schedds when a client has FW
- Date: Fri, 17 Jun 2005 16:10:44 +0200
- From: Thomas Lisson <lisson@xxxxxxxxxxxxxxxxx>
- Subject: [Condor-users] negotiating with schedds when a client has FW
Hello,
$CondorVersion: 6.7.6 Mar 15 2005 $
$CondorPlatform: I386-LINUX_RH9 $
I just wondered why my machines weren't being claimed even though they were unclaimed
and met all the requirements.
                Total Owner Claimed Unclaimed Matched Preempting
IA64/LINUX         24    24       0         0       0          0
INTEL/LINUX        60     6       1        53       0          0
INTEL/WINNT50       2     0       0         2       0          0
INTEL/WINNT51     163     0       2       161       0          0
x86_64/LINUX        1     1       0         0       0          0
Total             250    31       3       216       0          0
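(For context, a pool summary like the one above comes from condor_status; a hedged sketch of the invocation, assuming the -total flag behaves as in 6.7.x:)

```
# prints per-platform claimed/unclaimed totals for the pool
condor_status -total
```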
6907.002: Run analysis summary. Of 250 machines,
25 are rejected by your job's requirements
6 reject your job because of their own requirements
3 match but are serving users with a better priority in the pool
216 match but reject the job for unknown reasons
0 match but will not currently preempt their existing job
0 are available to run your job
[...]
6907.019: Run analysis summary. Of 250 machines,
25 are rejected by your job's requirements
6 reject your job because of their own requirements
3 match but are serving users with a better priority in the pool
216 match but reject the job for unknown reasons
0 match but will not currently preempt their existing job
0 are available to run your job
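(The per-job analysis above is the output of condor_q's analyze mode; a hedged sketch of the invocation, assuming the 6.7.x flag name:)

```
# explain why jobs in cluster 6907 are not matching
condor_q -analyze 6907
```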
I took a look at the Negotiator log:
6/16 13:54:33 ---------- Started Negotiation Cycle ----------
6/16 13:54:33 Phase 1: Obtaining ads from collector ...
6/16 13:54:33 Getting all public ads ...
6/16 13:54:33 Sorting 366 ads ...
6/16 13:54:33 Getting startd private ads ...
6/16 13:54:33 Got ads: 366 public and 250 private
6/16 13:54:33 Public ads include 1 submitter, 250 startd
6/16 13:54:33 Phase 2: Performing accounting ...
6/16 13:54:33 Phase 3: Sorting submitter ads by priority ...
6/16 13:54:33 Phase 4.1: Negotiating with schedds ...
6/16 13:54:33 Negotiating with nobody@*** at <***.130.4.77:9601>
6/16 13:54:33 Request 06907.00000:
6/16 13:54:33 Matched 6907.0 nobody@*** <***.130.4.77:9601>
preempting none <***.130.71.149:9620>
6/16 13:54:33 Successfully matched with vm1@pc49.***
6/16 13:54:33 Request 06907.00001:
6/16 13:54:33 Matched 6907.1 nobody@*** <***.130.4.77:9601>
preempting none <***.130.71.149:9620>
6/16 13:54:33 Successfully matched with vm2@pc49.***
6/16 13:54:33 Request 06907.00002:
6/16 13:57:42 Can't connect to <***.130.71.139:10066>:0, errno = 110
6/16 13:57:42 Will keep trying for 10 seconds...
6/16 13:57:43 Connect failed for 10 seconds; returning FALSE
6/16 13:57:43 ERROR: SECMAN:2003:TCP connection to
<***.130.71.139:10066> failed
6/16 13:57:43 condor_write(): Socket closed when trying to write buffer
6/16 13:57:43 Buf::write(): condor_write() failed
6/16 13:57:43 Could not send PERMISSION
6/16 13:57:43 Error: Ignoring schedd for this cycle
6/16 13:57:43 ---------- Finished Negotiation Cycle ----------
I checked ***.130.71.139 and noticed that the machine had a
dysfunctional network service: all incoming requests were blocked, although the
machine (Windows XP) reported that its firewall was off.
OK, let's assume ***.130.71.139 blocks all incoming traffic, but why
weren't all the other jobs (6907.002-6907.019) serviced in the same cycle?
The job cluster (6907) did finish after a while, but some entries in the
NegotiatorLog and MatchLog for it were incomplete. Some processes of that
cluster were serviced but never logged; maybe a bug.
My jobs have rank = kflops in their submit files. The machine
***.130.71.139 is one of the fastest (4th), so Condor tried to claim it
first in every negotiation cycle, because the three fastest machines were
already claimed. But that machine blocked all traffic, so Condor stopped
matchmaking and never looked at the next free machine. As a result, my
whole cluster was serviced only by my three fastest machines, out of a
pool with 216 other machines that matched and had nothing to do. That
took a long time ;)
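(For reference, the rank expression sits in the submit description; a minimal sketch, where the executable name is a placeholder and KFlops is the startd's benchmark attribute:)

```
# minimal submit description; "my_job" is a placeholder
executable = my_job
rank      = KFlops
queue
```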
Suggestion: if Condor can't connect to a machine, it should claim the
next-best free machine for the job instead of aborting the negotiation
cycle. Otherwise, network problems on a single machine can have a big
negative effect on the whole Condor pool.
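The suggested behavior can be sketched in Python. This is only an illustration of the idea, not Condor's actual negotiator code; all names here (match_job, try_claim) are hypothetical:

```python
def match_job(machines, try_claim):
    """Claim the best-ranked reachable machine for one job.

    machines:  list of (name, rank) tuples, e.g. rank = advertised KFlops.
    try_claim: callable that raises ConnectionError when the startd is
               unreachable (e.g. hidden behind a firewall).
    """
    # Walk machines from best rank to worst, just as the negotiator
    # would order candidates by the job's rank expression.
    for name, _rank in sorted(machines, key=lambda m: m[1], reverse=True):
        try:
            try_claim(name)
            return name       # claimed successfully: stop searching
        except ConnectionError:
            continue          # unreachable: fall through to the next-best machine
    return None               # no reachable machine matched this cycle
```

The key difference from the observed 6.7.6 behavior is the `continue`: a connection failure skips only that one machine rather than ending the whole cycle.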
regards
Thomas Lisson
NRW-Grid