Re: [condor-users] Network in Linux-Cluster and MPI
- Date: Mon, 27 Oct 2003 18:06:19 +0100 (CET)
- From: Degi Baatartsogt <baatarts@xxxxxxxxxxxxxxxxx>
- Subject: Re: [condor-users] Network in Linux-Cluster and MPI
On Mon, 27 Oct 2003 marks@xxxxxxxxxxxxxxxxxxxxxxx wrote:
> I think that if all your cluster computers are connected to both networks, it
> would be enough to use Condor with one of them.
Our cluster computers are connected only to the host computer "ipc654", and only
"ipc654" is connected to the outside. So only "ipc654" can contact the
Condor host "isun01".
> You should put the IP of the interface that is connected to the network with
> all the computers. For instance, if you have 192.168.10.* for all your comps,
> you should put, say, 192.168.10.1 for the first, and so on.
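The quoted advice maps to a single setting in each node's local condor_config. A minimal sketch (the address 192.168.10.1 is illustrative; substitute the IP of the NIC that the rest of the pool can actually reach):

```
# Local condor_config on an execute node (address is illustrative).
# Bind Condor's daemons to the interface the other pool members can
# reach, rather than the wildcard 0.0.0.0.
NETWORK_INTERFACE = 192.168.10.1
```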
I reconfigured it as you suggested, and now I can see that they
communicate with each other. That is, with the command condor_status I get
the following information.
Name           OpSys      Arch   State      Activity  LoadAv  Mem  ActvtyTime
anne           LINUX      INTEL  Owner      Idle       0.000  501  0+00:10:11
bine           LINUX      INTEL  Owner      Idle       0.000  501  0+00:10:11
carmen         LINUX      INTEL  Owner      Idle       0.000  501  0+00:10:11
dana           LINUX      INTEL  Owner      Idle       0.000  501  0+00:10:10
emma           LINUX      INTEL  Owner      Idle       0.000  501  0+00:10:11
franzi         LINUX      INTEL  Owner      Idle       0.060  501  0+00:10:11
grace          LINUX      INTEL  Owner      Idle       0.000  501  0+00:10:11
vm1@xxxxxxxxx  LINUX      INTEL  Owner      Idle       0.070  503  0+00:15:09
vm2@xxxxxxxxx  LINUX      INTEL  Unclaimed  Idle       0.000  503  0+00:15:05
vm1@xxxxxxxxx  SOLARIS28  SUN4u  Owner      Idle       0.000  512  0+00:40:07
vm2@xxxxxxxxx  SOLARIS28  SUN4u  Unclaimed  Idle       0.000  512  0+00:40:05
vm3@xxxxxxxxx  SOLARIS28  SUN4u  Unclaimed  Idle       0.000  512  0+00:40:06
vm4@xxxxxxxxx  SOLARIS28  SUN4u  Unclaimed  Idle       0.000  512  0+00:40:07
vm5@xxxxxxxxx  SOLARIS28  SUN4u  Unclaimed  Idle       0.000  512  0+00:40:08
vm6@xxxxxxxxx  SOLARIS28  SUN4u  Unclaimed  Idle       0.000  512  0+00:40:09
isun25         SOLARIS28  SUN4u  Unclaimed  Idle       0.086   64  0+00:49:54
isun26         SOLARIS28  SUN4u  Unclaimed  Idle       0.008   64  0+00:50:04
isun28         SOLARIS28  SUN4u  Unclaimed  Idle       0.000   64  0+01:50:05
isun35         SOLARIS28  SUN4u  Unclaimed  Idle       0.000  128  0+03:40:05
isun09         SOLARIS28  SUN4x  Unclaimed  Idle       0.008   64  0+00:49:02
isun22         SOLARIS28  SUN4x  Unclaimed  Idle       0.016   64  0+01:35:04
isun23         SOLARIS28  SUN4x  Unclaimed  Idle       0.004   64  0+01:50:04

                 Machines  Owner  Claimed  Unclaimed  Matched  Preempting
    INTEL/LINUX         9      8        0          1        0           0
SUN4u/SOLARIS28        10      1        0          9        0           0
SUN4x/SOLARIS28         3      0        0          3        0           0
          Total        22      9        0         13        0           0
Now I am trying to execute jobs, but the jobs run only on the machine
where they were submitted, not on a remote machine. Do you know what the
problem is? The following is the submit file submitted on "isun01" for a
remote machine. I have both executables on "isun01".
-----------------------------------------------------------
################
#
# Condor submit file for simple test job example
#
################
Universe = vanilla
Executable = hello.$$(OpSys).$$(Arch)
Requirements = (Arch == "INTEL" && OpSys == "LINUX")
transfer_files = ALWAYS
input = /dev/null
output = het.out
error = het.error
log = het.log
Queue
-----------------------------------------------------------
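As an aside, the spelling of the file-transfer submit commands changed across Condor 6.x releases. A hedged sketch of the newer equivalent of the `transfer_files` line above (check the manual for your installed version before using it):

```
# Newer Condor 6.x submit syntax for file transfer (version-dependent):
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
```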
Log files on "isun01" after executing the job 78.0 on "isun01"
-------------------------------------------------------------
==> condor/hosts/isun01/log/NegotiatorLog <==
10/27 17:55:46 Connect failed for 10 seconds; returning FALSE
10/27 17:55:46 Failed to connect to <0.0.0.0:33493>
10/27 17:55:46 Error: Ignoring schedd for this cycle
10/27 17:55:46 Negotiating with baatarts@xxxxxxxxxxxxxxx at <141.35.14.22:55627>
10/27 17:55:46 Request 00078.00000:
10/27 17:55:46 Matched 78.0 baatarts@xxxxxxxxxxxxxxx <141.35.14.22:55627> preempting none <0.0.0.0:33497>
10/27 17:55:46 Successfully matched with dana
10/27 17:55:46 Got NO_MORE_JOBS; done negotiating
10/27 17:55:46 ---------- Finished Negotiation Cycle ----------
==> condor/hosts/isun01/log/SchedLog <==
10/27 17:55:46 Activity on stashed negotiator socket
10/27 17:55:46 Negotiating for owner: baatarts@xxxxxxxxxxxxxxx
10/27 17:55:46 Checking consistency running and runnable jobs
10/27 17:55:46 Tables are consistent
10/27 17:55:46 Out of jobs - 1 jobs matched, 0 jobs idle, flock level = 0
10/27 17:55:46 Sent ad to central manager for baatarts@xxxxxxxxxxxxxxx
10/27 17:55:46 Can't connect to <0.0.0.0:33497>:0, errno = 146
10/27 17:55:46 Will keep trying for 10 seconds...
10/27 17:55:56 Connect failed for 10 seconds; returning FALSE
10/27 17:55:56 Couldn't send REQUEST_CLAIM to startd at <0.0.0.0:33497>
10/27 17:55:56 Sent RELEASE_CLAIM to startd on <0.0.0.0:33497>
10/27 17:55:56 Match record (<0.0.0.0:33497>, 78, 0) deleted
==> condor/hosts/isun01/log/MatchLog <==
10/27 17:55:46 Matched 78.0 baatarts@xxxxxxxxxxxxxxx
<141.35.14.22:55627> preempting none <0.0.0.0:33497>
==> condor/hosts/isun01/log/CollectorLog <==
10/27 17:55:54 (Sent 59 ads in response to query)
10/27 17:55:54 DaemonCore: PERMISSION DENIED to unknown user from host
<141.35.14.189:34481> for command 10 (QUERY_STARTD_PVT_ADS)
Log file on "ipc654" after executing the job 78.0 on "isun01"
-------------------------------------------------------------
10/27 17:50:54 ---------- Started Negotiation Cycle ----------
10/27 17:50:54 Phase 1: Obtaining ads from collector ...
10/27 17:50:54 Getting all public ads ...
10/27 17:50:54 Sorting 56 ads ...
10/27 17:50:54 Getting startd private ads ...
10/27 17:50:54 Couldn't fetch ads: communication error
10/27 17:50:54 Aborting negotiation cycle
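The common thread in the logs above is the `<0.0.0.0:port>` addresses: the schedd is being told to contact startds at the wildcard address, which is what daemons advertise when NETWORK_INTERFACE is set to 0.0.0.0. A small standalone sketch (not a Condor tool; the sample lines are taken from the logs above) for spotting such addresses in a daemon log:

```python
import re

def find_wildcard_addrs(log_text):
    """Return the distinct 'sinful strings' like <0.0.0.0:33497> in a log.

    These usually mean a daemon advertised the wildcard address
    (e.g. NETWORK_INTERFACE = 0.0.0.0) and is unreachable from peers.
    """
    return sorted(set(re.findall(r"<0\.0\.0\.0:\d+>", log_text)))

sample = """\
10/27 17:55:46 Failed to connect to <0.0.0.0:33493>
10/27 17:55:46 Matched 78.0 ... preempting none <0.0.0.0:33497>
10/27 17:55:56 Couldn't send REQUEST_CLAIM to startd at <0.0.0.0:33497>
"""
print(find_wildcard_addrs(sample))  # ['<0.0.0.0:33493>', '<0.0.0.0:33497>']
```

If this list is non-empty for a pool's SchedLog or NegotiatorLog, checking NETWORK_INTERFACE on the machines that own those ports is a reasonable first step.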
> If you have two NON-interconnected networks of SUN and LINUX computers, you
> should set up a gateway as a router, which would forward packets from SUN to
> Linux and back in a transparent manner (from the application's point of view),
> and afterwards set up Condor to be on that network, as specified above.
> Mark
>
> Quoting Degi Baatartsogt <baatarts@xxxxxxxxxxxxxxxxx>:
>
> >
> > Hi Mark,
> >
> > my problem is that we have a Linux cluster (Beowulf) here, so our Linux
> > host has two interfaces. That's why I'm trying to use NETWORK_INTERFACE. I'm
> > not sure what kind of address I should use, but I tried all possibilities.
> > As I understand it, we can't solve this problem until we get the source
> > code. Is that right?
> >
> > On 23 Oct 2003, Mark Silberstein wrote:
> >
> > > Well, I would not mix these two things.
> > > Why do you use 0.0.0.0 settings for NETWORK_INTERFACE? If you have Linux
> > > and SUN pools connected in any way via network, you should not need to
> > > configure Condor to listen on more than one NW interface. Can you be
> > > more specific about your network topology to understand this?
> > > I expect that you would get the same communication problem for whatever
> > > job you run, since ALL Condor communications would not work with
> > > NETWORK_INTERFACE parameter set to 0.0.0.0
> > >
> > >
> > > On Sun, 2003-10-19 at 17:10, Degi Baatartsogt wrote:
> > > > Hi Mark,
> > > >
> > > > thank you for your response!
> > > >
> > > > > Sorry, from our experience this won't work. Condor can't really listen
> > > > > on more than one NW interface, at least we did not succeed. If someone
> > > > > from the team knows the answer, please share it with us!
> > > > > Mark
> > > >
> > > > Does it mean that MPI Condor jobs wouldn't work on the cluster? Because I
> > > > also get the same communication problem if I submit an MPI (MPICH) job on
> > > > Condor in our cluster.
> > > >
> > > > Degi
> > > >
> > > > > On Wed, 2003-10-15 at 14:58, Degi Baatartsogt wrote:
> > > > > > Hello everybody,
> > > > > >
> > > > > > I'm trying to use flocking between the Sun pool and the Linux pool. For
> > > > > > that I changed the flocking parameters in both directions and set
> > > > > > NETWORK_INTERFACE to 0.0.0.0 in the global config file. Now I get the
> > > > > > following messages in the log files. Does anybody know what I should do?
> > > > > >
> > > > > > Thank you
> > > > > > Ms Baatartsogt
> > > > > >
> > > > > > ==> SchedLog <==
> > > > > > 10/15 12:37:59 DaemonCore: Command received via UDP from host <127.0.0.1:yyyyy>
> > > > > > 10/15 12:37:59 DaemonCore: received command 421 (RESCHEDULE), calling handler (reschedule_negotiator)
> > > > > > 10/15 12:37:59 Sent ad to central manager for condor@xxxxxxxxxxxxxxxxxx
> > > > > > 10/15 12:37:59 Called reschedule_negotiator()
> > > > > > 10/15 12:37:59 DaemonCore: PERMISSION DENIED to unknown user from host <127.0.0.1:xxxxx> for command 416 (NEGOTIATE)
> > > > > >
> > > > > > ==> CollectorLog <==
> > > > > > 10/15 12:38:05 DC_AUTHENTICATE: attempt to open invalid session ipc654:15713:1066213385:334, failing.
> > > > > > 10/15 12:38:12 DC_AUTHENTICATE: attempt to open invalid session ipc654:15713:1066213692:349, failing.
> > > > > > 10/15 12:38:17 DC_AUTHENTICATE: attempt to open invalid session ipc654:15713:106...
>
--------------------------------------
| Baatartsogt, O |
| University of Jena, Germany |
--------------------------------------
Condor Support Information:
http://www.cs.wisc.edu/condor/condor-support/
To Unsubscribe, send mail to majordomo@xxxxxxxxxxx with
unsubscribe condor-users <your_email_address>