Re: [condor-users] Network in Linux-Cluster and MPI
- Date: Mon, 27 Oct 2003 18:06:19 +0100 (CET)
- From: Degi Baatartsogt <baatarts@xxxxxxxxxxxxxxxxx>
- Subject: Re: [condor-users] Network in Linux-Cluster and MPI
On Mon, 27 Oct 2003 marks@xxxxxxxxxxxxxxxxxxxxxxx wrote:
> I think that if all your cluster computers are connected to both networks, it
> would be enough to use Condor with one of them.
Our cluster computers are connected only to the host computer "ipc654", and only
"ipc654" is connected to the outside. So only "ipc654" can contact the
Condor host "isun01".
> You should put the IP of the interface that is connected to the network with
> all the computers. For instance, if you have 192.168.10.* for all your comps,
> you should put, say, 192.168.10.1 for the first, and so on.
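The quoted advice maps to a single setting in each node's local condor_config. A minimal sketch (the address 192.168.10.1 is illustrative; substitute the IP of the NIC that the rest of the pool can actually reach):

```
# Local condor_config on an execute node (address is illustrative).
# Bind Condor's daemons to the interface the other pool members can
# reach, rather than the wildcard 0.0.0.0.
NETWORK_INTERFACE = 192.168.10.1
```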
I reconfigured it as you suggested, and now I can see that they
communicate with each other. That is, with the command condor_status I get
the following information.
Name           OpSys      Arch   State      Activity  LoadAv  Mem  ActvtyTime
anne           LINUX      INTEL  Owner      Idle       0.000  501  0+00:10:11
bine           LINUX      INTEL  Owner      Idle       0.000  501  0+00:10:11
carmen         LINUX      INTEL  Owner      Idle       0.000  501  0+00:10:11
dana           LINUX      INTEL  Owner      Idle       0.000  501  0+00:10:10
emma           LINUX      INTEL  Owner      Idle       0.000  501  0+00:10:11
franzi         LINUX      INTEL  Owner      Idle       0.060  501  0+00:10:11
grace          LINUX      INTEL  Owner      Idle       0.000  501  0+00:10:11
vm1@xxxxxxxxx  LINUX      INTEL  Owner      Idle       0.070  503  0+00:15:09
vm2@xxxxxxxxx  LINUX      INTEL  Unclaimed  Idle       0.000  503  0+00:15:05
vm1@xxxxxxxxx  SOLARIS28  SUN4u  Owner      Idle       0.000  512  0+00:40:07
vm2@xxxxxxxxx  SOLARIS28  SUN4u  Unclaimed  Idle       0.000  512  0+00:40:05
vm3@xxxxxxxxx  SOLARIS28  SUN4u  Unclaimed  Idle       0.000  512  0+00:40:06
vm4@xxxxxxxxx  SOLARIS28  SUN4u  Unclaimed  Idle       0.000  512  0+00:40:07
vm5@xxxxxxxxx  SOLARIS28  SUN4u  Unclaimed  Idle       0.000  512  0+00:40:08
vm6@xxxxxxxxx  SOLARIS28  SUN4u  Unclaimed  Idle       0.000  512  0+00:40:09
isun25         SOLARIS28  SUN4u  Unclaimed  Idle       0.086   64  0+00:49:54
isun26         SOLARIS28  SUN4u  Unclaimed  Idle       0.008   64  0+00:50:04
isun28         SOLARIS28  SUN4u  Unclaimed  Idle       0.000   64  0+01:50:05
isun35         SOLARIS28  SUN4u  Unclaimed  Idle       0.000  128  0+03:40:05
isun09         SOLARIS28  SUN4x  Unclaimed  Idle       0.008   64  0+00:49:02
isun22         SOLARIS28  SUN4x  Unclaimed  Idle       0.016   64  0+01:35:04
isun23         SOLARIS28  SUN4x  Unclaimed  Idle       0.004   64  0+01:50:04

                 Machines  Owner  Claimed  Unclaimed  Matched  Preempting
    INTEL/LINUX         9      8        0          1        0           0
SUN4u/SOLARIS28        10      1        0          9        0           0
SUN4x/SOLARIS28         3      0        0          3        0           0
          Total        22      9        0         13        0           0
Now I am trying to execute jobs, but the jobs run only on the machine
where they were submitted, not on a remote machine. Do you know what the
problem is? The following is the submit file submitted on "isun01" for a
remote machine. I have both executables on "isun01".
-----------------------------------------------------------
################
#
# Condor submit file for simple test job example
#
################
Universe = vanilla
Executable = hello.$$(OpSys).$$(Arch)
Requirements = (Arch == "INTEL" && OpSys == "LINUX")
transfer_files = ALWAYS
input = /dev/null
output = het.out
error = het.error
log = het.log
Queue
-----------------------------------------------------------
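As an aside, the spelling of the file-transfer submit commands changed across Condor 6.x releases. A hedged sketch of the newer equivalent of the `transfer_files` line above (check the manual for your installed version before using it):

```
# Newer Condor 6.x submit syntax for file transfer (version-dependent):
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
```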
Log files on "isun01" after executing the job 78.0 on "isun01"
-------------------------------------------------------------
==> condor/hosts/isun01/log/NegotiatorLog <==
10/27 17:55:46 Connect failed for 10 seconds; returning FALSE
10/27 17:55:46 Failed to connect to <0.0.0.0:33493>
10/27 17:55:46 Error: Ignoring schedd for this cycle
10/27 17:55:46 Negotiating with baatarts@xxxxxxxxxxxxxxx at <141.35.14.22:55627>
10/27 17:55:46 Request 00078.00000:
10/27 17:55:46 Matched 78.0 baatarts@xxxxxxxxxxxxxxx <141.35.14.22:55627> preempting none <0.0.0.0:33497>
10/27 17:55:46 Successfully matched with dana
10/27 17:55:46 Got NO_MORE_JOBS; done negotiating
10/27 17:55:46 ---------- Finished Negotiation Cycle ----------
==> condor/hosts/isun01/log/SchedLog <==
10/27 17:55:46 Activity on stashed negotiator socket
10/27 17:55:46 Negotiating for owner: baatarts@xxxxxxxxxxxxxxx
10/27 17:55:46 Checking consistency running and runnable jobs
10/27 17:55:46 Tables are consistent
10/27 17:55:46 Out of jobs - 1 jobs matched, 0 jobs idle, flock level = 0
10/27 17:55:46 Sent ad to central manager for baatarts@xxxxxxxxxxxxxxx
10/27 17:55:46 Can't connect to <0.0.0.0:33497>:0, errno = 146
10/27 17:55:46 Will keep trying for 10 seconds...
10/27 17:55:56 Connect failed for 10 seconds; returning FALSE
10/27 17:55:56 Couldn't send REQUEST_CLAIM to startd at <0.0.0.0:33497>
10/27 17:55:56 Sent RELEASE_CLAIM to startd on <0.0.0.0:33497>
10/27 17:55:56 Match record (<0.0.0.0:33497>, 78, 0) deleted
==> condor/hosts/isun01/log/MatchLog <==
10/27 17:55:46 Matched 78.0 baatarts@xxxxxxxxxxxxxxx
<141.35.14.22:55627> preempting none <0.0.0.0:33497>
==> condor/hosts/isun01/log/CollectorLog <==
10/27 17:55:54 (Sent 59 ads in response to query)
10/27 17:55:54 DaemonCore: PERMISSION DENIED to unknown user from host
<141.35.14.189:34481> for command 10 (QUERY_STARTD_PVT_ADS)
Log file on "ipc654" after executing the job 78.0 on "isun01"
-------------------------------------------------------------
10/27 17:50:54 ---------- Started Negotiation Cycle ----------
10/27 17:50:54 Phase 1: Obtaining ads from collector ...
10/27 17:50:54 Getting all public ads ...
10/27 17:50:54 Sorting 56 ads ...
10/27 17:50:54 Getting startd private ads ...
10/27 17:50:54 Couldn't fetch ads: communication error
10/27 17:50:54 Aborting negotiation cycle
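The common thread in the logs above is the `<0.0.0.0:port>` addresses: the schedd is being told to contact startds at the wildcard address, which is what daemons advertise when NETWORK_INTERFACE is set to 0.0.0.0. A small standalone sketch (not a Condor tool; the sample lines are taken from the logs above) for spotting such addresses in a daemon log:

```python
import re

def find_wildcard_addrs(log_text):
    """Return the distinct 'sinful strings' like <0.0.0.0:33497> in a log.

    These usually mean a daemon advertised the wildcard address
    (e.g. NETWORK_INTERFACE = 0.0.0.0) and is unreachable from peers.
    """
    return sorted(set(re.findall(r"<0\.0\.0\.0:\d+>", log_text)))

sample = """\
10/27 17:55:46 Failed to connect to <0.0.0.0:33493>
10/27 17:55:46 Matched 78.0 ... preempting none <0.0.0.0:33497>
10/27 17:55:56 Couldn't send REQUEST_CLAIM to startd at <0.0.0.0:33497>
"""
print(find_wildcard_addrs(sample))  # ['<0.0.0.0:33493>', '<0.0.0.0:33497>']
```

If this list is non-empty for a pool's SchedLog or NegotiatorLog, checking NETWORK_INTERFACE on the machines that own those ports is a reasonable first step.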
> If you have two NON-interconnected networks of SUN and LINUX computers, you
> should set up a gateway as a router, which would forward packets from SUN to
> Linux and back in a transparent manner (from the application's point of view),
> and afterwards set up Condor to be on that network, as specified above.
> Mark
>
> Quoting Degi Baatartsogt <baatarts@xxxxxxxxxxxxxxxxx>:
>
> >
> > Hi Mark,
> >
> > my problem is that we have a Linux cluster (Beowulf) here, so our Linux
> > host has two interfaces. That's why I'm trying to use NETWORK_INTERFACE. I'm
> > not sure what kind of address I should use, but I tried all possibilities.
> > As I understand it, we can't solve this problem until we get the source
> > code. Is that right?
> >
> > On 23 Oct 2003, Mark Silberstein wrote:
> >
> > > Well, I would not mix these two things.
> > > Why do you use 0.0.0.0 settings for NETWORK_INTERFACE? If you have Linux
> > > and SUN pools connected in any way via network, you should not need to
> > > configure Condor to listen on more than one NW interface. Can you be
> > > more specific about your network topology to understand this?
> > > I expect that you would get the same communication problem for whatever
> > > job you run, since ALL Condor communications would not work with
> > > NETWORK_INTERFACE parameter set to 0.0.0.0
> > >
> > >
> > > On Sun, 2003-10-19 at 17:10, Degi Baatartsogt wrote:
> > > > Hi Mark,
> > > >
> > > > thank you for your response!
> > > >
> > > > > Sorry, from our experience this won't work. Condor can't really listen
> > > > > on more than one NW interface, at least we did not succeed. If someone
> > > > > from the team knows the answer, please share it with us!
> > > > > Mark
> > > >
> > > > Does it mean that MPI Condor jobs wouldn't work on the cluster? Because I
> > > > also get the same communication problem if I submit an MPI (MPICH) job on
> > > > Condor in our cluster.
> > > >
> > > > Degi
> > > >
> > > > > On Wed, 2003-10-15 at 14:58, Degi Baatartsogt wrote:
> > > > > > Hello everybody,
> > > > > >
> > > > > > I'm trying to use flocking between the Sun pool and the Linux pool. For
> > > > > > that I changed the flocking parameters in both directions and set
> > > > > > NETWORK_INTERFACE to 0.0.0.0 in the global config file. Now I get the
> > > > > > following messages in the log files. Does anybody know what I should do?
> > > > > >
> > > > > > Thank you
> > > > > > Ms Baatartsogt
> > > > > >
> > > > > > ==> SchedLog <==
> > > > > > 10/15 12:37:59 DaemonCore: Command received via UDP from host <127.0.0.1:yyyyy>
> > > > > > 10/15 12:37:59 DaemonCore: received command 421 (RESCHEDULE), calling handler (reschedule_negotiator)
> > > > > > 10/15 12:37:59 Sent ad to central manager for condor@xxxxxxxxxxxxxxxxxx
> > > > > > 10/15 12:37:59 Called reschedule_negotiator()
> > > > > > 10/15 12:37:59 DaemonCore: PERMISSION DENIED to unknown user from host <127.0.0.1:xxxxx> for command 416 (NEGOTIATE)
> > > > > >
> > > > > > ==> CollectorLog <==
> > > > > > 10/15 12:38:05 DC_AUTHENTICATE: attempt to open invalid session ipc654:15713:1066213385:334, failing.
> > > > > > 10/15 12:38:12 DC_AUTHENTICATE: attempt to open invalid session ipc654:15713:1066213692:349, failing.
> > > > > > 10/15 12:38:17 DC_AUTHENTICATE: attempt to open invalid session ipc654:15713:106...
>
--------------------------------------
| Baatartsogt, O |
| University of Jena, Germany |
--------------------------------------
Condor Support Information:
http://www.cs.wisc.edu/condor/condor-support/
To Unsubscribe, send mail to majordomo@xxxxxxxxxxx with
unsubscribe condor-users <your_email_address>