
Re: [Condor-users] gateway



You need network access from ALL submit nodes to ALL execute nodes and vice versa, over both
UDP and TCP, on some specific ports and on an advertised high port range. Communication from the
submit node to the central manager alone is not sufficient.
 
http://epubs.cclrc.ac.uk/bitstream/919/431.pdf
 
has more details.
 
You could open up the firewall(s) for all these ports on all the execute nodes and submit nodes,
or you could look into using GCB (Generic Connection Brokering); see another thread, which was
about a private network, but the same considerations apply.
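
As a rough way to see which ports actually get through the gateway, a sketch like the one below could be run from a submit node against an execute node, and then again in the other direction. The host name and port range are placeholders, not values from this thread. I believe the range Condor uses can be pinned with the LOWPORT and HIGHPORT settings in condor_config, but check the manual for your version. The sketch only tests TCP, not UDP.

```python
import socket


def check_ports(host, ports, timeout=2.0):
    """Try a TCP connect to each port on host; return {port: reachable}."""
    results = {}
    for port in ports:
        try:
            # create_connection raises OSError on refusal, timeout,
            # or an unresolvable/unreachable host.
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            results[port] = False
    return results


if __name__ == "__main__":
    # Hypothetical values: substitute a real execute node and your own port range.
    EXECUTE_NODE = "execute-node.example.org"
    PORT_RANGE = range(9600, 9611)
    for port, ok in check_ports(EXECUTE_NODE, PORT_RANGE).items():
        state = "reachable" if ok else "blocked or not listening"
        print(f"{EXECUTE_NODE}:{port} {state}")
```

Note that a failed connect cannot tell a firewall block apart from a port on which nothing is listening, so the result is only meaningful for ports where a Condor daemon (or a test listener) is known to be running.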
 
Cheers
 
JK
-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx]On Behalf Of Masao Fujinaga
Sent: Thursday, October 04, 2007 3:24 PM
To: Condor-Users Mail List
Subject: [Condor-users] gateway

We are having problems running jobs submitted from a Linux submit host on a Windows lab behind a gateway. On the Windows machine, we see the following errors in the starter log:

10/3 19:10:35 Communicating with shadow <129.128.125.15:37473>
10/3 19:10:35 Submitting machine is "opteron-cluster.nic.ualberta.ca"
10/3 19:12:34 condor_read(): recv() returned -1, errno = 10054, assuming failure reading 5 bytes from <129.128.125.15:55548>.
10/3 19:12:34 ERROR "Assertion ERROR on (result)" at line 113 in file ..\src\condor_starter.V6.1\NTsenders.C
10/3 19:12:34 ERROR "LocalUserLog::logStarterError() called before init()" at line 205 in file ..\src\condor_starter.V6.1\local_user_log.C

On the submit node, the shadow log shows:

10/3 19:16:58 Initializing a VANILLA shadow for job 85.0
10/3 19:17:18 (85.0) (13769): condor_read(): timeout reading 5 bytes from <129.128.237.81:1050>.
10/3 19:17:18 (85.0) (13769): Request to run on <129.128.237.81:1050> was ACCEPTED
10/3 19:18:06 (85.0) (13769): condor_read(): timeout reading 5 bytes from <129.128.237.81:1050>.
10/3 19:19:16 (85.0) (13769): condor_read(): recv() returned -1, errno = 104, assuming failure reading 5 bytes from unknown source.
10/3 19:19:16 (85.0) (13769): ERROR "Can no longer talk to condor_starter <129.128.237.81:1050>" at line 123 in file NTreceivers.C

We have opened holes in the gateway so that the lab can communicate with both the submit host and the central manager. We can ping between these machines without any problems, and the collector gathers information about the available machines. However, something specific to the submit-execute communication seems to be blocked by the gateway. If the gateway is opened up, everything works fine.
Is there anything we can change in Condor or on the gateway to make this work?

Thanks for your time.

Masao



--

Masao Fujinaga         

fujinaga@xxxxxxxxxxx    Tel.: (780) 492-2117  Fax.: (780) 492-1729

Research Computing Support

Academic Information and Communication Technologies (AICT)  

University of Alberta, Edmonton, Alberta, CANADA T6G 2H1

