Re: [Condor-users] Do the following settings remove all UDP communication from a Condor pool?
- Date: Mon, 15 Aug 2011 11:51:39 -0400
- From: Ian Chesal <ichesal@xxxxxxxxxxxxxxxxxx>
- Subject: Re: [Condor-users] Do the following settings remove all UDP communication from a Condor pool?
On Monday, 15 August, 2011 at 11:43 AM, Dan Bradley wrote:
> On 8/15/11 10:32 AM, Ian Chesal wrote:
>> On Monday, 15 August, 2011 at 10:59 AM, Dan Bradley wrote:
>>> Ian,
>>>
>>> I believe the settings you mentioned will achieve what you are
>>> trying to do. In 7.6, it should also be sufficient to do this:
>>>
>>> WANT_UDP_COMMAND_SOCKET = false
>>> UPDATE_COLLECTOR_WITH_TCP = True
>>> COLLECTOR_MAX_FILE_DESCRIPTORS = 3000
>>>
>>> In 7.6, daemons that do not have a UDP port advertise this fact in
>>> their address information. Therefore, it is not necessary to fiddle
>>> with protocol knobs such as SCHEDD_SEND_VACATE_VIA_TCP, because the
>>> client automatically switches to TCP when it sees that the server
>>> lacks a UDP port.
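
Just to make sure I'm reading that right: on the 7.6.x machines the
whole thing collapses to a fragment along these lines, with
COLLECTOR_MAX_FILE_DESCRIPTORS only mattering on the box that runs the
collector? (The 3000 is just the illustrative number from your example,
not something I've tuned.)

# Pool-wide: no UDP command socket, and send collector updates over TCP
WANT_UDP_COMMAND_SOCKET = False
UPDATE_COLLECTOR_WITH_TCP = True
# Collector host: headroom for the extra TCP sockets the updates now use
COLLECTOR_MAX_FILE_DESCRIPTORS = 3000
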
>> Thanks Dan! That's sufficient incentive for me to ensure everything
>> is 7.6.x in my pool then. Right now I'm running a 7.4.3 scheduler and
>> CM, but 7.6.1 on the execute node.
>>
>> For what it's worth, this is all to debug an issue I'm seeing on a
>> large CPU count Windows 2k8 Server machine. It has 40 physical cores
>> but Condor only seems to be able to utilize 12 slots on the box
>> before it starts to fail to accept claims from the shadows with:
>>
>> 08/10/11 18:29:50 Received TCP command 444 (ACTIVATE_CLAIM) from unauthenticated@unmapped <10.78.194.211:40724>, access level DAEMON
>> 08/10/11 18:29:50 Calling HandleReq <command_activate_claim> (0)
>> 08/10/11 18:29:50 slot25: Got activate_claim request from shadow (<10.78.194.211:40724>)
>> 08/10/11 18:30:06 condor_write(): Socket closed when trying to write 13 bytes to <10.78.194.211:40724>, fd is 1356
>> 08/10/11 18:30:06 Buf::write(): condor_write() failed
>> 08/10/11 18:30:06 slot25: Can't send eom to shadow.
>> 08/10/11 18:30:06 Return from HandleReq <command_activate_claim> (handler: 15.615s, sec: 0.016s)
> What appears in ShadowLog during the time between 18:29:50 and
> 18:30:06?
Unfortunately I don't have the ShadowLog for that specific point in time. I'll recreate the failure this afternoon though and get you both sides of the communication channel. Going from memory, the ShadowLog was approximately:
- Failed to receive EOM from startd at <IP>
- Socket closed during read
- Sending deactivate claim forcibly message to <IP>
- Shadow exit
This happens with any schedd I try, running on either Linux or Windows. Both schedds had only a handful of jobs in the queue, all set to run against this one 40-core machine. The Linux scheduler I was using typically handles 500+ active, running jobs without breaking a sweat.
My call late last week looking for anyone running large-core-count Windows boxes was to try to ascertain whether Condor has even been scaled up to this level on Windows execute nodes yet. I don't see any large Windows machines in the pools that make their collector data public; they seem to be dominated by Linux and to top out around 24 cores.
>> I realize that's TCP traffic but I was wondering if UDP
>> communications from the shadows trying to get in touch with the
>> startd on the box were causing problems in general for the Windows
>> networking stack.
>>> Since the address of the collector is hard-coded into the
>>> configuration file, there are two options: one is to use
>>> UPDATE_COLLECTOR_WITH_TCP, as in the above example. The other is to
>>> add the "noUDP" flag to the collector address. Example:
>>>
>>> COLLECTOR_HOST = mycollector.host.name:9618?noUDP
>> I've still got 7.4.3 execute nodes in the pool -- will they recognize
>> this flag?
> 7.4 daemons will ignore this flag. Daemons from before 7.3 will likely
> not accept this as a valid address.
Excellent.
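
And just so it's written down somewhere: if I go the address-flag route
instead of UPDATE_COLLECTOR_WITH_TCP once the 7.4.3 nodes are upgraded,
I'm assuming the relevant lines reduce to something like this (the host
name is a stand-in, not our real central manager):

WANT_UDP_COMMAND_SOCKET = False
COLLECTOR_HOST = mycollector.host.name:9618?noUDP
# Presumably still wanted on the collector host for the TCP update load
COLLECTOR_MAX_FILE_DESCRIPTORS = 3000
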
Regards,
- Ian
---
Ian Chesal
Cycle Computing, LLC
Leader in Open Compute Solutions for Clouds, Servers, and Desktops
Enterprise Condor Support and Management Tools
http://www.cyclecomputing.com
http://www.cyclecloud.com
http://twitter.com/cyclecomputing