
[Condor-users] Questions about GCB



Hi,

I have a few questions about GCB.

Background:
Here at UofC we are trying to implement a homogeneous Condor cluster of Linux compute machines on top of Windows hosts using virtualization. Each virtual machine communicates with the outside world through a one-machine NAT network between the host and the guest OS.

The problem we are having is with the GCB machine. It seems to simply drop all of the connections between the execute-only machines and the collector. The odd thing is that netstat on the GCB machine still shows open TCP and UDP connections to both the nodes and the collector. Each execute-only machine has roughly 10-30 connections, and there are only ~20 machines at the moment (we are working toward >2000 over the next year or so). The only way to get the nodes to reconnect is to kill all GCB processes and restart them. The nodes then gradually find the collector again.

This happens under moderately high job turnover, but the number of connections being created on the GCB machine is considerably lower than the Linux kernel maximums in /proc.
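
For anyone who wants to reproduce the per-node connection counts, here is a minimal sketch of how they could be tallied from netstat output captured on the GCB machine. It assumes the usual Linux column layout of `netstat -tun`. The function name, addresses, and ports in the sample are made up for illustration and are not taken from our setup.

```python
from collections import Counter

# Hypothetical sample in the `netstat -tun` layout; addresses are invented.
SAMPLE = """Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 192.168.1.1:9618        10.0.0.5:41234          ESTABLISHED
tcp        0      0 192.168.1.1:9618        10.0.0.5:41235          ESTABLISHED
udp        0      0 192.168.1.1:9618        10.0.0.6:50000
"""


def count_connections(netstat_output):
    """Count TCP/UDP sockets per remote host in netstat-style text."""
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Data rows begin with the protocol name; banner and header rows do not.
        if len(fields) < 5 or not fields[0].startswith(("tcp", "udp")):
            continue
        remote = fields[4]                # "Foreign Address" column, host:port
        host = remote.rsplit(":", 1)[0]   # strip the trailing port
        counts[host] += 1
    return counts


if __name__ == "__main__":
    for host, n in count_connections(SAMPLE).most_common():
        print(host, n)
```

Feeding it the real output of netstat on the GCB machine, taken before and after a drop, would show whether the per-node counts change when the nodes lose the collector.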

End Background

Finally, my questions: Are we using GCB incorrectly? The execute machines make no connections to the collector that I can see in netstat. Is GCB designed only for a few NAT networks of more than one machine each?

Thanks

-Dave
===============================
Dave Schulz
Research Computing Services
Information Technologies
University of Calgary
dschulz@xxxxxxxxxxx
(403) 220-2102