Hi All,

I can't believe what I'm seeing here. My little MPI cluster, on which I'm
experimenting with GCB, consists of 10 exactly identical boxes. All are
equipped with onboard Realtek Ethernet cards, all 10 machines boot from the
network, and all are supplied with the same image.

Last week I surprised one of the 10 boxes with an Intel 10/100 card to do a
little performance benchmarking. It just so happens that this box DOES start
Condor (configured to use GCB) correctly. All other boxes with the Realtek
cards fail.

Can somebody please explain how the heck this is possible? I knew Realtek was
crap, but this bad! I mean, Condor without GCB works like a charm on these
boxes. I therefore find it hard to believe there's something physically wrong
with these boxes. Is this a Condor issue or a driver issue? I'm lost...

In case this is a Condor thingy, I attached 2 MasterLog files: one from the
machine with the Intel card, which starts successfully, and one from one of
the Realtek machines.

Kind Regards,

Cor

>> Cor Cornelisse <ccorneli@xxxxxxxx> wrote:
>>> 12/7 22:11:45 GCB: GCB_bind: _myIP failed
>>
>> The most likely cause is that your machine (the one with the
>> master) doesn't have any active IP addresses beyond loopback
>> (127.0.0.1). That seems plausible on your laptop if you tried to
>> start Condor before attaching to a network.
>>
>> That doesn't explain why you would see that error message on your
>> execute nodes, which presumably are working fine. To take a wild
>> guess, are you starting Condor in your init scripts? If so, is
>> Condor possibly higher priority than initializing the network?
>> Having Condor start before the network is up is a recipe for
>> problems.
>>
>> If that's not the case for your execute nodes, you might want to
>> double check that you're not seeing a different error.
>>
>> --
>> Alan De Smet          Condor Project Research
>> adesmet@xxxxxxxxxxx   http://www.condorproject.org/
>> _______________________________________________
>> Condor-users mailing list
>> To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
>> subject: Unsubscribe
>> You can also unsubscribe by visiting
>> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
>>
>> The archives can be found at either
>> https://lists.cs.wisc.edu/archive/condor-users/
>> http://www.opencondor.org/spaces/viewmailarchive.action?key=CONDOR
>>
>
> Hi,
>
> I'm sure networking is up before Condor. I do start the service through
> init scripts, but to test whether your hypothesis is correct, I simply
> restarted the condor service, resulting in the same error. So the network
> is definitely up and running. I set the MasterLog debug option to D_ALL;
> this gives a little more debug information, but still not enough for me to
> understand what's going wrong (it looks like it's trying to bind to
> 0.0.0.0 :s)
>
> Anyone?
>
> 12/8 18:29:12 (fd:3) (pid:4559) Using config source: /opt/condor/etc/condor_config
> 12/8 18:29:12 (fd:3) (pid:4559) Using local config sources:
> 12/8 18:29:12 (fd:3) (pid:4559) /var/condor/condor_config.local
> 12/8 18:29:12 (fd:5) (pid:4559) Attempting to lock /tmp/condor-lock.portal0.998036533202143/InstanceLock.
> 12/8 18:29:12 (fd:6) (pid:4559) Obtained lock on /tmp/condor-lock.portal0.998036533202143/InstanceLock.
> 12/8 18:29:12 (fd:6) (pid:4559) Setting up command socket
> 12/8 18:29:12 (fd:6) (pid:4559) CONDOR_INHERIT: is NULL
> 12/8 18:29:12 (fd:7) (pid:4559) GCB: GCB_socket(fd = 6, TCP)
> 12/8 18:29:12 (fd:7) (pid:4559) PRIV_CONDOR --> PRIV_ROOT at sock.C:526
> 12/8 18:29:12 (fd:7) (pid:4559) GCB: GCB_bind(6[GCB_SOCKET], <0.0.0.0:0>)
> 12/8 18:29:12 (fd:7) (pid:4559) GCB: GCB_bind: _myIP failed
> 12/8 18:29:12 (fd:7) (pid:4559) PRIV_ROOT --> PRIV_CONDOR at sock.C:532
> 12/8 18:29:12 (fd:7) (pid:4559) bind failed errno = 0
> 12/8 18:29:12 (fd:7) (pid:4559) Failed to bind to command ReliSock
> 12/8 18:29:12 (fd:7) (pid:4559) (Make sure your IP address is correct in /etc/hosts.)
> 12/8 18:29:12 (fd:7) (pid:4559) ERROR "BindAnyCommandPort failed" at line 6808 in file daemon_core.C
>
> --
> A lie told often enough becomes the truth.
>
> Lenin (1870 - 1924)

--
A lie told often enough becomes the truth.

Lenin (1870 - 1924)

Attachment: realtek_eth_machine.log (binary data)
Attachment: intel_eth_machine.log (binary data)
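
The log's own hint points at /etc/hosts, and the earlier reply suspects that the machine has no active address beyond loopback. One thing worth ruling out is therefore that a node's hostname resolves only to a loopback or unspecified address. The following is a minimal Python sketch of that check. It is not Condor's or GCB's actual logic: how GCB determines its own IP is not shown in this thread, so tying `_myIP failed` to hostname resolution is an assumption. The function names classify_addresses and hostname_addresses are made up for illustration.

```python
import ipaddress
import socket


def classify_addresses(addresses):
    """Split address strings into (usable, unusable) lists.

    An address counts as unusable here if it is loopback (127.0.0.0/8)
    or unspecified (0.0.0.0). This rule is an assumption for the sketch,
    not something taken from GCB.
    """
    usable, unusable = [], []
    for text in addresses:
        ip = ipaddress.ip_address(text)
        if ip.is_loopback or ip.is_unspecified:
            unusable.append(text)
        else:
            usable.append(text)
    return usable, unusable


def hostname_addresses(hostname=None):
    """Return the IPv4 addresses the system resolver gives for hostname.

    Defaults to this machine's own hostname. Returns an empty list if
    the name cannot be resolved.
    """
    hostname = hostname or socket.gethostname()
    try:
        infos = socket.getaddrinfo(hostname, None, socket.AF_INET)
    except OSError:
        return []
    # For AF_INET, each entry's sockaddr is (address, port).
    # De-duplicate while keeping the resolver's order.
    seen = []
    for info in infos:
        addr = info[4][0]
        if addr not in seen:
            seen.append(addr)
    return seen


if __name__ == "__main__":
    name = socket.gethostname()
    usable, unusable = classify_addresses(hostname_addresses(name))
    print(f"hostname: {name}")
    print(f"usable addresses: {usable}")
    print(f"loopback/unspecified addresses: {unusable}")
```

Running this on one of the failing Realtek nodes and on the Intel node, then comparing the output, would show whether name resolution differs between the two. An empty "usable addresses" list on the failing nodes would point toward /etc/hosts or the network-boot configuration and away from the driver.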