Re: [HTCondor-users] protocol error in collector after housekeeping
- Date: Mon, 27 Jun 2016 23:43:40 +0000
- From: Klint Gore <kgore4@xxxxxxxxxx>
- Subject: Re: [HTCondor-users] protocol error in collector after housekeeping
DEFAULT_DOMAIN_NAME = agbu.localdomain
NO_DNS = True
The original message has the full condor_config.local; condor_config is the default. I came up with those settings when I originally installed the 8.0.x series, so some of them may not be optimal now.
Klint.
-----Original Message-----
From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Todd Tannenbaum
Sent: Tuesday, 28 June 2016 4:24 AM
To: HTCondor-Users Mail List
Subject: Re: [HTCondor-users] protocol error in collector after housekeeping
Hi Klint,
I think I can reproduce the problem you observed by setting
NO_DNS = True
in my central manager.
What values of NO_DNS and DEFAULT_DOMAIN_NAME are you using for your collector? Assuming you have NO_DNS = True in your setup, I think I now know enough about the problem to patch the code and spare future users this pain.
Thanks for reporting the issue.
regards,
Todd
On 6/27/2016 4:04 AM, Klint Gore wrote:
> I'd just found that and tested it as your message came in.
>
> [root@xxxxxxxxx condor]# condor_config_val -master CONDOR_DEVELOPERS_COLLECTOR
> Not defined
>
> Setting that to NONE stopped it crashing.
>
> It resolves to 128.105.19.35. Does it use a library to look that up?
> The machine is a minimal centos 7 install so maybe there's a library
> missing.
>
> These machines don't have any access to the outside world anyway so
> it'll never connect.
>
> Klint.
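To check how the name resolves on the collector machine, something like the following could be run there. This is only a minimal sketch: the helper name `resolve` is hypothetical, the port 9618 and the UDP socket type are taken from the collector log in this thread, and it uses the system resolver through Python, which may not match the lookup path HTCondor itself takes (especially with NO_DNS = True).

```python
import socket


def resolve(host, port=9618):
    """Return sorted (family, address) pairs for host, or [] if the lookup fails.

    Hypothetical helper: it asks the system resolver (/etc/hosts, DNS),
    which is not necessarily what the HTCondor daemons do internally.
    """
    try:
        # SOCK_DGRAM because the collector update in the log is sent via UDP.
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_DGRAM)
    except socket.gaierror:
        # Name could not be resolved at all.
        return []
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address string for both IPv4 and IPv6.
    return sorted({(family.name, sockaddr[0]) for family, _, _, _, sockaddr in infos})
```

Calling `resolve("condor.cs.wisc.edu")` on the central manager and comparing the result with 128.105.19.35 would show whether /etc/hosts or DNS is returning something unexpected; an empty list means the name did not resolve at all.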
>
> From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Todd Tannenbaum
> Sent: Monday, 27 June 2016 6:36 PM
> To: HTCondor-Users Mail List
> Subject: Re: [HTCondor-users] protocol error in collector after housekeeping
>
> Hi Klint,
>
>
>
> Looks like your collector machine has something bogus set up in the
> /etc/hosts file or DNS when resolving "condor.cs.wisc.edu".
> Could you investigate that for us?
>
> Meanwhile as an immediate workaround, perhaps you could avoid the
> problem if you put in the condor_config file on your central manager
> machine:
>
> CONDOR_DEVELOPERS_COLLECTOR = NONE
>
> Hope this helps,
>
> Todd
>
> Sent from my iPhone
>
>
> On Jun 27, 2016, at 2:38 AM, Klint Gore <kgore4@xxxxxxxxxx> wrote:
>
> Just in case
>
> [root@xxxxxxxxx condor]# condor_config_val -v COLLECTOR_HOST
> COLLECTOR_HOST = 10.1.1.55
> # at: <Default>
> # raw: COLLECTOR_HOST = $(CONDOR_HOST)
>
>
> -----Original Message-----
> From: Klint Gore
> Sent: Monday, 27 June 2016 5:40 PM
> To: HTCondor-Users Mail List
> Subject: RE: protocol error in collector after housekeeping
>
> [root@xxxxxxxxx condor]# condor_config_val -master CONDOR_HOST
> 10.1.1.55
> [root@xxxxxxxxx condor]# condor_config_val -v CONDOR_HOST
> CONDOR_HOST = 10.1.1.55
> # at: /etc/condor/config.d/condor_config.local, line 1
> # raw: CONDOR_HOST = 10.1.1.55
>
> Jobs do get run in the 15 minutes after the collector restarts until
> the housekeeper kicks in.
>
> ------ collector log with D_FULLDEBUG
>
> 06/27/16 17:22:41 Housekeeper: Ready to clean old ads
> 06/27/16 17:22:41 Cleaning StartdAds ...
> 06/27/16 17:22:41 Cleaning StartdPrivateAds ...
> 06/27/16 17:22:41 Cleaning ScheddAds ...
> 06/27/16 17:22:41 Cleaning SubmittorAds ...
> 06/27/16 17:22:41 Cleaning LicenseAds ...
> 06/27/16 17:22:41 Cleaning MasterAds ...
> 06/27/16 17:22:41 Cleaning CkptServerAds ...
> 06/27/16 17:22:41 Cleaning CollectorAds ...
> 06/27/16 17:22:41 Cleaning StorageAds ...
> 06/27/16 17:22:41 Cleaning NegotiatorAds ...
> 06/27/16 17:22:41 Cleaning HadAds ...
> 06/27/16 17:22:41 Cleaning GridAds ...
> 06/27/16 17:22:41 Cleaning XferServiceAds ...
> 06/27/16 17:22:41 Cleaning LeaseManagerAds ...
> 06/27/16 17:22:41 Cleaning Generic Ads ...
> 06/27/16 17:22:41 Housekeeper: Done cleaning
> 06/27/16 17:22:42 ScheddAd : Updating ... "< 10-1-1-61.agbu.localdomain , 10.1.1.61 >"
> 06/27/16 17:22:42 In OfflineCollectorPlugin::update ( 1 )
> 06/27/16 17:22:42 CollectorAd : Updating ... "< AGBU@xxxxxxxxxxxxxxxxxxxxxxxxxx >"
> 06/27/16 17:22:42 Attempting to send update via UDP to collector condor.cs.wisc.edu <:9618>
> 06/27/16 17:22:42 ERROR "Unknown protocol (1) in Sock::bind(); aborting." at line 741 in file /slots/01/dir_1114870/userdir/.tmpthm9vL/BUILD/condor-8.4.7/src/condor_io/sock.cpp
> ------
>
> Looks like the address is blank in that "Attempting to send update" line.
>
> Klint.
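The blank address noted above (the `<:9618>` at the end of the "Attempting to send update" line) can be spotted mechanically in a collector log. The following is only a rough sketch: the function name `blank_update_targets` is hypothetical, and the regular expression assumes the `<host:port>` form seen in this thread (hostname or IPv4 only, no bracketed IPv6).

```python
import re

# Matches an address of the form <host:port>, where host may be empty.
# Assumption: hostname or IPv4 only; bracketed IPv6 addresses are not handled.
ADDR_RE = re.compile(r"<([^<>:\s]*):(\d+)>")


def blank_update_targets(lines):
    """Return the 'Attempting to send update' lines whose <host:port> has an empty host."""
    hits = []
    for line in lines:
        if "Attempting to send update" not in line:
            continue
        match = ADDR_RE.search(line)
        if match and match.group(1) == "":
            hits.append(line.rstrip("\n"))
    return hits
```

Feeding it the lines of the collector log (for example `blank_update_targets(open(path))`, with `path` pointing at wherever the log lives on your install) would list every update attempt that had no resolved address.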
>
> -----Original Message-----
> From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On
> Behalf Of Iain Bradford Steers
> Sent: Monday, 27 June 2016 4:35 PM
> To: HTCondor-Users Mail List
> Subject: Re: [HTCondor-users] protocol error in collector after
> housekeeping
>
> Hi Klint,
>
> I've seen this error message type in the past when I've accidentally
> appended the port to the address a second time.
>
> However your CONDOR_HOST var seems okay.
>
> Could you run the following:
>
> condor_config_val -master CONDOR_HOST
>
> condor_config_val -v CONDOR_HOST
>
> I think we can ignore the connection refused error for the moment.
> The master doesn't know the collector is dead, so is trying to send
> an update, I think. (Sounds like a bug in itself really)
>
> Could you bump up the debugging?
>
> MASTER_DEBUG = D_FULLDEBUG
> COLLECTOR_DEBUG = D_FULLDEBUG
>
> Cheers, Iain
> _______________________________________________
> HTCondor-users mailing list
> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx
> with a subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
>
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/htcondor-users/
>
>
>
--
Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing Department of Computer Sciences
HTCondor Technical Lead 1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132 Madison, WI 53706-1685