Thanks - I hadn't changed anything else on the machine, but we've been having problems since a restart of the machine after a power outage a month ago. I'm not sure where to look regarding the DNS server, and I know that's outside the domain of HTCondor, so I'll check into that.

steve

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of John M Knoeller <johnkn@xxxxxxxxxxx>

One thought: a DNS server that is taking 30 seconds to respond to queries might explain what we are seeing.

-tj

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx>
On Behalf Of John M Knoeller

This is starting to look like something broken in your network stack that is outside of HTCondor. The schedd starts the shadow at 06:58:03, but the shadow doesn't actually begin logging until 06:58:33. The shadow doesn't do much on startup before it begins logging, but one of the things it does is read the HTCondor configuration files. Are those files on a shared file system, perhaps?

The shadow then contacts the startd at 06:58:33, but then times out 30 seconds later. During this time we see this in the StarterLog:

03/26/19 06:58:33 slot1: Changing state: Unclaimed -> Claimed
03/26/19 06:58:53 condor_read(): timeout reading 5 bytes from <192.168.2.1:56069>.

A timeout after 20 seconds. This looks a lot like your network is occasionally stalling for periods of 30 seconds, and this is leading HTCondor to time out a lot.
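If reading the configuration is the slow step, that is easy to time directly. A minimal sketch, assuming `condor_config_val` is on PATH (output is silenced so the sketch still runs, just quickly, on a machine without HTCondor):

```shell
# Time a full configuration read, roughly what the shadow does before it
# begins logging. Well under a second is normal on a local filesystem;
# ~30s would implicate a shared or remote config source.
start=$(date +%s)
condor_config_val -dump >/dev/null 2>&1
took=$(( $(date +%s) - start ))
echo "config read took ${took}s"
```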
Let's turn on some more logging and maybe get more insight into the problem. Please add this to your configuration:

SHADOW_DEBUG = $(SHADOW_DEBUG) D_CAT D_FULLDEBUG D_NETWORK
STARTD_DEBUG = $(STARTD_DEBUG) D_CAT D_FULLDEBUG D_NETWORK

Then run condor_reconfig on both the submit and execute machines and try again to run a job.

-tj

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx>
On Behalf Of Upton, Stephen (Steve) (CIV)

Here's a bit more time:

03/26/19 06:58:03 (pid:51270) Activity on stashed negotiator socket: <192.168.2.1:51109>
03/26/19 06:58:03 (pid:51270) Using negotiation protocol: NEGOTIATE
03/26/19 06:58:03 (pid:51270) Negotiating for owner: condor@xxxxxxxxxxxxxxx
03/26/19 06:58:03 (pid:51270) Lost priority - 1 jobs matched
03/26/19 06:58:03 (pid:51270) Finished negotiating for condor in local pool: 1 matched, 0 rejected
03/26/19 06:58:03 (pid:51270) Starting add_shadow_birthdate(3.0)
03/26/19 06:58:03 (pid:51270) Started shadow for job 3.0 on slot1@xxxxxxxxxxxxxxxxxx <192.168.2.1:51092?addrs=192.168.2.1-51092> for condor, (shadow pid = 92037)
03/26/19 06:58:10 (pid:51270) Number of Active Workers 0
03/26/19 06:59:03 (pid:51270) Activity on stashed negotiator socket: <192.168.2.1:51109>
03/26/19 06:59:03 (pid:51270) Using negotiation protocol: NEGOTIATE
03/26/19 06:59:03 (pid:51270) Negotiating for owner: condor@xxxxxxxxxxxxxxx
03/26/19 06:59:03 (pid:51270) Finished negotiating for condor in local pool: 0 matched, 0 rejected
03/26/19 06:59:03 (pid:51270) Shadow pid 92037 for job 3.0 exited with status 108

steve

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of John M Knoeller <johnkn@xxxxxxxxxxx>

What is happening in the SchedLog during this 20 second time period?

03/26/19 06:58:33 slot1: Changing state: Unclaimed -> Claimed
03/26/19 06:58:53 condor_read(): timeout reading 5 bytes from <192.168.2.1:56069>.

-tj

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx>
On Behalf Of Upton, Stephen (Steve) (CIV)

Follow-up: the firewall on the Mac is turned off.

steve

From: Stephen Upton <scupton@xxxxxxx>

Here you go:

03/26/19 06:58:03 slot1: Request accepted.
03/26/19 06:58:33 slot1: Remote owner is condor@xxxxxxxxxxxxxxx
03/26/19 06:58:33 slot1: State change: claiming protocol successful
03/26/19 06:58:33 slot1: Changing state: Unclaimed -> Claimed
03/26/19 06:58:53 condor_read(): timeout reading 5 bytes from <192.168.2.1:56069>.
03/26/19 06:58:53 IO: Failed to read packet header
03/26/19 06:58:53 Can't read ClaimId
03/26/19 06:59:03 slot1: Called deactivate_claim()
03/26/19 06:59:03 slot1: State change: received RELEASE_CLAIM command
03/26/19 06:59:03 slot1: Changing state and activity: Claimed/Idle -> Preempting/Vacating
03/26/19 06:59:03 slot1: State change: No preempting claim, returning to owner
03/26/19 06:59:03 slot1: Changing state and activity: Preempting/Vacating -> Owner/Idle
03/26/19 06:59:03 slot1: State change: IS_OWNER is false
03/26/19 06:59:03 slot1: Changing state: Owner -> Unclaimed

I also have the condor directory shared (or at least with group rw). I'm also running condor as user condor and not as root, if that helps.

steve

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of John M Knoeller <johnkn@xxxxxxxxxxx>

Yes, this looks a lot like a failure to authorize the connection, or possibly a firewall.

03/26/19 06:59:03 (3.0) (92037): condor_write(): Socket closed when trying to write 4096 bytes to startd slot1@xxxxxxxxxxxxxxxxxx, fd is 5
03/26/19 06:59:03 (3.0) (92037): Buf::write(): condor_write() failed
03/26/19 06:59:03 (3.0) (92037): slot1@xxxxxxxxxxxxxxxxxx: DCStartd::activateClaim: Failed to send job ClassAd to the startd

What does the StartLog and/or StarterLog show at 6:58/6:59 on reaper.ern.nps.edu?

-tj

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx>
On Behalf Of Upton, Stephen (Steve) (CIV)

TJ,

Thanks for responding!

Are you running 8.8.1 on all of the nodes?
No, but I'm only testing on my central manager now. I want to get that correct before updating my Windows nodes. The job(s) I'm submitting are targeted to the central manager.

Does the output of condor_status show all of your execute nodes?
It shows all the slots on my central manager.

Does condor_q show your jobs?
Yes. I did notice that it was taking exactly 30 seconds for condor to respond to a condor_q, condor_status, and condor_submit, so I think that helps confirm my suspicion that there is a setting somewhere I don't have correct.
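A uniform 30-second delay like this can be narrowed down by timing each tool separately: condor_status talks to the collector and condor_q to the schedd, so if both stall identically, a shared step such as name resolution is suspect. A rough sketch (output is silenced so it runs even where a tool is missing; the IP is the pool address from this thread):

```shell
# Time each client tool, then a reverse lookup of the pool address.
# If the lookup alone takes ~30s, DNS is the bottleneck, not HTCondor.
for cmd in "condor_status" "condor_q" "host 192.168.2.1"; do
  start=$(date +%s)
  $cmd >/dev/null 2>&1        # unquoted on purpose: splits "host 192.168.2.1"
  elapsed=$(( $(date +%s) - start ))
  echo "$cmd: ${elapsed}s"
done
```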
Here's the (partial) output from the ShadowLog:

03/26/19 06:58:33 Using local config sources:
03/26/19 06:58:33    /usr/local/condor/local.reaper/condor_config.local
03/26/19 06:58:33 config Macros = 70, Sorted = 70, StringBytes = 2008, TablesBytes = 1168
03/26/19 06:58:33 CLASSAD_CACHING is OFF
03/26/19 06:58:33 Daemon Log is logging: D_ALWAYS D_ERROR
03/26/19 06:58:33 Daemoncore: Listening at <0.0.0.0:56067> on TCP (ReliSock).
03/26/19 06:58:33 DaemonCore: command socket at <192.168.2.1:56067?addrs=192.168.2.1-56067&noUDP>
03/26/19 06:58:33 DaemonCore: private command socket at <192.168.2.1:56067?addrs=192.168.2.1-56067>
03/26/19 06:58:33 Initializing a VANILLA shadow for job 3.0
03/26/19 06:59:03 (3.0) (92037): condor_write(): Socket closed when trying to write 4096 bytes to startd slot1@xxxxxxxxxxxxxxxxxx, fd is 5
03/26/19 06:59:03 (3.0) (92037): Buf::write(): condor_write() failed
03/26/19 06:59:03 (3.0) (92037): slot1@xxxxxxxxxxxxxxxxxx: DCStartd::activateClaim: Failed to send job ClassAd to the startd
03/26/19 06:59:03 (3.0) (92037): Job 3.0 is being evicted from slot1@xxxxxxxxxxxxxxxxxx
03/26/19 06:59:03 (3.0) (92037): logEvictEvent with unknown reason (108), not logging.
03/26/19 06:59:03 (3.0) (92037): **** condor_shadow (condor_SHADOW) pid 92037 EXITING WITH STATUS 108

Here is a copy of my condor_config.local as well:

## Where have you installed the bin, sbin and lib condor directories?
RELEASE_DIR = /usr/local/condor

ENABLE_IPV6 = False
USE_SHARED_PORT = False

## Where is the local condor directory for each host?
## This is where the local config file(s), logs and
## spool/execute directories are located. This is the
## default for Linux and Unix systems; this is the
## default on Windows systems.
NETWORK_HOSTNAME = reaper.ern.nps.edu
NETWORK_INTERFACE = 192.168.2.1
LOCAL_DIR = /usr/local/condor/local.reaper
UID_DOMAIN = reaper.localnet
JAVA = /usr/bin/java
CONDOR_ADMIN = root@xxxxxxxxxxxxxxxxxxxxxx
MAIL = /usr/bin/mail
FILESYSTEM_DOMAIN = reaper.localnet
LOCK = /tmp/condor-lock.0.21512031190014
JAVA_MAXHEAP_ARGUMENT = -Xmx1024m
DAEMON_LIST = COLLECTOR, MASTER, NEGOTIATOR, SCHEDD, STARTD
CONDOR_HOST = 192.168.2.1
CONDOR_IDS = 503.20
NUM_CPUS = 15
SEC_DEFAULT_NEGOTIATION = OPTIONAL
HOSTALLOW_WRITE = 192.168.2.*

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of John M Knoeller <johnkn@xxxxxxxxxxx>

OK. We need a bit more information in order to figure out what is happening. Let's start with the basics.

Are you running 8.8.1 on all of the nodes?
Does the output of condor_status show all of your execute nodes?
Does condor_q show your jobs?

If the jobs are getting matches but failing to start, then the place to look is in the ShadowLog on the submit machine. Run condor_config_val SHADOW_LOG on the submit node to find out where that is. You should expect to see messages indicating that a condor_shadow has started up, and then it will identify what job it is attempting to run.
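That lookup can be scripted in one line; a sketch in which the fallback path is only a guess derived from the LOCAL_DIR value above, not something stated in this thread, for machines where the tool isn't on PATH:

```shell
# Ask HTCondor where the ShadowLog lives; fall back to an assumed
# $(LOCAL_DIR)/log path if condor_config_val isn't available.
log=$(condor_config_val SHADOW_LOG 2>/dev/null) \
  || log="/usr/local/condor/local.reaper/log/ShadowLog"
echo "ShadowLog: $log"
```

Running `tail -f` on that path while resubmitting the job shows the shadow's startup and eviction messages as they happen.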
-tj

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx>
On Behalf Of Upton, Stephen (Steve) (CIV)

Hi all,

HTCondor is running really slow, and now it's not accepting jobs, i.e., when I do a better-analyze, it matches, then subsequently rejects. I had 8.6.1 installed, removed that, and installed 8.8.1, hoping that was the problem. The central manager is on Mac OS 10, with several Windows execute nodes. I also have the Mac as an execute and submit node. I'm sure this is a configuration issue somewhere, but I can't figure out where. I do get an "init_local_hostname_impl: ipv6_getaddrinfo() could not look up" in my MasterLog, but I have ENABLE_IPV6 disabled.

Thanx
steve

Stephen C. Upton
Faculty Associate - Research
SEED (Simulation Experiments & Efficient Designs) Center for Data Farming
Operations Research Department
Naval Postgraduate School
Mobile: 804-994-4257
NIPR: scupton@xxxxxxx
SIPR: uptonsc@xxxxxxxxxxxxxxxxx
SEED Center website: https://harvest.nps.edu