Thanks - I hadn't changed anything else on the machine, but we've been having problems since a restart of the machine after a power outage a month ago. I'm not sure where to look regarding the DNS server, and I know that's outside the domain of HTCondor, so I'll check into that.

steve

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of John M Knoeller <johnkn@xxxxxxxxxxx>

One thought: a DNS server that is taking 30 seconds to respond to queries might explain what we are seeing.

-tj

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx>
On Behalf Of John M Knoeller

This is starting to look like something broken in your network stack that is outside of HTCondor. The schedd starts the shadow at 06:58:03, but the shadow doesn't actually begin logging until 06:58:33. The shadow doesn't do much on startup before it begins logging, but one of the things it does is read the HTCondor configuration files. Are those files on a shared file system, perhaps?

The shadow then contacts the startd at 06:58:33, but then times out 30 seconds later. During this time we see this in the StarterLog:

03/26/19 06:58:33 slot1: Changing state: Unclaimed -> Claimed
03/26/19 06:58:53 condor_read(): timeout reading 5 bytes from <192.168.2.1:56069>.

A timeout after 20 seconds. This looks a lot like your network is occasionally stalling for periods of 30 seconds, and this is leading HTCondor to time out a lot.
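If reading the configuration is the slow step, that is easy to time directly. A minimal sketch, assuming `condor_config_val` is on PATH (output is silenced so the sketch still runs, just quickly, on a machine without HTCondor):

```shell
# Time a full configuration read, roughly what the shadow does before it
# begins logging. Well under a second is normal on a local filesystem;
# ~30s would implicate a shared or remote config source.
start=$(date +%s)
condor_config_val -dump >/dev/null 2>&1
took=$(( $(date +%s) - start ))
echo "config read took ${took}s"
```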
Let's turn on some more logging and maybe get more insight into the problem. Please add this to your configuration:

SHADOW_DEBUG = $(SHADOW_DEBUG) D_CAT D_FULLDEBUG D_NETWORK
STARTD_DEBUG = $(STARTD_DEBUG) D_CAT D_FULLDEBUG D_NETWORK

Then run condor_reconfig on both the submit and execute machines and try again to run a job.

-tj

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx>
On Behalf Of Upton, Stephen (Steve) (CIV)

Here's a bit more time:

03/26/19 06:58:03 (pid:51270) Activity on stashed negotiator socket: <192.168.2.1:51109>
03/26/19 06:58:03 (pid:51270) Using negotiation protocol: NEGOTIATE
03/26/19 06:58:03 (pid:51270) Negotiating for owner: condor@xxxxxxxxxxxxxxx
03/26/19 06:58:03 (pid:51270) Lost priority - 1 jobs matched
03/26/19 06:58:03 (pid:51270) Finished negotiating for condor in local pool: 1 matched, 0 rejected
03/26/19 06:58:03 (pid:51270) Starting add_shadow_birthdate(3.0)
03/26/19 06:58:03 (pid:51270) Started shadow for job 3.0 on slot1@xxxxxxxxxxxxxxxxxx <192.168.2.1:51092?addrs=192.168.2.1-51092> for condor, (shadow pid = 92037)
03/26/19 06:58:10 (pid:51270) Number of Active Workers 0
03/26/19 06:59:03 (pid:51270) Activity on stashed negotiator socket: <192.168.2.1:51109>
03/26/19 06:59:03 (pid:51270) Using negotiation protocol: NEGOTIATE
03/26/19 06:59:03 (pid:51270) Negotiating for owner: condor@xxxxxxxxxxxxxxx
03/26/19 06:59:03 (pid:51270) Finished negotiating for condor in local pool: 0 matched, 0 rejected
03/26/19 06:59:03 (pid:51270) Shadow pid 92037 for job 3.0 exited with status 108

steve

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of John M Knoeller <johnkn@xxxxxxxxxxx>

What is happening in the SchedLog during this 20 second time period?

03/26/19 06:58:33 slot1: Changing state: Unclaimed -> Claimed
03/26/19 06:58:53 condor_read(): timeout reading 5 bytes from <192.168.2.1:56069>.

-tj

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx>
On Behalf Of Upton, Stephen (Steve) (CIV)

Follow-up: the firewall on the Mac is turned off.

steve

From: Stephen Upton <scupton@xxxxxxx>

Here you go:

03/26/19 06:58:03 slot1: Request accepted.
03/26/19 06:58:33 slot1: Remote owner is condor@xxxxxxxxxxxxxxx
03/26/19 06:58:33 slot1: State change: claiming protocol successful
03/26/19 06:58:33 slot1: Changing state: Unclaimed -> Claimed
03/26/19 06:58:53 condor_read(): timeout reading 5 bytes from <192.168.2.1:56069>.
03/26/19 06:58:53 IO: Failed to read packet header
03/26/19 06:58:53 Can't read ClaimId
03/26/19 06:59:03 slot1: Called deactivate_claim()
03/26/19 06:59:03 slot1: State change: received RELEASE_CLAIM command
03/26/19 06:59:03 slot1: Changing state and activity: Claimed/Idle -> Preempting/Vacating
03/26/19 06:59:03 slot1: State change: No preempting claim, returning to owner
03/26/19 06:59:03 slot1: Changing state and activity: Preempting/Vacating -> Owner/Idle
03/26/19 06:59:03 slot1: State change: IS_OWNER is false
03/26/19 06:59:03 slot1: Changing state: Owner -> Unclaimed

I also have the condor directory shared (or at least with group rw). I'm also running condor as user condor and not as root, if that helps.

steve

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of John M Knoeller <johnkn@xxxxxxxxxxx>

Yes, this looks a lot like a failure to authorize the connection, or possibly a firewall.

03/26/19 06:59:03 (3.0) (92037): condor_write(): Socket closed when trying to write 4096 bytes to startd slot1@xxxxxxxxxxxxxxxxxx, fd is 5
03/26/19 06:59:03 (3.0) (92037): Buf::write(): condor_write() failed
03/26/19 06:59:03 (3.0) (92037): slot1@xxxxxxxxxxxxxxxxxx: DCStartd::activateClaim: Failed to send job ClassAd to the startd

What does the StartLog and/or StarterLog show at 6:58/6:59 on reaper.ern.nps.edu?

-tj

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx>
On Behalf Of Upton, Stephen (Steve) (CIV)

TJ,

Thanks for responding!

Are you running 8.8.1 on all of the nodes?
No, but I'm only testing on my central manager now. I want to get that correct before updating my Windows nodes. The job(s) I'm submitting are targeted to the central manager.

Does the output of condor_status show all of your execute nodes?
It shows all the slots on my central manager.

Does condor_q show your jobs?
Yes. I did notice that it was taking exactly 30 seconds for condor to respond to a condor_q, condor_status, and condor_submit, so I think that helps confirm my suspicion that there is a setting somewhere I don't have correct.
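A uniform 30-second delay like this can be narrowed down by timing each tool separately: condor_status talks to the collector and condor_q to the schedd, so if both stall identically, a shared step such as name resolution is suspect. A rough sketch (output is silenced so it runs even where a tool is missing; the IP is the pool address from this thread):

```shell
# Time each client tool, then a reverse lookup of the pool address.
# If the lookup alone takes ~30s, DNS is the bottleneck, not HTCondor.
for cmd in "condor_status" "condor_q" "host 192.168.2.1"; do
  start=$(date +%s)
  $cmd >/dev/null 2>&1        # unquoted on purpose: splits "host 192.168.2.1"
  elapsed=$(( $(date +%s) - start ))
  echo "$cmd: ${elapsed}s"
done
```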
Here's the (partial) output from the ShadowLog:

03/26/19 06:58:33 Using local config sources:
03/26/19 06:58:33    /usr/local/condor/local.reaper/condor_config.local
03/26/19 06:58:33 config Macros = 70, Sorted = 70, StringBytes = 2008, TablesBytes = 1168
03/26/19 06:58:33 CLASSAD_CACHING is OFF
03/26/19 06:58:33 Daemon Log is logging: D_ALWAYS D_ERROR
03/26/19 06:58:33 Daemoncore: Listening at <0.0.0.0:56067> on TCP (ReliSock).
03/26/19 06:58:33 DaemonCore: command socket at <192.168.2.1:56067?addrs=192.168.2.1-56067&noUDP>
03/26/19 06:58:33 DaemonCore: private command socket at <192.168.2.1:56067?addrs=192.168.2.1-56067>
03/26/19 06:58:33 Initializing a VANILLA shadow for job 3.0
03/26/19 06:59:03 (3.0) (92037): condor_write(): Socket closed when trying to write 4096 bytes to startd slot1@xxxxxxxxxxxxxxxxxx, fd is 5
03/26/19 06:59:03 (3.0) (92037): Buf::write(): condor_write() failed
03/26/19 06:59:03 (3.0) (92037): slot1@xxxxxxxxxxxxxxxxxx: DCStartd::activateClaim: Failed to send job ClassAd to the startd
03/26/19 06:59:03 (3.0) (92037): Job 3.0 is being evicted from slot1@xxxxxxxxxxxxxxxxxx
03/26/19 06:59:03 (3.0) (92037): logEvictEvent with unknown reason (108), not logging.
03/26/19 06:59:03 (3.0) (92037): **** condor_shadow (condor_SHADOW) pid 92037 EXITING WITH STATUS 108

Here is a copy of my condor_config.local as well:

## Where have you installed the bin, sbin and lib condor directories?
RELEASE_DIR = /usr/local/condor

ENABLE_IPV6 = False
USE_SHARED_PORT = False

## Where is the local condor directory for each host?
## This is where the local config file(s), logs and
## spool/execute directories are located. This is the
## default for Linux and Unix systems; this is the
## default on Windows systems.
NETWORK_HOSTNAME = reaper.ern.nps.edu
NETWORK_INTERFACE = 192.168.2.1
LOCAL_DIR = /usr/local/condor/local.reaper
UID_DOMAIN = reaper.localnet
JAVA = /usr/bin/java
CONDOR_ADMIN = root@xxxxxxxxxxxxxxxxxxxxxx
MAIL = /usr/bin/mail
FILESYSTEM_DOMAIN = reaper.localnet
LOCK = /tmp/condor-lock.0.21512031190014
JAVA_MAXHEAP_ARGUMENT = -Xmx1024m
DAEMON_LIST = COLLECTOR, MASTER, NEGOTIATOR, SCHEDD, STARTD
CONDOR_HOST = 192.168.2.1
CONDOR_IDS = 503.20
NUM_CPUS = 15
SEC_DEFAULT_NEGOTIATION = OPTIONAL
HOSTALLOW_WRITE = 192.168.2.*

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of John M Knoeller <johnkn@xxxxxxxxxxx>

OK. We need a bit more information in order to figure out what is happening. Let's start with the basics.

Are you running 8.8.1 on all of the nodes?
Does the output of condor_status show all of your execute nodes?
Does condor_q show your jobs?

If the jobs are getting matches but failing to start, then the place to look is in the ShadowLog on the submit machine. Run condor_config_val SHADOW_LOG on the submit node to find out where that is. You should expect to see messages indicating that a condor_shadow has started up, and then it will identify what job it is attempting to run.
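That lookup can be scripted in one line; a sketch in which the fallback path is only a guess derived from the LOCAL_DIR value above, not something stated in this thread, for machines where the tool isn't on PATH:

```shell
# Ask HTCondor where the ShadowLog lives; fall back to an assumed
# $(LOCAL_DIR)/log path if condor_config_val isn't available.
log=$(condor_config_val SHADOW_LOG 2>/dev/null) \
  || log="/usr/local/condor/local.reaper/log/ShadowLog"
echo "ShadowLog: $log"
```

Running `tail -f` on that path while resubmitting the job shows the shadow's startup and eviction messages as they happen.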
-tj

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx>
On Behalf Of Upton, Stephen (Steve) (CIV)

Hi all,

HTCondor is running really slow, and now it's not accepting jobs, i.e., when I do a better-analyze, it matches, then subsequently rejects. I had 8.6.1 installed, removed that, and installed 8.8.1, hoping that was the problem. The central manager is on Mac OS 10, with several Windows execute nodes. I also have the Mac as an execute and submit node. I'm sure this is a configuration issue somewhere, but I can't figure out where. I do get an "init_local_hostname_impl: ipv6_getaddrinfo() could not look up" in my MasterLog, but I have ENABLE_IPV6 disabled.

Thanx
steve

Stephen C. Upton
Faculty Associate - Research
SEED (Simulation Experiments & Efficient Designs) Center for Data Farming
Operations Research Department
Naval Postgraduate School
Mobile: 804-994-4257
NIPR: scupton@xxxxxxx
SIPR: uptonsc@xxxxxxxxxxxxxxxxx
SEED Center website: https://harvest.nps.edu