Still seeing a timeout. In the ShadowLog, a 20-second gap:
03/26/19 08:55:37 (D_NETWORK) condor_read(fd=7 <192.168.2.1:56589>,,size=5,timeout=20,flags=0,non_blocking=0)
03/26/19 08:55:37 (D_NETWORK) condor_read(): fd=7
03/26/19 08:55:57 (D_NETWORK) condor_read(): select returned 0
03/26/19 08:55:57 (D_ALWAYS) condor_read(): timeout reading 5 bytes from <192.168.2.1:56589>.
And a 20 second gap in the StarterLog, followed by a timeout.
03/26/19 08:55:37 (D_NETWORK) condor_read(fd=7 <192.168.2.1:56589>,,size=5,timeout=20,flags=0,non_blocking=0)
03/26/19 08:55:37 (D_NETWORK) condor_read(): fd=7
03/26/19 08:55:57 (D_NETWORK) condor_read(): select returned 0
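Those gaps can be confirmed mechanically by parsing the log timestamps. A minimal sketch (sample lines copied from the StarterLog excerpt above; HTCondor's default %m/%d/%y %H:%M:%S timestamp format is assumed):

```python
from datetime import datetime

# Sample lines copied from the StarterLog excerpt above.
LINES = [
    "03/26/19 08:55:37 (D_NETWORK) condor_read(fd=7 <192.168.2.1:56589>,,size=5,timeout=20,flags=0,non_blocking=0)",
    "03/26/19 08:55:37 (D_NETWORK) condor_read(): fd=7",
    "03/26/19 08:55:57 (D_NETWORK) condor_read(): select returned 0",
]

def gaps(lines, threshold=10):
    """Yield (seconds, line) for entries arriving threshold+ seconds after the previous one."""
    prev = None
    for line in lines:
        # The first 17 characters hold the "MM/DD/YY HH:MM:SS" timestamp.
        ts = datetime.strptime(line[:17], "%m/%d/%y %H:%M:%S")
        if prev is not None and (ts - prev).total_seconds() >= threshold:
            yield (ts - prev).total_seconds(), line
        prev = ts

for secs, line in gaps(LINES):
    print(f"{secs:.0f}s gap before: {line}")
```

Pointing the same loop at a full log file would show whether the stalls happen at regular intervals or only around claim activation.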
Let's try turning the logging up even further. Please change your config to this, reconfig, and retry:
SHADOW_DEBUG = $(SHADOW_DEBUG) D_CAT D_FULLDEBUG D_NETWORK D_HOSTNAME D_SECURITY:1
STARTD_DEBUG = $(STARTD_DEBUG) D_CAT D_FULLDEBUG D_NETWORK D_HOSTNAME D_SECURITY:1
Ok, I added those settings to my condor_config.local. I also noticed that when I installed the new version, the condor_config file and the condor directory didn't have full permissions, so I changed the owner to user condor.
Attached are the ShadowLog and StartLog after the reconfig.
This is starting to look like something broken in your network stack that is outside of HTCondor. The schedd starts the shadow at 06:58:03,
but the shadow doesn't actually begin logging until 06:58:33. The shadow doesn't do much on startup before it begins logging, but one of the things
it does is read the HTCondor configuration files. Are those files on a shared file system, perhaps?
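If they are, one hypothetical quick check is to time a full read of the local config file from the submit machine; a healthy local disk is effectively instant, while a stalling shared mount would show up here directly. A sketch (the path below is the one appearing in your logs; adjust for your install):

```python
import time

def read_seconds(path):
    """Return wall-clock seconds taken to read `path` in full (a crude shared-FS latency probe)."""
    start = time.monotonic()
    with open(path, "rb") as f:
        f.read()
    return time.monotonic() - start

# Path as it appears in the ShadowLog; uncomment to run on the submit machine:
# print(f"{read_seconds('/usr/local/condor/local.reaper/condor_config.local'):.2f}s")
```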
The shadow then contacts the startd at 06:58:33, but times out 30 seconds later. During this time we see this in the StarterLog:
03/26/19 06:58:33 slot1: Changing state: Unclaimed -> Claimed
03/26/19 06:58:53 condor_read(): timeout reading 5 bytes from <192.168.2.1:56069>.
A timeout after 20 seconds. This looks a lot like your network is occasionally stalling for periods of 30 seconds, and this is leading HTCondor to time out a lot.
Let's turn on some more logging and maybe get more insight into the problem. Please add this to your configuration:
SHADOW_DEBUG = $(SHADOW_DEBUG) D_CAT D_FULLDEBUG D_NETWORK
STARTD_DEBUG = $(STARTD_DEBUG) D_CAT D_FULLDEBUG D_NETWORK
Then run condor_reconfig on both submit and execute machines and try again to run a job.
03/26/19 06:58:03 (pid:51270) Activity on stashed negotiator socket: <192.168.2.1:51109>
03/26/19 06:58:03 (pid:51270) Using negotiation protocol: NEGOTIATE
03/26/19 06:58:03 (pid:51270) Lost priority - 1 jobs matched
03/26/19 06:58:03 (pid:51270) Finished negotiating for condor in local pool: 1 matched, 0 rejected
03/26/19 06:58:03 (pid:51270) Starting add_shadow_birthdate(3.0)
03/26/19 06:58:03 (pid:51270) Started shadow for job 3.0 on slot1@xxxxxxxxxxxxxxxxxx <192.168.2.1:51092?addrs=192.168.2.1-51092> for condor, (shadow pid = 92037)
03/26/19 06:58:10 (pid:51270) Number of Active Workers 0
03/26/19 06:59:03 (pid:51270) Activity on stashed negotiator socket: <192.168.2.1:51109>
03/26/19 06:59:03 (pid:51270) Using negotiation protocol: NEGOTIATE
03/26/19 06:59:03 (pid:51270) Finished negotiating for condor in local pool: 0 matched, 0 rejected
03/26/19 06:59:03 (pid:51270) Shadow pid 92037 for job 3.0 exited with status 108
What is happening in the SchedLog during this 20 second time period?
03/26/19 06:58:33 slot1: Changing state: Unclaimed -> Claimed
03/26/19 06:58:53 condor_read(): timeout reading 5 bytes from <192.168.2.1:56069>.
Follow-up: firewall on the Mac is turned off.
03/26/19 06:58:03 slot1: Request accepted.
03/26/19 06:58:33 slot1: State change: claiming protocol successful
03/26/19 06:58:33 slot1: Changing state: Unclaimed -> Claimed
03/26/19 06:58:53 condor_read(): timeout reading 5 bytes from <192.168.2.1:56069>.
03/26/19 06:58:53 IO: Failed to read packet header
03/26/19 06:58:53 Can't read ClaimId
03/26/19 06:59:03 slot1: Called deactivate_claim()
03/26/19 06:59:03 slot1: State change: received RELEASE_CLAIM command
03/26/19 06:59:03 slot1: Changing state and activity: Claimed/Idle -> Preempting/Vacating
03/26/19 06:59:03 slot1: State change: No preempting claim, returning to owner
03/26/19 06:59:03 slot1: Changing state and activity: Preempting/Vacating -> Owner/Idle
03/26/19 06:59:03 slot1: State change: IS_OWNER is false
03/26/19 06:59:03 slot1: Changing state: Owner -> Unclaimed
I also have the condor directory shared (or at least with group rw). I'm also running condor as user condor and not as root, if that helps.
Yes, this looks a lot like a failure to authorize the connection, or possibly a firewall.
03/26/19 06:59:03 (3.0) (92037): condor_write(): Socket closed when trying to write 4096 bytes to startd slot1@xxxxxxxxxxxxxxxxxx,
fd is 5
03/26/19 06:59:03 (3.0) (92037): Buf::write(): condor_write() failed
03/26/19 06:59:03 (3.0) (92037): slot1@xxxxxxxxxxxxxxxxxx:
DCStartd::activateClaim: Failed to send job ClassAd to the startd
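One hypothetical way to separate an authorization failure from a firewall problem is a raw TCP connect from the submit machine to the startd's command port; if even a bare connect stalls or fails, HTCondor's security layer is not the culprit. A sketch (host and port taken from the log excerpts above; adjust to your setup):

```python
import socket

def can_connect(host, port, timeout=5):
    """Return True if a plain TCP connection to (host, port) succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Startd address as seen in the log excerpts; uncomment to run:
# print(can_connect("192.168.2.1", 51092))
```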
Are you running 8.8.1 on all of the nodes? No, but I'm only testing on my central manager now. I want to get that correct before updating my Windows nodes. The job(s) I'm submitting are targeted to the central manager.
Does the output of condor_status show all of your execute nodes? It shows all the slots on my central manager.
Does condor_q show your jobs? Yes.
I did notice that it was taking exactly 30 seconds for condor to respond to condor_q, condor_status, and condor_submit, so I think that helps confirm my suspicion that there is a setting somewhere I don't have correct.
Here's the (partial) output from the ShadowLog:
03/26/19 06:58:33 Using local config sources:
03/26/19 06:58:33 /usr/local/condor/local.reaper/condor_config.local
03/26/19 06:58:33 config Macros = 70, Sorted = 70, StringBytes = 2008, TablesBytes = 1168
03/26/19 06:58:33 CLASSAD_CACHING is OFF
03/26/19 06:58:33 Daemon Log is logging: D_ALWAYS D_ERROR
03/26/19 06:58:33 Daemoncore: Listening at <0.0.0.0:56067> on TCP (ReliSock).
03/26/19 06:58:33 DaemonCore: command socket at <192.168.2.1:56067?addrs=192.168.2.1-56067&noUDP>
03/26/19 06:58:33 DaemonCore: private command socket at <192.168.2.1:56067?addrs=192.168.2.1-56067>
03/26/19 06:58:33 Initializing a VANILLA shadow for job 3.0
03/26/19 06:59:03 (3.0) (92037): condor_write(): Socket closed when trying to write 4096 bytes to startd slot1@xxxxxxxxxxxxxxxxxx,
fd is 5
03/26/19 06:59:03 (3.0) (92037): Buf::write(): condor_write() failed
03/26/19 06:59:03 (3.0) (92037): slot1@xxxxxxxxxxxxxxxxxx:
DCStartd::activateClaim: Failed to send job ClassAd to the startd
03/26/19 06:59:03 (3.0) (92037): logEvictEvent with unknown reason (108), not logging.
03/26/19 06:59:03 (3.0) (92037): **** condor_shadow (condor_SHADOW) pid 92037 EXITING WITH STATUS 108
Here is a copy of my condor_config.local as well.
## Where have you installed the bin, sbin and lib condor directories?
RELEASE_DIR = /usr/local/condor
## Where is the local condor directory for each host? This is where the
## local config file(s), logs and spool/execute directories are located.
## This is the default for Linux and Unix systems.
## This is the default on Windows systems.
NETWORK_INTERFACE = 192.168.2.1
LOCAL_DIR = /usr/local/condor/local.reaper
UID_DOMAIN = reaper.localnet
FILESYSTEM_DOMAIN = reaper.localnet
LOCK = /tmp/condor-lock.0.21512031190014
JAVA_MAXHEAP_ARGUMENT = -Xmx1024m
DAEMON_LIST = COLLECTOR, MASTER, NEGOTIATOR, SCHEDD, STARTD
CONDOR_HOST = 192.168.2.1
SEC_DEFAULT_NEGOTIATION = OPTIONAL
HOSTALLOW_WRITE = 192.168.2.*
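One aside on that last knob: HOSTALLOW_WRITE is the old pre-8.x spelling and is still honored, but current releases prefer the ALLOW_WRITE name. If you want to modernize it, the equivalent setting would be:
ALLOW_WRITE = 192.168.2.*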
Ok. We need a bit more information in order to figure out what is happening. Let's start with the basics.
Are you running 8.8.1 on all of the nodes?
Does the output of condor_status, show all of your execute nodes?
Does condor_q show your jobs?
If the jobs are getting matched but failing to start, then the place to look is the ShadowLog
on the submit machine. Run
condor_config_val shadow_log
on the submit node to find out where that is. You should expect to see messages indicating that a condor_shadow
has started up, and then it will identify which job it is attempting to run.
HTCondor is running really slow, and now it's not accepting jobs; i.e., when I do a better-analyze, the job matches, then is subsequently rejected. I had 8.6.1 installed, removed that, and installed 8.8.1, hoping that was the problem. The central
manager is on Mac OS 10, with several Windows execute nodes. I also have the Mac as an execute and submit node. I'm sure this is a configuration issue somewhere, but I can't figure out where. I do get an "init_local_hostname_impl: ipv6_getaddrinfo() could not
look up" in my MasterLog, but I have ENABLE_IPV6 disabled.
Faculty Associate - Research
SEED (Simulation Experiments & Efficient Designs) Center for Data Farming
Operations Research Department
Naval Postgraduate School