Hi,
I've been trying to install HTCondor
(condor-8.2.1-256063-Windows-x86.msi) on my Windows 7 computer
at work, and I've been stuck at the point where jobs I submit
never start.
Before giving more details on the problem, I just want to
point out that there is a typo in the default condor_config file,
which reads:
CONDOR_HOST = $$(FULL_HOSTNAME)
instead of:
CONDOR_HOST = $(FULL_HOSTNAME)
(with a single $)
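
For anyone who wants to check their own condor_config for the same thing, here is a minimal Python sketch that lists non-comment lines containing "$$(". The function name and the sample text are made up for illustration (the COLLECTOR_HOST line is only there as a contrast case); it only does a text search and makes no claim about how HTCondor itself parses the file.

```python
import re

# Matches the two-dollar form that opens a macro reference, e.g. "$$(FULL_HOSTNAME)".
DOUBLE_DOLLAR = re.compile(r"\$\$\(")


def find_double_dollar(config_text):
    """Return (line_number, line) pairs for non-comment lines containing '$$('."""
    hits = []
    for number, line in enumerate(config_text.splitlines(), start=1):
        stripped = line.strip()
        if stripped.startswith("#"):
            continue  # skip comment lines
        if DOUBLE_DOLLAR.search(stripped):
            hits.append((number, stripped))
    return hits


# Illustrative sample only; the second line is a hypothetical contrast case.
sample = "CONDOR_HOST = $$(FULL_HOSTNAME)\nCOLLECTOR_HOST = $(CONDOR_HOST)\n"

if __name__ == "__main__":
    for number, line in find_double_dollar(sample):
        print(number, line)
```

To run it against a real file, read the file's contents into a string and pass that to find_double_dollar().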
...
022 (009.000.000) 07/29 22:52:50 Job disconnected, attempting to reconnect
    Socket between submit and execute hosts closed unexpectedly
...
024 (009.000.000) 07/29 22:52:50 Job reconnection failed
    Job not found at execution machine
In the ShadowLog, I have this error:
07/29/14 22:52:49 ******************************************************
07/29/14 22:52:49 Using config source: D:\condor\condor_config
07/29/14 22:52:49 Using local config sources:
07/29/14 22:52:49 D:\condor\condor_config.local
07/29/14 22:52:49 config Macros = 42, Sorted = 42, StringBytes = 743, TablesBytes = 360
07/29/14 22:52:49 CLASSAD_CACHING is OFF
07/29/14 22:52:49 Daemon Log is logging: D_ALWAYS D_ERROR
07/29/14 22:52:49 Initializing a VANILLA shadow for job 9.0
07/29/14 22:52:50 (9.0) (1272): IO: Failed to read packet header
07/29/14 22:52:50 (9.0) (1272): Trying to reconnect to disconnected job
07/29/14 22:52:50 (9.0) (1272): LastJobLeaseRenewal: 1406688769 Tue Jul 29 22:52:49 2014
07/29/14 22:52:50 (9.0) (1272): JobLeaseDuration: 1200 seconds
07/29/14 22:52:50 (9.0) (1272): JobLeaseDuration remaining: 1199
07/29/14 22:52:50 (9.0) (1272): Attempting to locate disconnected starter
07/29/14 22:52:50 (9.0) (1272): Reconnect FAILED: Job not found at execution machine
07/29/14 22:52:50 (9.0) (1272): **** condor_shadow (condor_SHADOW) pid 1272 EXITING WITH STATUS 107
07/29/14 22:53:49 ******************************************************
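
As a sanity check on the lease numbers in that ShadowLog, the short Python sketch below decodes the LastJobLeaseRenewal epoch value and redoes the "remaining" arithmetic. The UTC-4 offset is an assumption inferred from the log's own timestamp, and the subtraction is just arithmetic that is consistent with the logged value, not a statement of how condor_shadow computes it internally.

```python
from datetime import datetime, timedelta, timezone

LAST_RENEWAL = 1406688769  # LastJobLeaseRenewal from the ShadowLog (epoch seconds)
LEASE_DURATION = 1200      # JobLeaseDuration from the ShadowLog (seconds)

# Assumption: the machine's local time was UTC-4 when the log was written.
ASSUMED_TZ = timezone(timedelta(hours=-4))


def lease_remaining(now_epoch):
    """Seconds left on the lease at now_epoch, by simple subtraction."""
    return LEASE_DURATION - (now_epoch - LAST_RENEWAL)


renewal = datetime.fromtimestamp(LAST_RENEWAL, tz=ASSUMED_TZ)

if __name__ == "__main__":
    print(renewal.isoformat())                # 2014-07-29T22:52:49-04:00
    print(lease_remaining(LAST_RENEWAL + 1))  # 1199, one second after renewal
```

So the lease was renewed at 22:52:49 and the reconnect attempt gave up one second later, with nearly the whole 1200-second lease still available.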
Finally, I enabled D_ALL logging for the StartLog, which I
suspect is the main log of interest. Here is what I see
around the error (which occurs near the bottom of this excerpt):
07/29/14 22:52:49 (fd:3) (pid:1428) (D_NETWORK) condor_write(fd=684 <127.0.0.1:63243>,,size=4096,timeout=0,flags=0,non_blocking=0)
07/29/14 22:52:49 (fd:3) (pid:1428) (D_NETWORK) condor_write(fd=684 <127.0.0.1:63243>,,size=738,timeout=0,flags=0,non_blocking=0)
07/29/14 22:52:49 (fd:3) (pid:1428) (D_DAEMONCORE) In DaemonCore::Create_Process(D:\condor\bin\condor_starter.exe,...)
07/29/14 22:52:49 (fd:3) (pid:1428) (D_NETWORK) InitCommandSocket(IPv4, 1, want UDP, non-fatal errors) created <10.128.20.195:63244>
07/29/14 22:52:49 (fd:3) (pid:1428) (D_SECURITY) SECMAN: created non-negotiated security session 80303a6d948e06827975e04f9a5113d74d74f7b68318afd5 for 0 (inf) seconds.
07/29/14 22:52:49 (fd:3) (pid:1428) (D_SECURITY) SECMAN: exporting session info for 80303a6d948e06827975e04f9a5113d74d74f7b68318afd5: [CurrentTime=time();Encryption="NO";Integrity="NO";CryptoMethods="3DES";]
07/29/14 22:52:49 (fd:3) (pid:1428) (D_PROCFAMILY) About to register family for PID 8160 with the ProcD
07/29/14 22:52:49 (fd:3) (pid:1428) (D_PROCFAMILY) Result of "register_subfamily" operation from ProcD: SUCCESS
07/29/14 22:52:49 (fd:3) (pid:1428) (D_DAEMONCORE) Child Process: pid 8160 at <10.128.20.195:63244> (0.00 sec)
07/29/14 22:52:49 (fd:3) (pid:1428) (D_NETWORK) CLOSE <127.0.0.1:63243> fd=1256
07/29/14 22:52:49 (fd:3) (pid:1428) (D_ALWAYS) slot1: Got universe "VANILLA" (5) from request classad
07/29/14 22:52:49 (fd:3) (pid:1428) (D_ALWAYS) slot1: State change: claim-activation protocol successful
07/29/14 22:52:49 (fd:3) (pid:1428) (D_ALWAYS) slot1: Changing activity: Idle -> Busy
07/29/14 22:52:49 (fd:3) (pid:1428) (D_DAEMONCORE) in DaemonCore NewTimer()
07/29/14 22:52:49 (fd:3) (pid:1428) (D_DAEMONCORE) leaving DaemonCore NewTimer, id=67
07/29/14 22:52:49 (fd:3) (pid:1428) (D_COMMAND) Return from HandleReq <command_activate_claim> (handler: 0.016s, sec: 0.000s, payload: 0.000s)
07/29/14 22:52:49 (fd:3) (pid:1428) (D_PRIV) PRIV_CONDOR --> PRIV_CONDOR at c:\condor\execute\dir_18128\userdir\src\condor_daemon_core.v6\daemon_core.cpp:4101
07/29/14 22:52:49 (fd:3) (pid:1428) (D_DAEMONCORE) In DaemonCore Timeout()
07/29/14 22:52:49 (fd:3) (pid:1428) (D_DAEMONCORE) DaemonCore Timeout() Complete, returning 4
07/29/14 22:52:49 (fd:3) (pid:1428) (D_ALWAYS) PERF: entering select
07/29/14 22:52:50 (fd:3) (pid:1428) (D_ALWAYS) PERF: leaving select
07/29/14 22:52:50 (fd:3) (pid:1428) (D_ALWAYS) State = FDS_READY
07/29/14 22:52:50 (fd:3) (pid:1428) (D_ALWAYS) max_fd = 1156
07/29/14 22:52:50 (fd:3) (pid:1428) (D_ALWAYS) Selection FD's
07/29/14 22:52:50 (fd:3) (pid:1428) (D_ALWAYS) Read {576 600 684 1156 } = 4
07/29/14 22:52:50 (fd:3) (pid:1428) (D_ALWAYS) Write {} = 0
07/29/14 22:52:50 (fd:3) (pid:1428) (D_ALWAYS) Except {} = 0
07/29/14 22:52:50 (fd:3) (pid:1428) (D_ALWAYS) Ready FD's
07/29/14 22:52:50 (fd:3) (pid:1428) (D_ALWAYS) Read {684 } = 1
07/29/14 22:52:50 (fd:3) (pid:1428) (D_ALWAYS) Write {} = 0
07/29/14 22:52:50 (fd:3) (pid:1428) (D_ALWAYS) Except {} = 0
07/29/14 22:52:50 (fd:3) (pid:1428) (D_ALWAYS) Timeout = 4.000000 seconds
07/29/14 22:52:50 (fd:3) (pid:1428) (D_DAEMONCORE) Calling Handler <receiveJobClassAdUpdate> for Socket <starter ClassAd update socket>
07/29/14 22:52:50 (fd:3) (pid:1428) (D_COMMAND) Calling Handler <receiveJobClassAdUpdate> (2)
07/29/14 22:52:50 (fd:3) (pid:1428) (D_NETWORK) condor_read(fd=684 <127.0.0.1:63243>,,size=5,timeout=10,flags=0,non_blocking=0)
07/29/14 22:52:50 (fd:3) (pid:1428) (D_ALWAYS) condor_read() failed: recv(fd=684) returned -1, errno = 10054 , reading 5 bytes from <127.0.0.1:63243>.
07/29/14 22:52:50 (fd:3) (pid:1428) (D_ALWAYS) IO: Failed to read packet header
07/29/14 22:52:50 (fd:3) (pid:1428) (D_NETWORK) Stream::get(int) failed to read padding
07/29/14 22:52:50 (fd:3) (pid:1428) (D_DAEMONCORE) Cancel_Socket: cancelled socket 2 <starter ClassAd update socket> 01D51760
07/29/14 22:52:50 (fd:3) (pid:1428) (D_NETWORK) CLOSE <127.0.0.1:63242> fd=684
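
The errno in the condor_read() failure above is a Winsock code: 10054 is WSAECONNRESET, meaning the other end of the connection was reset. The Python sketch below pulls the fd, return value, and errno out of such a line. The regular expression and the tiny lookup table are my own illustration, not anything shipped with HTCondor, and the table deliberately contains only the one code seen here.

```python
import re

# Only the code that appears in the StartLog excerpt is listed.
WINSOCK_NAMES = {10054: "WSAECONNRESET (connection reset by peer)"}

# Matches e.g. "recv(fd=684) returned -1, errno = 10054"
RECV_FAILURE = re.compile(r"recv\(fd=(\d+)\) returned (-?\d+), errno = (\d+)")


def parse_recv_failure(line):
    """Return a dict describing a recv() failure line, or None if it does not match."""
    match = RECV_FAILURE.search(line)
    if match is None:
        return None
    fd, returned, err = (int(group) for group in match.groups())
    return {
        "fd": fd,
        "returned": returned,
        "errno": err,
        "name": WINSOCK_NAMES.get(err, "unknown"),
    }


log_line = (
    "07/29/14 22:52:50 (fd:3) (pid:1428) (D_ALWAYS) condor_read() failed: "
    "recv(fd=684) returned -1, errno = 10054 , reading 5 bytes from <127.0.0.1:63243>."
)

if __name__ == "__main__":
    print(parse_recv_failure(log_line))
```

In other words, the startd was reading from the starter's ClassAd update socket when the starter side of that connection went away.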
I've tried many config variants, changing options such as
COLLECTOR_HOST, use SECURITY, NO_DNS, NETWORK_INTERFACE, UID_DOMAIN,
DEFAULT_DOMAIN_NAME, BIND_ALL_INTERFACES and UPDATE_COLLECTOR_WITH_TCP,
but the error is basically always the same.
This is on a single computer, which I set up to be both a
submit and an execute node.
I wonder if the problem might be that it's using 127.0.0.1.
I'm not sure why it uses that instead of 10.128.20.195, which
is the computer's IP on the network. I only mention it because
if I try to force the IP to 127.0.0.1 through NETWORK_INTERFACE,
then nothing works at all (I can't even submit a job). It's
just a wild guess though.
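
To make the 127.0.0.1 versus 10.128.20.195 observation easier to see in a long log, here is a rough Python sketch that extracts the <ip:port> addresses from log text and labels each as loopback, private, or public using the standard ipaddress module. The function names are hypothetical and the pattern only handles plain IPv4 <ip:port> forms like the ones in the excerpts above.

```python
import ipaddress
import re

# Matches bracketed IPv4 address:port pairs such as "<127.0.0.1:63243>".
ADDRESS = re.compile(r"<(\d{1,3}(?:\.\d{1,3}){3}):(\d+)>")


def describe(addr):
    """Classify an IP address string as 'loopback', 'private', or 'public'."""
    ip = ipaddress.ip_address(addr)
    if ip.is_loopback:  # checked first, since loopback also counts as private
        return "loopback"
    if ip.is_private:
        return "private"
    return "public"


def addresses_in(text):
    """Return (ip, port, kind) for every <ip:port> found in text, in order."""
    return [(ip, int(port), describe(ip)) for ip, port in ADDRESS.findall(text)]


excerpt = (
    "condor_write(fd=684 <127.0.0.1:63243>,,size=4096,timeout=0,flags=0,non_blocking=0)\n"
    "Child Process: pid 8160 at <10.128.20.195:63244> (0.00 sec)\n"
)

if __name__ == "__main__":
    for entry in addresses_in(excerpt):
        print(entry)
```

On the excerpt, this shows the update socket on the loopback address and the starter's command socket on the machine's network address.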
Thanks for any help,
-=- Olivier