Subject: Re: [Condor-users] Schedd daemon problems on Windows, Condor 7.8.2
Tim, I am still monitoring and trying to
track the problem down. The schedd on my central manager threw the following
core.SCHEDD, so maybe this will help.
Thanks for any assistance you can provide.
I will try to continue monitoring, but I am not picking up too many errors.
Everything was working, and then sometime in the last 8 hours (9pm-5am) 4
of my submit machines dropped off the schedd list (condor_status -schedd),
and I can't find any errors in the log files during this time.
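
One way to pin down when a submit machine disappears is to poll the schedd list and log the difference between snapshots. The following is only a rough sketch: it assumes condor_status is on the PATH of the machine running it, and the helper names (list_schedds, dropped, watch) are made up for illustration.

```python
import subprocess
import time
from datetime import datetime


def list_schedds():
    """Return the set of schedd names the collector currently reports."""
    # Assumes condor_status is on PATH; prints one Name attribute per line.
    out = subprocess.run(
        ["condor_status", "-schedd", "-format", "%s\n", "Name"],
        capture_output=True, text=True, check=True,
    ).stdout
    return {line.strip() for line in out.splitlines() if line.strip()}


def dropped(previous, current):
    """Names present in the previous snapshot but missing from the current one."""
    return sorted(previous - current)


def watch(interval_seconds=60):
    """Poll forever and print a timestamp whenever a schedd goes missing."""
    previous = list_schedds()
    while True:
        time.sleep(interval_seconds)
        current = list_schedds()
        for name in dropped(previous, current):
            print(f"{datetime.now().isoformat()} schedd dropped: {name}")
        previous = current
```

Calling watch() on the central manager overnight would give a timestamp to line up against the MasterLog and SchedLog entries on the affected submit machine.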
I did find this core dump (I have not
seen this before, and it happened last night on the central manager).
I also noticed a couple of different errors in my schedd and master log files
on submit machines, but these errors do not seem to be thrown when the
central manager loses track of the submit machines (this is what I have
had a difficult time tracking; I have enabled all logging for now).
IP2 is a server within my organization
that scans our machines. These scanning servers do not have access to
the pool itself, but they can scan all hardware. I am working with my IT
group to see if they can help on this front.
This stood out, but I could not find a reference in Condor and I do not
have SOAP enabled:
10/09/12 06:22:36 Received HTTP GET
connection from <IP2.12:36427> -- DENIED because ENABLE_WEB_SERVER=FALSE
CDRS2 masterLog (I also see the
same error in the ScheddLog on the same machine):
10/09/12 06:18:21 Time stamp of running
C:/Condor/bin/condor_master.exe: 1344461336
10/09/12 06:18:21 GetTimeStamp returned:
1344461336
10/09/12 06:18:21 Return from Timer
handler 10 (Daemons::CheckForNewExecutable())
10/09/12 06:18:22 Calling Timer handler
6 (KillFamily::takesnapshot)
10/09/12 06:18:22 Return from Timer
handler 6 (KillFamily::takesnapshot)
10/09/12 06:18:22 Calling Timer handler
7 (KillFamily::takesnapshot)
10/09/12 06:18:22 Return from Timer
handler 7 (KillFamily::takesnapshot)
10/09/12 06:18:51 ACCEPT bound to <IP.52:62187>
fd=540 peer=<IP2.12:33239>
10/09/12 06:18:51 Calling Handler <DaemonCommandProtocol::WaitForSocketData>
(2)
10/09/12 06:18:51 condor_read(fd=540
<IP2.12:33239>,,size=4,timeout=1,flags=2)
10/09/12 06:18:51 condor_read(): fd=540
10/09/12 06:18:51 condor_read(): select
returned 1
10/09/12 06:18:51 condor_read() failed:
recv(fd=540) returned -1, errno = 10054 , reading 4 bytes from <IP2.12:33239>.
10/09/12 06:18:51 condor_read(fd=540
<IP2.12:33239>,,size=5,timeout=1,flags=0)
10/09/12 06:18:51 condor_read(): fd=540
10/09/12 06:18:51 condor_read(): select
returned 1
10/09/12 06:18:51 condor_read() failed:
recv(fd=540) returned -1, errno = 10054 , reading 5 bytes from <IP2.12:33239>.
10/09/12 06:18:51 IO: Failed to read
packet header
10/09/12 06:18:51 Stream::get(int) failed
to read padding
10/09/12 06:18:51 DaemonCore: Can't
receive command request from IP2.12 (perhaps a timeout?)
10/09/12 06:18:51 CLOSE <IP.52:62187>
fd=540
10/09/12 06:18:51 Return from Handler
<DaemonCommandProtocol::WaitForSocketData> 0.0000s
10/09/12 06:19:09 Calling Timer handler
8 (Daemons::UpdateCollector())
10/09/12 06:19:09 enter Daemons::UpdateCollector
10/09/12 06:19:09 Trying to update collector
<IP.145:9618>
10/09/12 06:19:09 Attempting to send
update via UDP to collector CENTRALMANAGER.gs.doi.net <IP.145:9618>
10/09/12 06:19:09 SECMAN: command 2
UPDATE_MASTER_AD to collector CENTRALMANAGER.gs.doi.net from UDP port 54871
(non-blocking).
10/09/12 06:19:09 SECMAN: using session
CentralManager:4444:1349777948:1721 for {<IP.145:9618>,<2>}.
10/09/12 06:19:09 SECMAN: found cached
session id CentralManager:4444:1349777948:1721 for {<IP.145:9618>,<2>}.
Authentication = "YES"
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
10/09/12 06:22:36 Received HTTP GET
connection from <IP2.12:36427> -- DENIED because ENABLE_WEB_SERVER=FALSE
10/09/12 06:22:36 condor_read(fd=1160
<IP2.12:36427>,,size=5,timeout=1,flags=0)
10/09/12 06:22:36 condor_read(): fd=1160
10/09/12 06:22:36 condor_read(): select
returned 1
10/09/12 06:22:36 IO: Incoming packet
header unrecognized
10/09/12 06:22:36 Stream::get(int) failed
to read padding
10/09/12 06:22:36 DaemonCore: Can't
receive command request from IP2.12 (perhaps a timeout?)
10/09/12 06:22:36 CLOSE <IP.52:62187>
fd=1160
10/09/12 06:22:36 Calling Handler <DaemonCommandProtocol::WaitForSocketData>
(2)
10/09/12 06:22:36 condor_read(fd=312
<IP2.12:36425>,,size=4,timeout=1,flags=2)
10/09/12 06:22:36 condor_read(): fd=312
10/09/12 06:22:36 condor_read(): select
returned 1
10/09/12 06:22:36 condor_read() failed:
recv(fd=312) returned -1, errno = 10054 , reading 4 bytes from <IP2.12:36425>.
10/09/12 06:22:36 condor_read(fd=312
<IP2.12:36425>,,size=5,timeout=1,flags=0)
10/09/12 06:22:36 condor_read(): fd=312
10/09/12 06:22:36 condor_read(): select
returned 1
10/09/12 06:22:36 condor_read() failed:
recv(fd=312) returned -1, errno = 10054 , reading 5 bytes from <IP2.12:36425>.
10/09/12 06:22:36 IO: Failed to read
packet header
10/09/12 06:22:36 Stream::get(int) failed
to read padding
10/09/12 06:22:36 DaemonCore: Can't
receive command request from IP2.12 (perhaps a timeout?)
10/09/12 06:22:36 CLOSE <IP.52:62187>
fd=312
10/09/12 06:22:36 Return from Handler
<DaemonCommandProtocol::WaitForSocketData> 0.0000s
10/09/12 06:23:09 ACCEPT bound to <IP.52:62187>
fd=312 peer=<IP2.12:36764>
10/09/12 06:23:09 Calling Handler <DaemonCommandProtocol::WaitForSocketData>
(2)
10/09/12 06:23:09 condor_read(fd=312
<IP2.12:36764>,,size=4,timeout=1,flags=2)
10/09/12 06:23:09 condor_read(): fd=312
10/09/12 06:23:09 condor_read(): select
returned 1
10/09/12 06:23:09 condor_read() failed:
recv(fd=312) returned -1, errno = 10054 , reading 4 bytes from <IP2.12:36764>.
10/09/12 06:23:09 condor_read(fd=312
<IP2.12:36764>,,size=5,timeout=1,flags=0)
10/09/12 06:23:09 condor_read(): fd=312
10/09/12 06:23:09 condor_read(): select
returned 1
10/09/12 06:23:09 condor_read() failed:
recv(fd=312) returned -1, errno = 10054 , reading 5 bytes from <IP2.12:36764>.
10/09/12 06:23:09 IO: Failed to read
packet header
10/09/12 06:23:09 Stream::get(int) failed
to read padding
10/09/12 06:23:09 DaemonCore: Can't
receive command request from IP2.12 (perhaps a timeout?)
10/09/12 06:23:09 CLOSE <IP.52:62187>
fd=312
10/09/12 06:23:09 Return from Handler
<DaemonCommandProtocol::WaitForSocketData> 0.0000s
10/09/12 06:23:16 Calling Timer handler
7141 (dc_touch_log_file)
10/09/12 06:23:16 Return from Timer
handler 7141 (dc_touch_log_file)
10/09/12 06:23:21 Calling Timer handler
2 (check_session_cache)
10/09/12 06:23:21 Return from Timer
handler 2 (check_session_cache)
10/09/12 06:23:21 Calling Timer handler
10 (Daemons::CheckForNewExecutable())
10/09/12 06:23:21 enter Daemons::CheckForNewExecutable
10/09/12 06:23:21 Time stamp of running
C:/Condor/bin/condor_master.exe: 1344461336
10/09/12 06:23:21 GetTimeStamp returned:
1344461336
10/09/12 06:23:21 Return from Timer
handler 10 (Daemons::CheckForNewExecutable())
10/09/12 06:23:23 Calling Timer handler
6 (KillFamily::takesnapshot)
10/09/12 06:23:23 Return from Timer
handler 6 (KillFamily::takesnapshot)
10/09/12 06:23:23 Calling Timer handler
7 (KillFamily::takesnapshot)
10/09/12 06:23:23 Return from Timer
handler 7 (KillFamily::takesnapshot)
10/09/12 06:24:09 Calling Timer handler
8 (Daemons::UpdateCollector())
10/09/12 06:24:09 enter Daemons::UpdateCollector
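
To check whether the connection resets all come from the scanning host, the errno 10054 entries can be tallied per peer address. This is a minimal sketch: it assumes each entry sits on a single line in the log file on disk (the entries above are wrapped by the mail client), and the function name count_resets is made up for illustration.

```python
import re
from collections import Counter

# Matches entries such as:
# 10/09/12 06:18:51 condor_read() failed: recv(fd=540) returned -1,
#   errno = 10054 , reading 4 bytes from <IP2.12:33239>.
# The "(pid:NNNN)" field is optional because only some logs include it.
RESET_RE = re.compile(
    r"^(?P<stamp>\d\d/\d\d/\d\d \d\d:\d\d:\d\d) "
    r"(?:\(pid:\d+\) )?"
    r"condor_read\(\) failed: recv\(fd=\d+\) returned -1, "
    r"errno = 10054 , reading \d+ bytes from <(?P<host>[^:>]+):\d+>"
)


def count_resets(lines):
    """Count 'connection reset by peer' read failures per peer host."""
    counts = Counter()
    for line in lines:
        match = RESET_RE.match(line)
        if match:
            counts[match.group("host")] += 1
    return counts
```

Feeding it a log, for example count_resets(open(path)) with path pointing at the MasterLog, gives a per-host tally. If nearly every reset comes from the scanner address, those entries are probably noise rather than the cause of the dropped schedds.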
From: Tim St Clair <tstclair@xxxxxxxxxx>
To: Condor-Users Mail List <condor-users@xxxxxxxxxxx>
Date: 10/04/2012 08:02 AM
Subject: Re: [Condor-users] Schedd daemon problems on Windows, Condor 7.8.2
Sent by: condor-users-bounces@xxxxxxxxxxx
inline below
----- Original Message -----
> From: "Michael O'Donnell" <odonnellm@xxxxxxxx>
> To: condor-users@xxxxxxxxxxx
> Sent: Thursday, October 4, 2012 7:43:37 AM
> Subject: [Condor-users] Schedd daemon problems on Windows, Condor
> 7.8.2
> I have a Windows pool via Condor 7.8.2 and I have about 6 submit
> machines. After updating to 7.8.2 I started noticing that the
> central manager was not tracking the schedd machines well.
So your entire pool is now 7.8.2?
> I am also
> noticing errors in the schedlog files. If I restart the services on
> these machines everything works fine for hours but then at some
> point they drop out of the pool. The schedd daemon never crashes but
> the central manager cannot identify them. If I reboot, they show
> back up but within 12 hours the central manager drops them again.
> Here is an error message I found in the schedlog for one of the
> submit machines.
> 10/03/12 08:14:56 (pid:7848) History file rotation is enabled.
> 10/03/12 08:14:56 (pid:7848) Maximum history file size is: 4000000
> bytes
> 10/03/12 08:14:56 (pid:7848) Number of rotated history files is: 5
> 10/03/12 08:14:56 (pid:7848) LISTEN <159.xxx.xxx.xxx:62494> fd=708
> 10/03/12 08:14:56 (pid:7848) my_popen: CreateProcess failed
> 10/03/12 08:14:56 (pid:7848) Failed to execute
> C:/Condor/bin/condor_shadow.std.exe, ignoring
^ not an issue imho as std universe is unsupported on windows iirc.
> Another error I have seen:
> 10/03/12 07:14:19 (pid:3788) condor_read() failed: recv(fd=936)
> returned -1, errno = 10054 , reading 5 bytes from
> <159.xxx.xxx.xxx:51513>.
> 10/03/12 07:14:19 (pid:3788) IO: Failed to read packet header
> 10/03/12 07:14:19 (pid:3788) Stream::get(int) failed to read padding
> 10/03/12 07:14:19 (pid:3788) Socket activated, but could not read
> command
> 10/03/12 07:14:19 (pid:3788) (Negotiator probably invalidated cached
> socket)
> 10/03/12 07:14:19 (pid:3788) CLOSE <159.xxx.xxx.xx:59120> fd=936
^^ This seems to point towards the issue you are seeing. Could you capture
a fuller log snapshot with D_FULLDEBUG?
> Error 10054 via MSDN:
> Connection reset by peer.
> An existing connection was forcibly closed by the remote host. This
> normally results if the peer application on the remote host is
> suddenly stopped, the host is rebooted, the host or remote network
> interface is disabled, or the remote host uses a hard close (see
> setsockopt for more information on the SO_LINGER option on the
> remote socket). This error may also result if a connection was
> broken due to keep-alive activity detecting a failure while one or
> more operations are in progress. Operations that were in progress
> fail with WSAENETRESET. Subsequent operations fail with
> WSAECONNRESET.
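
The "hard close" case described above can be reproduced on loopback, which may help in recognizing what a scanner that connects and immediately aborts looks like from the daemon's side. This is only a sketch with a made-up function name (provoke_reset); the SO_LINGER struct layout per platform is an assumption noted in the comments.

```python
import socket
import struct
import sys


def provoke_reset():
    """Return the exception recv() raises after the peer does a hard close."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # any free loopback port
    listener.listen(1)

    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.connect(listener.getsockname())
    server_side, _ = listener.accept()
    server_side.settimeout(5)  # fail with a timeout rather than hang

    # SO_LINGER with l_onoff=1 and l_linger=0 makes close() abort the
    # connection with a reset instead of a normal shutdown.
    # Assumption: the struct is two unsigned shorts on Windows, two ints elsewhere.
    fmt = "HH" if sys.platform == "win32" else "ii"
    client.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                      struct.pack(fmt, 1, 0))
    client.close()

    try:
        server_side.recv(4)  # same 4-byte read the daemon attempts
        return None
    except ConnectionResetError as exc:
        return exc
    finally:
        server_side.close()
        listener.close()
```

The read fails with "connection reset by peer", the condition Winsock reports as 10054, even though nothing is wrong with the receiving daemon itself.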
> Is anyone else seeing this, or does anyone have thoughts as to what the
> problem could be? I am not sure if this is the schedd or negotiator on the
> central manager. I am not seeing any errors with our network or for
> the 2 VM submit machines that I can more easily monitor.
> thank you!
> mike