Subject: Re: [Condor-users] Schedd daemon problems on Windows, Condor 7.8.2
Thank you. I will work on collecting the information (I will try to wait
for the submit machines to drop) and get back to you. Just about all of
the machines in my pool (~100) are 7.8.2. There are definitely a handful
that are 7.6.6, but the central manager and submit machines are 7.8.2.
thanks again,
mike
From: Tim St Clair <tstclair@xxxxxxxxxx>
To: Condor-Users Mail List <condor-users@xxxxxxxxxxx>
Date: 10/04/2012 08:02 AM
Subject: Re: [Condor-users] Schedd daemon problems on Windows, Condor 7.8.2
Sent by: condor-users-bounces@xxxxxxxxxxx
inline below
----- Original Message -----
> From: "Michael O'Donnell" <odonnellm@xxxxxxxx>
> To: condor-users@xxxxxxxxxxx
> Sent: Thursday, October 4, 2012 7:43:37 AM
> Subject: [Condor-users] Schedd daemon problems on Windows, Condor
> 7.8.2
> I have a Windows pool via Condor 7.8.2 and I have about 6 submit
> machines. After updating to 7.8.2 I started noticing that the
> central manager was not tracking the schedd machines well.
So your entire pool is now 7.8.2?
> I am also
> noticing errors in the schedlog files. If I restart the services on
> these machines everything works fine for hours but then at some
> point they drop out of the pool. The schedd daemon never crashes but
> the central manager cannot identify them. If I reboot, they show
> back up but within 12 hours the central manager drops them again.
> Here is an error message I found in the schedlog for one of the
> submit machines.
> 10/03/12 08:14:56 (pid:7848) History file rotation is enabled.
> 10/03/12 08:14:56 (pid:7848) Maximum history file size is: 4000000
> bytes
> 10/03/12 08:14:56 (pid:7848) Number of rotated history files is: 5
> 10/03/12 08:14:56 (pid:7848) LISTEN <159.xxx.xxx.xxx:62494> fd=708
> 10/03/12 08:14:56 (pid:7848) my_popen: CreateProcess failed
> 10/03/12 08:14:56 (pid:7848) Failed to execute
> C:/Condor/bin/condor_shadow.std.exe, ignoring
^ Not an issue IMHO, as the standard universe is unsupported on Windows, IIRC.
> Another error I have seen:
> 10/03/12 07:14:19 (pid:3788) condor_read() failed: recv(fd=936)
> returned -1, errno = 10054 , reading 5 bytes from
> <159.xxx.xxx.xxx:51513>.
> 10/03/12 07:14:19 (pid:3788) IO: Failed to read packet header
> 10/03/12 07:14:19 (pid:3788) Stream::get(int) failed to read padding
> 10/03/12 07:14:19 (pid:3788) Socket activated, but could not read
> command
> 10/03/12 07:14:19 (pid:3788) (Negotiator probably invalidated cached
> socket)
> 10/03/12 07:14:19 (pid:3788) CLOSE <159.xxx.xxx.xx:59120> fd=936
^^ This seems to point toward the issue you are seeing. Could you capture
a fuller log snapshot with D_FULLDEBUG enabled?
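
While collecting the logs, it may help to pull out just the condor_read()
failures so their timestamps can be lined up against the times the central
manager dropped the schedd. Below is a minimal Python sketch, not part of
Condor. The regular expression is written against the single log line quoted
in this thread, so the line format is an assumption, and the names
find_read_failures and count_by_errno are made up for illustration.

```python
import re
from collections import Counter

# Pattern based on the one "condor_read() failed" line quoted in this thread.
# Real SchedLog lines may differ between versions; treat this as an assumption.
READ_FAIL = re.compile(
    r"(?P<ts>\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \(pid:(?P<pid>\d+)\) "
    r"condor_read\(\) failed: recv\(fd=(?P<fd>\d+)\) returned -?\d+, "
    r"errno = (?P<errno>\d+)"
)


def find_read_failures(lines):
    """Return one dict per condor_read() failure found in the given log lines."""
    events = []
    for line in lines:
        match = READ_FAIL.search(line)
        if match:
            events.append({
                "timestamp": match.group("ts"),
                "pid": int(match.group("pid")),
                "fd": int(match.group("fd")),
                "errno": int(match.group("errno")),
            })
    return events


def count_by_errno(events):
    """Count failures per errno value (10054 is the connection reset seen here)."""
    return Counter(event["errno"] for event in events)


# Sample input: the lines from this thread, unwrapped to one log entry per line.
SAMPLE = [
    "10/03/12 07:14:19 (pid:3788) condor_read() failed: recv(fd=936) "
    "returned -1, errno = 10054 , reading 5 bytes from <159.xxx.xxx.xxx:51513>.",
    "10/03/12 07:14:19 (pid:3788) IO: Failed to read packet header",
]

if __name__ == "__main__":
    for event in find_read_failures(SAMPLE):
        print(event["timestamp"], "pid", event["pid"], "errno", event["errno"])
```

To run it against a real log, pass an open file object instead of SAMPLE,
since find_read_failures accepts any iterable of lines. The location of the
SchedLog file depends on the local configuration.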
> Error 10054 via MSDN:
> Connection reset by peer.
> An existing connection was forcibly closed by the remote host. This
> normally results if the peer application on the remote host is
> suddenly stopped, the host is rebooted, the host or remote network
> interface is disabled, or the remote host uses a hard close (see
> setsockopt for more information on the SO_LINGER option on the
> remote socket). This error may also result if a connection was
> broken due to keep-alive activity detecting a failure while one or
> more operations are in progress. Operations that were in progress
> fail with WSAENETRESET. Subsequent operations fail with
> WSAECONNRESET.
> Is anyone else seeing this, or does anyone have thoughts as to what
> the problem could be? I am not sure if this is the schedd or the
> negotiator on the central manager. I am not seeing any errors with
> our network or on the 2 VM submit machines that I can more easily
> monitor.
> thank you!
> mike