
Re: [Condor-users] Schedd daemon problems on Windows, Condor 7.8.2

Date: Thu, 4 Oct 2012 08:08:20 -0600
From: "Michael O'Donnell" <odonnellm@xxxxxxxx>
Subject: Re: [Condor-users] Schedd daemon problems on Windows, Condor 7.8.2

Thank you. I will work on collecting the information (will try to wait for submit machines to drop) and get back to you. Just about all my machines in the pool (~100) are 7.8.2. There are definitely a handful that are 7.6.6 but the central manager and submit machines are 7.8.2.

thanks again,
mike

From:	Tim St Clair <tstclair@xxxxxxxxxx>
To:	Condor-Users Mail List <condor-users@xxxxxxxxxxx>
Date:	10/04/2012 08:02 AM
Subject:	Re: [Condor-users] Schedd daemon problems on Windows, Condor 7.8.2
Sent by:	condor-users-bounces@xxxxxxxxxxx

inline below

----- Original Message -----
> From: "Michael O'Donnell" <odonnellm@xxxxxxxx>
> To: condor-users@xxxxxxxxxxx
> Sent: Thursday, October 4, 2012 7:43:37 AM
> Subject: [Condor-users] Schedd daemon problems on Windows, Condor 7.8.2
>
> I have a Windows pool via Condor 7.8.2 and I have about 6 submit
> machines. After updating to 7.8.2 I started noticing that the
> central manager was not tracking the schedd machines well.

So your entire pool is now 7.8.2 ?

> I am also
> noticing errors in the schedlog files. If I restart the services on
> these machines everything works fine for hours but then at some
> point they drop out of the pool. The schedd daemon never crashes but
> the central manager cannot identify them. If I reboot, they show
> back up but within 12 hours the central manager drops them again.
>
> Here is an error message I found in the schedlog for one of the
> submit machines.
>
> 10/03/12 08:14:56 (pid:7848) History file rotation is enabled.
> 10/03/12 08:14:56 (pid:7848) Maximum history file size is: 4000000 bytes
> 10/03/12 08:14:56 (pid:7848) Number of rotated history files is: 5
> 10/03/12 08:14:56 (pid:7848) LISTEN <159.xxx.xxx.xxx:62494> fd=708
> 10/03/12 08:14:56 (pid:7848) my_popen: CreateProcess failed
> 10/03/12 08:14:56 (pid:7848) Failed to execute C:/Condor/bin/condor_shadow.std.exe, ignoring

^ not an issue imho as std universe is unsupported on windows iirc.

> Another error I have seen:
>
> 10/03/12 07:14:19 (pid:3788) condor_read() failed: recv(fd=936) returned -1, errno = 10054 , reading 5 bytes from <159.xxx.xxx.xxx:51513>.
> 10/03/12 07:14:19 (pid:3788) IO: Failed to read packet header
> 10/03/12 07:14:19 (pid:3788) Stream::get(int) failed to read padding
> 10/03/12 07:14:19 (pid:3788) Socket activated, but could not read command
> 10/03/12 07:14:19 (pid:3788) (Negotiator probably invalidated cached socket)
> 10/03/12 07:14:19 (pid:3788) CLOSE <159.xxx.xxx.xx:59120> fd=936

^^ This seems to point towards the issue you are seeing, could you capture more full log snap with D_FULLDEBUG?

> Error 10054 via MSDN:
> Connection reset by peer.
> An existing connection was forcibly closed by the remote host. This
> normally results if the peer application on the remote host is
> suddenly stopped, the host is rebooted, the host or remote network
> interface is disabled, or the remote host uses a hard close (see
> setsockopt for more information on the SO_LINGER option on the
> remote socket). This error may also result if a connection was
> broken due to keep-alive activity detecting a failure while one or
> more operations are in progress. Operations that were in progress
> fail with WSAENETRESET. Subsequent operations fail with
> WSAECONNRESET.
>
> Is anyone else seeing this or have thoughts as to what the problem
> could be. I am not sure if this is the schedd or negotiator on the
> central manager. I am not seeing an errors with our network or for
> the 2 VM submit machines that can more easily monitor.
>
> thank you!
> mike
> _______________________________________________
> Condor-users mailing list
> To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx
> with a subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
>
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/condor-users/
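While collecting the fuller logs requested above, it may help to see when the connection resets cluster in time. The following is only a rough sketch, not part of the original thread: it assumes each SchedLog entry sits on one line in the `MM/DD/YY HH:MM:SS (pid:NNNN) message` layout shown in the excerpts, and the function names `find_socket_errors` and `count_by_hour` are made up for illustration.

```python
import re
from collections import Counter

# Assumed line layout, taken from the log excerpts in this thread:
#   MM/DD/YY HH:MM:SS (pid:NNNN) message
LINE_RE = re.compile(r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \(pid:(\d+)\) (.*)$")
ERRNO_RE = re.compile(r"errno\s*=\s*(\d+)")


def find_socket_errors(lines, errno=10054):
    """Return (timestamp, pid, message) for lines reporting the given errno.

    10054 is the Winsock "connection reset by peer" code quoted in the thread.
    Lines that do not match the assumed layout are skipped.
    """
    hits = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        timestamp, pid, message = match.groups()
        found = ERRNO_RE.search(message)
        if found and int(found.group(1)) == errno:
            hits.append((timestamp, int(pid), message))
    return hits


def count_by_hour(hits):
    """Count hits per 'MM/DD/YY HH' bucket to show when resets cluster."""
    return Counter(timestamp[:11] for timestamp, _pid, _message in hits)
```

To use it, pass an open log file (or any iterable of lines) to `find_socket_errors` and feed the result to `count_by_hour`. The location of the SchedLog depends on the local installation and is not given in the thread.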

