ShadowLog.old:11/04/16 12:07:51 (8216.19) (2590542): condor_read() failed: recv(fd=3) returned -1, errno = 104 Connection reset by peer, reading 5 bytes from startd slot3@xxxxxxxxxxxxxxxxxxxxxxxx-research.com.
ShadowLog.old:11/04/16 12:07:51 (8216.19) (2590542): IO: Failed to read packet header
ShadowLog.old:11/04/16 12:07:51 (8216.19) (2590542): Can no longer talk to condor_starter <10.40.241.131:9618>
ShadowLog.old:11/04/16 12:07:51 (8216.19) (2590542): Trying to reconnect to disconnected job
ShadowLog.old:11/04/16 12:07:51 (8216.19) (2590542): LastJobLeaseRenewal: 1478275311 Fri Nov 4 12:01:51 2016
ShadowLog.old:11/04/16 12:07:51 (8216.19) (2590542): JobLeaseDuration: 2400 seconds
ShadowLog.old:11/04/16 12:07:51 (8216.19) (2590542): JobLeaseDuration remaining: 2040
ShadowLog.old:11/04/16 12:07:51 (8216.19) (2590542): Attempting to locate disconnected starter
ShadowLog.old:11/04/16 12:07:51 (8216.19) (2590542): condor_read() failed: recv(fd=3) returned -1, errno = 104 Connection reset by peer, reading 5 bytes from startd slot3@xxxxxxxxxxxxxxxxxxxxxxxx-research.com.
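
For what it's worth, the lease figures in the excerpt look self-consistent: the last renewal was at 12:01:51, the log entry is at 12:07:51, and the duration is 2400 seconds. Below is a minimal sketch of that arithmetic, assuming the shadow reports remaining time as the lease duration minus the time elapsed since the last renewal. That formula is my assumption, not taken from the HTCondor source, and `lease_remaining` is a hypothetical helper name.

```python
from datetime import datetime


def lease_remaining(last_renewal: datetime, now: datetime, lease_duration: int) -> int:
    """Seconds left on a job lease (hypothetical helper, assumed formula)."""
    elapsed = int((now - last_renewal).total_seconds())
    return lease_duration - elapsed


# Values copied from the ShadowLog lines above (both are local wall-clock times).
last_renewal = datetime(2016, 11, 4, 12, 1, 51)  # LastJobLeaseRenewal
log_time = datetime(2016, 11, 4, 12, 7, 51)      # timestamp of the log entry
remaining = lease_remaining(last_renewal, log_time, 2400)
print(remaining)
```

With these inputs the result matches the "JobLeaseDuration remaining: 2040" line, so the shadow still had about 34 minutes of lease left when it started trying to reconnect.
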
My question is: what can we configure on the hosts to deal with startds that (seemingly) go silent? I've found various old posts on the list suggesting there was talk of automatically moving jobs from Running to Idle after exec nodes fail; did anything come of that?