This is from the starter log:

05/18/17 14:07:17 (pid:6932) condor_read() failed: recv(fd=772) returned -1, errno = 10054 , reading 5 bytes from <10.85.1.216:55287>.

Error code 10054 means: An existing connection was forcibly closed by the remote host.

This is from the shadow log:

05/18/17 14:07:17 (4.35) (2912): condor_read() failed: recv(fd=552) returned -1, errno = 10053 , reading 5 bytes from <10.85.1.224:51523>.

Error code 10053 means: An established connection was aborted by the software in your host machine.

This happened while transferring a file that is being read from a file share:

\\lyta\tomodev\USER\Run_IC10.0.2_base\MC3ADV.DLL

That error code, combined with the fact that the file itself is on a file share, suggests to me that the problem is that \\lyta is closing the connection when we try to read from that file.
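For reference, here is a small lookup of the Winsock error codes that show up in these logs (10053 and 10054 quoted above, plus 10022 from the TransmitFile() lines in the quoted ShadowLog). It is only an illustrative sketch: the table and the helper name `describe_winsock_error` are made up for this example and are not part of HTCondor. The descriptions are the standard Windows ones.

```python
# Winsock error codes seen in the StarterLog and ShadowLog excerpts.
# The symbolic names and descriptions are the standard Windows ones;
# the table itself is a hypothetical helper, not an HTCondor API.
WINSOCK_ERRORS = {
    10022: ("WSAEINVAL", "An invalid argument was supplied."),
    10053: ("WSAECONNABORTED",
            "An established connection was aborted by the software "
            "in your host machine."),
    10054: ("WSAECONNRESET",
            "An existing connection was forcibly closed by the remote host."),
}


def describe_winsock_error(code: int) -> str:
    """Return a one-line description for a Winsock error code."""
    name, text = WINSOCK_ERRORS.get(code, ("UNKNOWN", "Code not in this table."))
    return f"{code} ({name}): {text}"


if __name__ == "__main__":
    for code in (10022, 10053, 10054):
        print(describe_winsock_error(code))
```

Note that 10054 is reported by the side that saw the peer drop the connection, while 10053 is reported by the side whose own host aborted it, which is consistent with the shadow giving up on the transfer first.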
I would suggest that you look at the server logs on that file server to see if (and why) that is happening. Perhaps it can’t handle as many simultaneous connections as HTCondor is trying to use?

-tj

From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Dubinski, Robert

Hello,

We are running Condor on Windows and having some issues. Currently we have 8.6.3 installed, with central manager, submit, and execute nodes on separate machines. All these machines are running Windows 7, and Condor was installed using the 64-bit MSI package. Adding our user credentials to CREDD and
submitting jobs goes OK, but performance is slow compared to running apps directly on execute nodes outside of Condor.

Scanning through the Condor logs, we see errors relating to file transfers. In the StarterLogs on the execute nodes, we see many sections like the following:

05/18/17 14:07:17 (pid:6932) condor_read(): timeout reading 5 bytes from <10.85.1.216:9618>.
05/18/17 14:07:17 (pid:6932) IO: Failed to read packet header
05/18/17 14:07:17 (pid:6932) Failed to receive filesize in ReliSock::get_file
05/18/17 14:07:17 (pid:6932) DoDownload: STARTER at 10.85.1.224 failed to receive file C:\condor\execute\dir_6932\MC3ADV.DLL
05/18/17 14:07:17 (pid:6932) File transfer failed (status=0).
05/18/17 14:07:17 (pid:6932) ERROR "Failed to transfer files" at line 2364 in file C:\condor\execute\dir_13584\sources\src\condor_starter.V6.1\jic_shadow.cpp
05/18/17 14:07:17 (pid:6932) ShutdownFast all jobs.
05/18/17 14:07:17 (pid:6932) condor_read() failed: recv(fd=772) returned -1, errno = 10054 , reading 5 bytes from <10.85.1.216:55287>.
05/18/17 14:07:17 (pid:6932) IO: Failed to read packet header
05/18/17 14:07:17 (pid:6932) Lost connection to shadow, waiting 2400 secs for reconnect
05/18/17 14:39:02 (pid:6624) condor_read(): timeout reading 5 bytes from <10.85.1.216:9618>.
05/18/17 14:39:02 (pid:6624) IO: Failed to read packet header
05/18/17 14:39:02 (pid:6624) Failed to receive filesize in ReliSock::get_file
05/18/17 14:39:02 (pid:6624) DoDownload: STARTER at 10.85.1.224 failed to receive file C:\condor\execute\dir_6624\PICN20.DLL
05/18/17 14:39:02 (pid:6624) File transfer failed (status=0).
05/18/17 14:39:02 (pid:6624) ERROR "Failed to transfer files" at line 2364 in file C:\condor\execute\dir_13584\sources\src\condor_starter.V6.1\jic_shadow.cpp
05/18/17 14:39:02 (pid:6624) ShutdownFast all jobs.
05/18/17 14:39:03 (pid:6624) condor_read() failed: recv(fd=1156) returned -1, errno = 10054 , reading 5 bytes from <10.85.1.216:57504>.
05/18/17 14:39:03 (pid:6624) IO: Failed to read packet header
05/18/17 14:39:03 (pid:6624) Lost connection to shadow, waiting 2400 secs for reconnect
05/18/17 14:44:21 (pid:2760) condor_read(): timeout reading 5 bytes from <10.85.1.216:9618>.
05/18/17 14:44:21 (pid:2760) Failed to receive filesize in ReliSock::get_file
05/18/17 14:44:21 (pid:2760) DoDownload: STARTER at 10.85.1.224 failed to receive file C:\condor\execute\dir_2760\iconv.dll
05/18/17 14:44:21 (pid:2760) File transfer failed (status=0).
05/18/17 14:44:21 (pid:2760) ERROR "Failed to transfer files" at line 2364 in file C:\condor\execute\dir_13584\sources\src\condor_starter.V6.1\jic_shadow.cpp
05/18/17 14:44:21 (pid:2760) ShutdownFast all jobs.
05/18/17 14:44:21 (pid:2760) condor_read() failed: recv(fd=1200) returned -1, errno = 10054 , reading 5 bytes from <10.85.1.216:58028>.
05/18/17 14:44:21 (pid:2760) IO: Failed to read packet header
05/18/17 14:44:21 (pid:2760) Lost connection to shadow, waiting 2400 secs for reconnect
05/18/17 14:44:21 (pid:2760) IO: Failed to read packet header

Meanwhile, in the ShadowLog on the submit node, the failure to transmit files also appears throughout:

05/18/17 14:07:17 (4.34) (4588): ReliSock: put_file: TransmitFile() failed, errno=10022
05/18/17 14:07:17 (4.35) (2912): ReliSock: put_file: TransmitFile() failed, errno=10022
05/18/17 14:07:17 (4.35) (2912): condor_read() failed: recv(fd=552) returned -1, errno = 10053 , reading 5 bytes from <10.85.1.224:51523>.
05/18/17 14:07:17 (4.35) (2912): IO: Failed to read packet header
05/18/17 14:07:17 (4.35) (2912): DoUpload: SHADOW at 10.85.1.216 failed to send file(s) to <10.85.1.224:51523>: error sending \\lyta\tomodev\USER\Run_IC10.0.2_base\MC3ADV.DLL
05/18/17 14:07:17 (4.34) (4588): condor_read() failed: recv(fd=420) returned -1, errno = 10053 , reading 5 bytes from <10.85.1.224:51534>.
05/18/17 14:07:17 (4.34) (4588): IO: Failed to read packet header
05/18/17 14:07:17 (4.34) (4588): DoUpload: SHADOW at 10.85.1.216 failed to send file(s) to <10.85.1.224:51534>: error sending \\lyta\tomodev\USER\Run_IC10.0.2_base\MC3ADV.DLL
05/18/17 14:07:17 (4.35) (2912): ERROR "Error from slot4@xxxxxxxxxxxxxxxxxxxxxxx: Failed to transfer files" at line 570 in file C:\condor\execute\dir_13584\sources\src\condor_shadow.V6.1\pseudo_ops.cpp
05/18/17 14:07:17 (4.34) (4588): ERROR "Error from slot1@xxxxxxxxxxxxxxxxxxxxxxx: Failed to transfer files" at line 570 in file C:\condor\execute\dir_13584\sources\src\condor_shadow.V6.1\pseudo_ops.cpp
05/18/17 14:07:17 (4.36) (2292): ReliSock: put_file: TransmitFile() failed, errno=10022
05/18/17 14:07:17 (4.36) (2292): condor_read() failed: recv(fd=552) returned -1, errno = 10053 , reading 5 bytes from <10.85.1.237:50293>.
05/18/17 14:07:17 (4.36) (2292): IO: Failed to read packet header
05/18/17 14:07:17 (4.36) (2292): DoUpload: SHADOW at 10.85.1.216 failed to send file(s) to <10.85.1.237:50293>: error sending \\lyta\tomodev\USER\Run_IC10.0.2_base\MC3ADV.DLL
05/18/17 14:07:17 (4.36) (2292): ERROR "Error from slot2@xxxxxxxxxxxxxxxxxxxxxxx: Failed to transfer files" at line 570 in file C:\condor\execute\dir_13584\sources\src\condor_shadow.V6.1\pseudo
05/18/17 14:08:29 (4.27) (3312): Request to run on slot3@xxxxxxxxxxxxxxxxxxxxxxx <10.85.1.237:49312?addrs=10.85.1.237-49312> was ACCEPTED
05/18/17 14:08:32 (4.38) (5560): Request to run on slot2@xxxxxxxxxxxxxxxxxxxxxxx <10.85.1.224:9618?addrs=10.85.1.224-9618&noUDP&sock=4204_6b8e_3> was ACCEPTED
05/18/17 14:08:56 (4.39) (5804): ReliSock: put_file: TransmitFile() failed, errno=10054
05/18/17 14:08:56 (4.33) (3684): ReliSock: put_file: TransmitFile() failed, errno=10054
05/18/17 14:08:56 (4.33) (3684): DoUpload: SHADOW at 10.85.1.216 failed to send file(s) to <10.85.1.237:50376>: error sending \\lyta\tomodev\USER\Run_IC10.0.2_base\libxml2.dll; STARTER at 10.85.1.237 failed to receive file C:\condor\execute\dir_2480\libxml2.dll
05/18/17 14:08:56 (4.35) (4132): ReliSock: put_file: TransmitFile() failed, errno=10022
05/18/17 14:08:56 (4.35) (4132): condor_read() failed: recv(fd=576) returned -1, errno = 10053 , reading 5 bytes from <10.85.1.237:50389>.
05/18/17 14:08:56 (4.35) (4132): IO: Failed to read packet header
05/18/17 14:08:56 (4.35) (4132): DoUpload: SHADOW at 10.85.1.216 failed to send file(s) to <10.85.1.237:50389>: error sending \\lyta\tomodev\USER\Run_IC10.0.2_base\libxml2.dll
05/18/17 14:08:56 (4.39) (5804): DoUpload: SHADOW at 10.85.1.216 failed to send file(s) to <10.85.1.237:50402>: error sending \\lyta\tomodev\USER\Run_IC10.0.2_base\libxml2.dll; STARTER at 10.85.1.237 failed to receive file C:\condor\execute\dir_1296\libxml2.dll
05/18/17 14:08:56 (4.33) (3684): ERROR "Error from slot2@xxxxxxxxxxxxxxxxxxxxxxx: Failed to transfer files" at line 570 in file C:\condor\execute\dir_13584\sources\src\condor_shadow.V6.1\pseudo_ops.cpp
05/18/17 14:08:56 (4.36) (2088): ReliSock: put_file: TransmitFile() failed, errno=10054
05/18/17 14:08:56 (4.36) (2088): condor_read() failed: recv(fd=556) returned -1, errno = 10053 , reading 5 bytes from <10.85.1.224:51625>.
05/18/17 14:08:56 (4.36) (2088): IO: Failed to read packet header
05/18/17 14:08:56 (4.36) (2088): DoUpload: SHADOW at 10.85.1.216 failed to send file(s) to <10.85.1.224:51625>: error sending \\lyta\tomodev\USER\Run_IC10.0.2_base\libxml2.dll
05/18/17 14:08:56 (4.35) (4132): ERROR "Error from slot1@xxxxxxxxxxxxxxxxxxxxxxx: Failed to transfer files" at line 570 in file C:\condor\execute\dir_13584\sources\src\condor_shadow.V6.1\pseudo_ops.cpp
05/18/17 14:08:56 (4.36) (2088): ERROR "Error from slot4@xxxxxxxxxxxxxxxxxxxxxxx: Failed to transfer files" at line 570 in file C:\condor\execute\dir_13584\sources\src\condor_shadow.V6.1\pseudo_ops.cpp
05/18/17 14:08:56 (4.39) (5804): ERROR "Error from slot4@xxxxxxxxxxxxxxxxxxxxxxx: Failed to transfer files" at line 570 in file C:\condor\execute\dir_13584\sources\src\condor_shadow.V6.1\pseudo_ops.cpp
05/18/17 14:08:57 ******************************************************
05/18/17 14:09:17 Initializing a VANILLA shadow for job 4.41
05/18/17 14:09:27 (4.41) (6140): Request to run on slot1@xxxxxxxxxxxxxxxxxxxxxxx <10.85.1.224:9618?addrs=10.85.1.224-9618&noUDP&sock=4204_6b8e_3> was ACCEPTED
05/18/17 14:09:28 (4.35) (5432): ReliSock: put_file: TransmitFile() failed, errno=10022
05/18/17 14:09:29 (4.35) (5432): condor_read() failed: recv(fd=552) returned -1, errno = 10053 , reading 5 bytes from <10.85.1.237:50431>.
05/18/17 14:09:29 (4.35) (5432): IO: Failed to read packet header
05/18/17 14:09:29 (4.35) (5432): DoUpload: SHADOW at 10.85.1.216 failed to send file(s) to <10.85.1.237:50431>: error sending \\lyta\tomodev\USER\Run_IC10.0.2_base\AlgoMammo.dll
05/18/17 14:09:29 (4.35) (5432): ERROR "Error from slot4@xxxxxxxxxxxxxxxxxxxxxxx: Failed to transfer files" at line 570 in file C:\condor\execute\dir_13584\sources\src\condor_shadow.V6.1\pseudo_ops.cpp
05/18/17 14:09:29 ******************************************************

Eventually the submitted jobs do complete, but with all the failures, it’s much later than would be expected if things had executed without issue. This issue is happening to all our users, who run similar, but differing versions of their
application. Any thoughts on what might be causing this? Or what might we do to troubleshoot?

Thank you,
Robert
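As one possible starting point for troubleshooting, the ShadowLog excerpts above can be tallied to see which TransmitFile() error codes occur and which files on the share fail most often. The following is only a rough sketch: the function name `summarize_shadow_log` and the regular expressions are assumptions made for this example, based solely on the log lines quoted in this message, and they assume the share paths contain no spaces. The sample text is copied from the excerpt.

```python
import re
from collections import Counter

# Patterns inferred from the ShadowLog lines quoted in this thread (assumption:
# other HTCondor versions may format these messages differently).
ERRNO_RE = re.compile(r"TransmitFile\(\) failed, errno=(\d+)")
# Capture a UNC path after "error sending", stopping at whitespace or ';'.
SENDING_RE = re.compile(r"error sending\s+(\\\\[^\s;]+)")


def summarize_shadow_log(text):
    """Count TransmitFile() error codes and failed source files in log text."""
    errnos = Counter(int(code) for code in ERRNO_RE.findall(text))
    files = Counter(SENDING_RE.findall(text))
    return errnos, files


# Four lines copied from the ShadowLog excerpt above.
SAMPLE = r"""
05/18/17 14:07:17 (4.35) (2912): ReliSock: put_file: TransmitFile() failed, errno=10022
05/18/17 14:07:17 (4.35) (2912): DoUpload: SHADOW at 10.85.1.216 failed to send file(s) to <10.85.1.224:51523>: error sending \\lyta\tomodev\USER\Run_IC10.0.2_base\MC3ADV.DLL
05/18/17 14:08:56 (4.33) (3684): ReliSock: put_file: TransmitFile() failed, errno=10054
05/18/17 14:08:56 (4.33) (3684): DoUpload: SHADOW at 10.85.1.216 failed to send file(s) to <10.85.1.237:50376>: error sending \\lyta\tomodev\USER\Run_IC10.0.2_base\libxml2.dll; STARTER at 10.85.1.237 failed to receive file C:\condor\execute\dir_2480\libxml2.dll
"""

if __name__ == "__main__":
    errnos, files = summarize_shadow_log(SAMPLE)
    for code, count in sorted(errnos.items()):
        print(f"errno {code}: {count}")
    for path, count in files.most_common():
        print(f"{count}  {path}")
```

Running this over a whole ShadowLog and comparing the failure times against the file server's own logs may help show whether the failures cluster when many shadows read from the share at once.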