Hello,

We are running Condor on Windows and having some issues. We currently have 8.6.3 installed, with the central manager, submit, and execute nodes on separate machines. All of these machines run Windows 7, and Condor was installed using the 64-bit MSI package.

Adding our user credentials to CREDD and submitting jobs works, but performance is slow compared to running the applications directly on the execute nodes outside of Condor. Scanning through the Condor logs, we see errors relating to file transfers. In the StarterLog files on the execute nodes, we see many sections like the following:

05/18/17 14:07:17 (pid:6932) condor_read(): timeout reading 5 bytes from <10.85.1.216:9618>.
05/18/17 14:07:17 (pid:6932) IO: Failed to read packet header
05/18/17 14:07:17 (pid:6932) Failed to receive filesize in ReliSock::get_file
05/18/17 14:07:17 (pid:6932) DoDownload: STARTER at 10.85.1.224 failed to receive file C:\condor\execute\dir_6932\MC3ADV.DLL
05/18/17 14:07:17 (pid:6932) File transfer failed (status=0).
05/18/17 14:07:17 (pid:6932) ERROR "Failed to transfer files" at line 2364 in file C:\condor\execute\dir_13584\sources\src\condor_starter.V6.1\jic_shadow.cpp
05/18/17 14:07:17 (pid:6932) ShutdownFast all jobs.
05/18/17 14:07:17 (pid:6932) condor_read() failed: recv(fd=772) returned -1, errno = 10054 , reading 5 bytes from <10.85.1.216:55287>.
05/18/17 14:07:17 (pid:6932) IO: Failed to read packet header
05/18/17 14:07:17 (pid:6932) Lost connection to shadow, waiting 2400 secs for reconnect
05/18/17 14:39:02 (pid:6624) condor_read(): timeout reading 5 bytes from <10.85.1.216:9618>.
05/18/17 14:39:02 (pid:6624) IO: Failed to read packet header
05/18/17 14:39:02 (pid:6624) Failed to receive filesize in ReliSock::get_file
05/18/17 14:39:02 (pid:6624) DoDownload: STARTER at 10.85.1.224 failed to receive file C:\condor\execute\dir_6624\PICN20.DLL
05/18/17 14:39:02 (pid:6624) File transfer failed (status=0).
05/18/17 14:39:02 (pid:6624) ERROR "Failed to transfer files" at line 2364 in file C:\condor\execute\dir_13584\sources\src\condor_starter.V6.1\jic_shadow.cpp
05/18/17 14:39:02 (pid:6624) ShutdownFast all jobs.
05/18/17 14:39:03 (pid:6624) condor_read() failed: recv(fd=1156) returned -1, errno = 10054 , reading 5 bytes from <10.85.1.216:57504>.
05/18/17 14:39:03 (pid:6624) IO: Failed to read packet header
05/18/17 14:39:03 (pid:6624) Lost connection to shadow, waiting 2400 secs for reconnect
05/18/17 14:44:21 (pid:2760) condor_read(): timeout reading 5 bytes from <10.85.1.216:9618>.
05/18/17 14:44:21 (pid:2760) Failed to receive filesize in ReliSock::get_file
05/18/17 14:44:21 (pid:2760) DoDownload: STARTER at 10.85.1.224 failed to receive file C:\condor\execute\dir_2760\iconv.dll
05/18/17 14:44:21 (pid:2760) File transfer failed (status=0).
05/18/17 14:44:21 (pid:2760) ERROR "Failed to transfer files" at line 2364 in file C:\condor\execute\dir_13584\sources\src\condor_starter.V6.1\jic_shadow.cpp
05/18/17 14:44:21 (pid:2760) ShutdownFast all jobs.
05/18/17 14:44:21 (pid:2760) condor_read() failed: recv(fd=1200) returned -1, errno = 10054 , reading 5 bytes from <10.85.1.216:58028>.
05/18/17 14:44:21 (pid:2760) IO: Failed to read packet header
05/18/17 14:44:21 (pid:2760) Lost connection to shadow, waiting 2400 secs for reconnect
05/18/17 14:44:21 (pid:2760) IO: Failed to read packet header

Meanwhile, in the ShadowLog on the submit node, failures to transmit files also appear throughout:

05/18/17 14:07:17 (4.34) (4588): ReliSock: put_file: TransmitFile() failed, errno=10022
05/18/17 14:07:17 (4.35) (2912): ReliSock: put_file: TransmitFile() failed, errno=10022
05/18/17 14:07:17 (4.35) (2912): condor_read() failed: recv(fd=552) returned -1, errno = 10053 , reading 5 bytes from <10.85.1.224:51523>.
05/18/17 14:07:17 (4.35) (2912): IO: Failed to read packet header
05/18/17 14:07:17 (4.35) (2912): DoUpload: SHADOW at 10.85.1.216 failed to send file(s) to <10.85.1.224:51523>: error sending \\lyta\tomodev\USER\Run_IC10.0.2_base\MC3ADV.DLL
05/18/17 14:07:17 (4.34) (4588): condor_read() failed: recv(fd=420) returned -1, errno = 10053 , reading 5 bytes from <10.85.1.224:51534>.
05/18/17 14:07:17 (4.34) (4588): IO: Failed to read packet header
05/18/17 14:07:17 (4.34) (4588): DoUpload: SHADOW at 10.85.1.216 failed to send file(s) to <10.85.1.224:51534>: error sending \\lyta\tomodev\USER\Run_IC10.0.2_base\MC3ADV.DLL
05/18/17 14:07:17 (4.35) (2912): ERROR "Error from slot4@xxxxxxxxxxxxxxxxxxxxxxx: Failed to transfer files" at line 570 in file C:\condor\execute\dir_13584\sources\src\condor_shadow.V6.1\pseudo_ops.cpp
05/18/17 14:07:17 (4.34) (4588): ERROR "Error from slot1@xxxxxxxxxxxxxxxxxxxxxxx: Failed to transfer files" at line 570 in file C:\condor\execute\dir_13584\sources\src\condor_shadow.V6.1\pseudo_ops.cpp
05/18/17 14:07:17 (4.36) (2292): ReliSock: put_file: TransmitFile() failed, errno=10022
05/18/17 14:07:17 (4.36) (2292): condor_read() failed: recv(fd=552) returned -1, errno = 10053 , reading 5 bytes from <10.85.1.237:50293>.
05/18/17 14:07:17 (4.36) (2292): IO: Failed to read packet header
05/18/17 14:07:17 (4.36) (2292): DoUpload: SHADOW at 10.85.1.216 failed to send file(s) to <10.85.1.237:50293>: error sending \\lyta\tomodev\USER\Run_IC10.0.2_base\MC3ADV.DLL
05/18/17 14:07:17 (4.36) (2292): ERROR "Error from slot2@xxxxxxxxxxxxxxxxxxxxxxx: Failed to transfer files" at line 570 in file C:\condor\execute\dir_13584\sources\src\condor_shadow.V6.1\pseudo
05/18/17 14:08:29 (4.27) (3312): Request to run on slot3@xxxxxxxxxxxxxxxxxxxxxxx <10.85.1.237:49312?addrs=10.85.1.237-49312> was ACCEPTED
05/18/17 14:08:32 (4.38) (5560): Request to run on slot2@xxxxxxxxxxxxxxxxxxxxxxx <10.85.1.224:9618?addrs=10.85.1.224-9618&noUDP&sock=4204_6b8e_3> was ACCEPTED
05/18/17 14:08:56 (4.39) (5804): ReliSock: put_file: TransmitFile() failed, errno=10054
05/18/17 14:08:56 (4.33) (3684): ReliSock: put_file: TransmitFile() failed, errno=10054
05/18/17 14:08:56 (4.33) (3684): DoUpload: SHADOW at 10.85.1.216 failed to send file(s) to <10.85.1.237:50376>: error sending \\lyta\tomodev\USER\Run_IC10.0.2_base\libxml2.dll; STARTER at 10.85.1.237 failed to receive file C:\condor\execute\dir_2480\libxml2.dll
05/18/17 14:08:56 (4.35) (4132): ReliSock: put_file: TransmitFile() failed, errno=10022
05/18/17 14:08:56 (4.35) (4132): condor_read() failed: recv(fd=576) returned -1, errno = 10053 , reading 5 bytes from <10.85.1.237:50389>.
05/18/17 14:08:56 (4.35) (4132): IO: Failed to read packet header
05/18/17 14:08:56 (4.35) (4132): DoUpload: SHADOW at 10.85.1.216 failed to send file(s) to <10.85.1.237:50389>: error sending \\lyta\tomodev\USER\Run_IC10.0.2_base\libxml2.dll
05/18/17 14:08:56 (4.39) (5804): DoUpload: SHADOW at 10.85.1.216 failed to send file(s) to <10.85.1.237:50402>: error sending \\lyta\tomodev\USER\Run_IC10.0.2_base\libxml2.dll; STARTER at 10.85.1.237 failed to receive file C:\condor\execute\dir_1296\libxml2.dll
05/18/17 14:08:56 (4.33) (3684): ERROR "Error from slot2@xxxxxxxxxxxxxxxxxxxxxxx: Failed to transfer files" at line 570 in file C:\condor\execute\dir_13584\sources\src\condor_shadow.V6.1\pseudo_ops.cpp
05/18/17 14:08:56 (4.36) (2088): ReliSock: put_file: TransmitFile() failed, errno=10054
05/18/17 14:08:56 (4.36) (2088): condor_read() failed: recv(fd=556) returned -1, errno = 10053 , reading 5 bytes from <10.85.1.224:51625>.
05/18/17 14:08:56 (4.36) (2088): IO: Failed to read packet header
05/18/17 14:08:56 (4.36) (2088): DoUpload: SHADOW at 10.85.1.216 failed to send file(s) to <10.85.1.224:51625>: error sending \\lyta\tomodev\USER\Run_IC10.0.2_base\libxml2.dll
05/18/17 14:08:56 (4.35) (4132): ERROR "Error from slot1@xxxxxxxxxxxxxxxxxxxxxxx: Failed to transfer files" at line 570 in file C:\condor\execute\dir_13584\sources\src\condor_shadow.V6.1\pseudo_ops.cpp
05/18/17 14:08:56 (4.36) (2088): ERROR "Error from slot4@xxxxxxxxxxxxxxxxxxxxxxx: Failed to transfer files" at line 570 in file C:\condor\execute\dir_13584\sources\src\condor_shadow.V6.1\pseudo_ops.cpp
05/18/17 14:08:56 (4.39) (5804): ERROR "Error from slot4@xxxxxxxxxxxxxxxxxxxxxxx: Failed to transfer files" at line 570 in file C:\condor\execute\dir_13584\sources\src\condor_shadow.V6.1\pseudo_ops.cpp
05/18/17 14:08:57 ******************************************************
05/18/17 14:09:17 Initializing a VANILLA shadow for job 4.41
05/18/17 14:09:27 (4.41) (6140): Request to run on slot1@xxxxxxxxxxxxxxxxxxxxxxx <10.85.1.224:9618?addrs=10.85.1.224-9618&noUDP&sock=4204_6b8e_3> was ACCEPTED
05/18/17 14:09:28 (4.35) (5432): ReliSock: put_file: TransmitFile() failed, errno=10022
05/18/17 14:09:29 (4.35) (5432): condor_read() failed: recv(fd=552) returned -1, errno = 10053 , reading 5 bytes from <10.85.1.237:50431>.
05/18/17 14:09:29 (4.35) (5432): IO: Failed to read packet header
05/18/17 14:09:29 (4.35) (5432): DoUpload: SHADOW at 10.85.1.216 failed to send file(s) to <10.85.1.237:50431>: error sending \\lyta\tomodev\USER\Run_IC10.0.2_base\AlgoMammo.dll
05/18/17 14:09:29 (4.35) (5432): ERROR "Error from slot4@xxxxxxxxxxxxxxxxxxxxxxx: Failed to transfer files" at line 570 in file C:\condor\execute\dir_13584\sources\src\condor_shadow.V6.1\pseudo_ops.cpp
05/18/17 14:09:29 ******************************************************

The submitted jobs do eventually complete, but because of all the failures they finish much later than they would if everything had run without issue. This is happening to all of our users, who run similar but differing versions of their application.

Any thoughts on what might be causing this, or what we might do to troubleshoot?

Thank you,
Robert