Hi Alex,

You can effectively offload the file transfer from the submit node to the execute node(s) by using a batch file wrapper script as your Condor "executable", e.g.

%windir%\system32\net use \\fileserver\path\condor_stuff
copy \\fileserver\path\condor_stuff\real.exe .
copy \\fileserver\path\condor_stuff\input_data.dat .
real.exe
copy output_data.dat \\fileserver\path\condor_stuff
del /q *.*

You can use xcopy to transfer whole folders if necessary. You can also add some error checking, e.g. after each copy statement:

IF %ERRORLEVEL% NEQ 0 EXIT 1

and then use the following in the submit file to rerun the job if the copy fails:

on_exit_remove = (ExitCode == 0)
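Putting the pieces together, a minimal version of the wrapper with the error checks folded in might look like this (the share path and file names are just the placeholders from the example above):

@echo off
REM Map the share so the UNC copies below run under the job's credentials.
%windir%\system32\net use \\fileserver\path\condor_stuff
IF %ERRORLEVEL% NEQ 0 EXIT 1
REM Stage the real executable and its input into the local scratch directory.
copy \\fileserver\path\condor_stuff\real.exe .
IF %ERRORLEVEL% NEQ 0 EXIT 1
copy \\fileserver\path\condor_stuff\input_data.dat .
IF %ERRORLEVEL% NEQ 0 EXIT 1
REM Run the job locally, then push the result back to the share and tidy up.
real.exe
copy output_data.dat \\fileserver\path\condor_stuff
IF %ERRORLEVEL% NEQ 0 EXIT 1
del /q *.*

and the matching bits of the submit file would be along the lines of:

executable = wrapper.bat
transfer_executable = true
on_exit_remove = (ExitCode == 0)
queue

("wrapper.bat" is just a name I picked for the batch file above.) With this setup Condor only ships the tiny wrapper around, and the big files move directly between the file server and the execute nodes.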
Cheers,
Greg

From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Alexey Smirnov

Thanks a lot for your replies!

The normal file transfer with "transfer_input_files = \\fileservername\path\file" works perfectly, so there are no access problems for an authorized user. The CREDD server is also configured properly, and the "whoami" test confirms that. The reason why I started to play with the URL file transfer mode is the hope of distributing the network load and avoiding all the traffic being channeled through the submitting node.
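For reference, what I was experimenting with looked roughly like this (the host and path names are made up, and it assumes the execute nodes have a curl-style transfer plugin listed in FILETRANSFER_PLUGINS in their condor_config):

# submit file: a URL instead of a UNC path, so that each execute node
# fetches the input directly from the storage server
transfer_input_files = http://fileserver/condor_stuff/input_data.dat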
\\fileservername\path\file" works perfect so there is no access problems for an authorized user. The CREDD server is also configured properly and "whoami test" confirms that. The reason why I started to play with URL file transfer mode is a hope to distribute the network load and to avoid all traffic being channeled thru a submitting node. I can explain the problem I'm trying to solve. Recently trying to submit a bundle of 50 jobs with big (~1.5GB) input files we massively got errors: ==== 007 (027.000.000) 03/19 16:27:24 Shadow exception! Error from slot1_15@[...]: Failed to transfer files ==== Smaller bundle (let's say 5 jobs) worked fine. Delays (NextJobStartDelay) helped a bit but many jobs still failed. I think we definitely faced with a network bandwidth bottleneck
I think we have definitely hit a network bandwidth bottleneck resulting in file access timeouts:

================= StarterLog.slot1_15 ================
03/19/14 16:26:23 Received GoAhead from peer to receive C:\condor\execute\dir_8344\file.
03/19/14 16:26:23 get_file(): going to write to filename C:\condor\execute\dir_8344\file
03/19/14 16:26:25 get_file: Receiving 1320771848 bytes
03/19/14 16:26:55 condor_read(): timeout reading 65536 bytes from <[...]>.
03/19/14 16:26:55 ReliSock::get_bytes_nobuffer: Failed to receive file.
03/19/14 16:26:55 get_file: wrote 0 bytes to file
03/19/14 16:26:55 get_file(): ERROR: received 0 bytes, expected 1320771848!
03/19/14 16:26:55 DoDownload: STARTER at [...] failed to receive file C:\condor\execute\dir_8344\file
03/19/14 16:26:55 DoDownload: exiting at 2215
03/19/14 16:26:55 FileTransfer: created download transfer process with id 6
03/19/14 16:26:55 DaemonCore: in SendAliveToParent()
03/19/14 16:26:55 Completed DC_CHILDALIVE to daemon at <[...]>
03/19/14 16:26:55 DaemonCore: Leaving SendAliveToParent() - success
03/19/14 16:26:55 File transfer failed (status=0).
03/19/14 16:26:55 Calling client FileTransfer handler function.
03/19/14 16:26:55 ERROR "Failed to transfer files" at line 2050 in file c:\condor\execute\dir_29540\userdir\src\condor_starter.v6.1\jic_shadow.cpp
03/19/14 16:26:55 ShutdownFast all jobs.
03/19/14 16:26:55 Got ShutdownFast when no jobs running.
================================================

All input files are located on really good, fast network storage, so there should be no issues from that side.
The only weak link I see is the submitting node, which first needs to download ~75 GB from the network storage and then upload it to the executing nodes.

Alexey

On Tue, Apr 1, 2014 at 8:05 PM, Zachary Miller <zmiller@xxxxxxxxxxx> wrote:

On Tue, Apr 01, 2014 at 10:14:55AM -0500, Todd Tannenbaum wrote:

I would agree with all that Todd said.