Hi Jason,
The root of the problem is in this message from your starter log:
07/10/13 08:49:13 condor_read(): timeout reading 65536 bytes from daemon at <152.19.197.180:40093>.
Judging from the timestamps in the log file, the timeout was 60
seconds. This doesn't have anything to do with CCB.
You could increase the timeout by setting something like
STARTER_TIMEOUT_MULTIPLIER=5
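To make the effect of that setting concrete, here is a minimal Python sketch of the arithmetic. It assumes the 60-second timeout inferred from the log is the base value and that the multiplier scales it linearly. The function names are illustrative only and are not part of HTCondor.

```python
BASE_TIMEOUT_S = 60   # inferred from the starter log timestamps (assumption)
BLOCK_BYTES = 65536   # size of the read that timed out in the log


def effective_timeout(multiplier, base=BASE_TIMEOUT_S):
    """Timeout in seconds after applying a multiplier (assumed to scale linearly)."""
    return base * multiplier


def min_rate_to_avoid_timeout(timeout_s, block=BLOCK_BYTES):
    """Lowest sustained rate in bytes/s at which one block still arrives in time."""
    return block / timeout_s


if __name__ == "__main__":
    for m in (1, 5):
        t = effective_timeout(m)
        rate = min_rate_to_avoid_timeout(t)
        print(f"multiplier={m}: timeout={t}s, minimum rate={rate:.1f} bytes/s")
```

Under these assumptions a single block only has to arrive at roughly 1 KiB/s to beat the unmodified timeout, so a larger multiplier is likely to hide a stall rather than remove its cause.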
However, 60 seconds is a long time to transmit 65536 bytes. Is
your submit node maxing out its network or disk bandwidth? In
HTCondor 8.0, there are some attributes in the schedd ClassAd that
monitor bandwidth usage by file transfer:
condor_status -schedd -l | grep BytesPerSecond
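If you would rather collect those values from a script than from a shell pipeline, the following Python sketch does the same filtering. It assumes condor_status is on the PATH and that the long-format output consists of "Name = Value" lines. The helper names are hypothetical, and no particular attribute names are assumed beyond the BytesPerSecond substring used in the grep.

```python
import subprocess


def filter_bandwidth_attrs(classad_text, needle="BytesPerSecond"):
    """Return (name, value) pairs for 'Name = Value' lines whose name contains needle."""
    pairs = []
    for line in classad_text.splitlines():
        name, sep, value = line.partition("=")
        if sep and needle in name:
            pairs.append((name.strip(), value.strip()))
    return pairs


def schedd_bandwidth():
    """Run 'condor_status -schedd -l' and keep only the bandwidth attributes."""
    result = subprocess.run(
        ["condor_status", "-schedd", "-l"],
        check=True, capture_output=True, text=True,
    )
    return filter_bandwidth_attrs(result.stdout)
```

Calling schedd_bandwidth() on the submit node and printing each pair should show the same lines as the grep. If the pool has more than one schedd, the pairs from every ad appear in the list in output order.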
If things other than HTCondor file transfer are using bandwidth on
the submit machine, you will need to look at general system
statistics to see the effect of those.
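As one way to look at those general statistics on a Linux submit machine, here is a rough Python sketch that samples the interface byte counters in /proc/net/dev twice and reports the rate in between. It assumes the usual Linux layout (two header lines, then one line per interface with received bytes as the first counter and transmitted bytes as the ninth). The function names are illustrative, and tools such as sar or iostat would serve equally well.

```python
import time


def parse_net_dev(text):
    """Map interface name -> (rx_bytes, tx_bytes) from /proc/net/dev content."""
    counters = {}
    for line in text.splitlines()[2:]:  # the first two lines are column headers
        iface, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and len(fields) >= 9:
            counters[iface.strip()] = (int(fields[0]), int(fields[8]))
    return counters


def rates(before, after, seconds):
    """Per-interface (rx, tx) bytes per second between two snapshots."""
    return {
        name: ((after[name][0] - rx) / seconds, (after[name][1] - tx) / seconds)
        for name, (rx, tx) in before.items()
        if name in after
    }


def sample_rates(interval=5, path="/proc/net/dev"):
    """Take two snapshots `interval` seconds apart and return the rates."""
    with open(path) as f:
        first = parse_net_dev(f.read())
    time.sleep(interval)
    with open(path) as f:
        second = parse_net_dev(f.read())
    return rates(first, second, interval)
```

Comparing the transmit rate from sample_rates() against what HTCondor reports for file transfer gives a rough idea of how much bandwidth other processes are consuming. This covers the network only; disk bandwidth needs a separate check.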
Of course, the submit node isn't the only place where a bottleneck
might appear. The site where the glideins are running could also be
maxed out.
--Dan
On 7/10/13 8:27 AM, Jason wrote:
Hi all,
I am using Condor Glideins with CCB and am experiencing a
problem where a partial file transfer occurs, but then fails
with the following on the central-manager side:
07/10/13 09:04:11 DaemonCore: command socket at
<152.19.197.180:40872?noUDP>
07/10/13 09:04:11 DaemonCore: private command socket at
<152.19.197.180:40872>
07/10/13 09:04:11 Setting maximum accepts per cycle 4.
07/10/13 09:04:11 Initializing a VANILLA shadow for job 598057.0
07/10/13 09:04:11 (598042.0) (14010): condor_write() failed:
send() 65536 bytes to <152.54.2.30:40808> returned -1,
timeout=0, errno=32 Broken pipe.
07/10/13 09:04:11 (598042.0) (14010):
ReliSock::put_bytes_nobuffer: Send failed.
07/10/13 09:04:11 (598042.0) (14010): ReliSock::put_file: failed
to put 65536 bytes (put_bytes_nobuffer() returned -1)
07/10/13 09:04:11 (598042.0) (14010): DoUpload: SHADOW at
152.19.197.180 failed to send file(s) to
<152.54.2.30:40808>: error sending
/proj/seq/mapseq/RENCI/130508_UNC16-SN851_0242_BC241KACXX/NIDAUCSF/061210Sm/130508_UNC16-SN851_0242_BC241KACXX_GTGGCC_L004.fixed-rg.deduped.realign.fix.pr.bam; STARTER
at 10.9.15.247 failed to receive file
/projects/mapseq/jlrm/glideins/2013-07-09/6e0fc74f-9ba2-4a95-80fc-8c0276cad418/execute/dir_29411/130508_UNC16-SN851_0242_BC241KACXX_GTGGCC_L004.fixed-rg.deduped.realign.fix.pr.bam
07/10/13 09:04:11 (598042.0) (14010): ERROR "Error from
slot1@xxxxxxxxxxxxxxxxx: Failed to transfer files" at line 676 in
file
/home/condor/execute/dir_15857/userdir/src/condor_shadow.V6.1/pseudo_ops.cpp
Here is what I see on the compute node side:
07/10/13 08:48:12 entering FileTransfer::DoDownload sync=0
07/10/13 08:48:13 REMAP: begin with rules:
07/10/13 08:48:13 REMAP: 0:
130508_UNC16-SN851_0242_BC241KACXX_GTGGCC_L004.fixed-rg.deduped.realign.fix.pr.bam
07/10/13 08:48:13 REMAP: res is 0 -> !
07/10/13 08:48:13 Sending GoAhead for 152.19.197.180 to send
/projects/mapseq/jlrm/glideins/2013-07-09/6e0fc74f-9ba2-4a95-80fc-8c0276cad418/execute/dir_28735/130508_UNC16-SN851_0242_BC241KACXX_GTGGCC_L004.fixed-rg.deduped.realign.fix.pr.bam
and all further files.
07/10/13 08:48:13 Received GoAhead from peer to receive
/projects/mapseq/jlrm/glideins/2013-07-09/6e0fc74f-9ba2-4a95-80fc-8c0276cad418/execute/dir_28735/130508_UNC16-SN851_0242_BC241KACXX_GTGGCC_L004.fixed-rg.deduped.realign.fix.pr.bam.
07/10/13 08:48:13 get_file(): going to write to filename
/projects/mapseq/jlrm/glideins/2013-07-09/6e0fc74f-9ba2-4a95-80fc-8c0276cad418/execute/dir_28735/130508_UNC16-SN851_0242_BC241KACXX_GTGGCC_L004.fixed-rg.deduped.realign.fix.pr.bam
07/10/13 08:48:13 get_file: Receiving 3267697542 bytes
07/10/13 08:49:13 condor_read(): timeout reading 65536 bytes from
daemon at <152.19.197.180:40093>.
07/10/13 08:49:13 ReliSock::get_bytes_nobuffer: Failed to receive
file.
07/10/13 08:49:13 get_file: wrote 58589184 bytes to file
07/10/13 08:49:13 get_file(): ERROR: received 58589184 bytes,
expected 3267697542!
07/10/13 08:49:13 DoDownload: STARTER at 10.9.15.247 failed to
receive file
/projects/mapseq/jlrm/glideins/2013-07-09/6e0fc74f-9ba2-4a95-80fc-8c0276cad418/execute/dir_28735/130508_UNC16-SN851_0242_BC241KACXX_GTGGCC_L004.fixed-rg.deduped.realign.fix.pr.bam
07/10/13 08:49:13 DoDownload: exiting at 2213
On the compute node side, I have the following in the
condor_config.local:
HIGHPORT=41000
LOWPORT=40000
WANT_UDP_COMMAND_SOCKET=False
UPDATE_COLLECTOR_WITH_TCP=True
USE_CCB="True"
CCB_ADDRESS=$(COLLECTOR_HOST)
PRIVATE_NETWORK_NAME = $(FULL_HOSTNAME)
I am assuming that I have the configuration set up correctly as I
am getting a partial download, but something is causing the socket
connection to hang/timeout/fail. Any suggestions as to how I can
find what is causing the "Broken pipe"?
Thanks,
Jason