Re: [HTCondor-users] condor glidein ccb condor_write() broken pipe

Date: Wed, 10 Jul 2013 16:31:36 -0500

Subject: Re: [HTCondor-users] condor glidein ccb condor_write() broken pipe

Hi Jason,

The starter probably received SIGQUIT from its parent, the startd. The startd log may indicate why. Also, the shadow log snippet you posted appears to be from a new shadow that was started after the previous one that was transferring the file exited. Look back further in the log for the shadow that was engaged in the file transfer to see what happened to it. The fact that things aborted around 1 hour after the file transfer started makes me suspect you are hitting another timeout.

In your present case, it may be beneficial to decrease the number of concurrent file transfers:

MAX_CONCURRENT_UPLOADS = 3

The default is 10.
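As a rough illustration of why fewer concurrent uploads can help, here is a minimal sketch of the best-case arithmetic. It assumes the 1GigE link on the submit node is split evenly among the concurrent transfers and that nothing else uses it, which is an idealization; the file size is taken from the "get_file: Receiving" line in the starter log quoted below.

```python
# Idealized model (an assumption, not how HTCondor schedules transfers):
# the submit node's link is shared evenly by all concurrent uploads.
LINK_BITS_PER_SEC = 1e9      # 1GigE submit node, as reported in this thread
FILE_BYTES = 2628616372      # from "get_file: Receiving 2628616372 bytes"


def best_case_seconds(concurrent_uploads: int) -> float:
    """Best-case time to send one file when the link is split evenly."""
    share_bytes_per_sec = LINK_BITS_PER_SEC / 8 / concurrent_uploads
    return FILE_BYTES / share_bytes_per_sec


for n in (10, 3):
    print(f"{n} concurrent uploads: about {best_case_seconds(n):.0f} s best case")
```

Even with 10 concurrent uploads the ideal figure is a few minutes, while the transfer in the log had not finished after roughly an hour, so the real throughput is far below this bound and something besides the raw link speed is probably involved.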

--Dan

On 7/10/13 1:54 PM, Jason wrote:

Dan,

First off, thanks for your help.

The submit node bandwidth is a bottleneck (1GigE). And it would seem that adding "STARTER_TIMEOUT_MULTIPLIER=5" helps, but I am not out of the woods yet. I am now getting the following on the execute node:
07/10/13 12:22:03 Sending GoAhead for 152.19.197.180 to send /projects/mapseq/jlrm/glideins/2013-07-09/6e0fc74f-9ba2-4a95-80fc-8c0276cad418/execute/dir_5586/130503_UNC11-SN627_0296_BC228JACXX_AGTTCC_L001.fixed-rg.deduped.realign.fix.pr.bam and all further files.
07/10/13 12:22:03 Received GoAhead from peer to receive /projects/mapseq/jlrm/glideins/2013-07-09/6e0fc74f-9ba2-4a95-80fc-8c0276cad418/execute/dir_5586/130503_UNC11-SN627_0296_BC228JACXX_AGTTCC_L001.fixed-rg.deduped.realign.fix.pr.bam.
07/10/13 12:22:03 get_file(): going to write to filename /projects/mapseq/jlrm/glideins/2013-07-09/6e0fc74f-9ba2-4a95-80fc-8c0276cad418/execute/dir_5586/130503_UNC11-SN627_0296_BC228JACXX_AGTTCC_L001.fixed-rg.deduped.realign.fix.pr.bam
07/10/13 12:22:03 get_file: Receiving 2628616372 bytes
07/10/13 12:41:44 DaemonCore: in SendAliveToParent()
07/10/13 12:41:44 File descriptor limits: max 10000, safe 8000
07/10/13 12:41:44 DaemonCore: Leaving SendAliveToParent() - pending
07/10/13 12:41:44 Completed DC_CHILDALIVE to daemon at <10.9.15.247:40306>
07/10/13 12:41:51 CCBListener: sent heartbeat to server.
07/10/13 12:42:02 CCBListener: received heartbeat from server.
07/10/13 13:01:25 DaemonCore: in SendAliveToParent()
07/10/13 13:01:25 DaemonCore: Leaving SendAliveToParent() - pending
07/10/13 13:01:25 Completed DC_CHILDALIVE to daemon at <10.9.15.247:40306>
07/10/13 13:02:02 CCBListener: sent heartbeat to server.
07/10/13 13:02:03 CCBListener: received heartbeat from server.
07/10/13 13:21:07 DaemonCore: in SendAliveToParent()
07/10/13 13:21:07 DaemonCore: Leaving SendAliveToParent() - pending
07/10/13 13:21:07 Completed DC_CHILDALIVE to daemon at <10.9.15.247:40306>
07/10/13 13:21:52 Got SIGQUIT.  Performing fast shutdown.
07/10/13 13:21:52 ShutdownFast all jobs.
07/10/13 13:21:52 Got ShutdownFast when no jobs running.
07/10/13 13:21:52 HOOK_JOB_EXIT not configured.
07/10/13 13:21:52 Initializing Directory: curr_dir = /projects/mapseq/jlrm/glideins/2013-07-09/6e0fc74f-9ba2-4a95-80fc-8c0276cad418/execute/dir_5586
07/10/13 13:21:52 Entering JICShadow::updateShadow()
07/10/13 13:21:52 condor_write(): Socket closed when trying to write 146 bytes to <152.19.197.180:40797>, fd is 11
07/10/13 13:21:52 Buf::write(): condor_write() failed
07/10/13 13:21:52 Sent job ClassAd update to startd.
07/10/13 13:21:52 JICShadow::updateShadow(): failed to send update
And in the submit node's ShadowLog:
07/10/13 13:21:51 Initializing a VANILLA shadow for job 598364.0
07/10/13 13:21:52 (598364.0) (23868): Request to run on slot1@xxxxxxxxxxxxxxxxx <10.9.15.247:40306?CCBID=152.19.197.180:9618#171843&noUDP> was REFUSED
07/10/13 13:21:52 (598364.0) (23868): Job 598364.0 is being evicted from slot1@xxxxxxxxxxxxxxxxx
07/10/13 13:21:52 (598364.0) (23868): logEvictEvent with unknown reason (108), aborting
07/10/13 13:21:52 (598364.0) (23868): **** condor_shadow (condor_SHADOW) pid 23868 EXITING WITH STATUS 108
While watching the execute directory, I can see that the file is never fully transferred. Before the transfer finishes, it looks like a SIGQUIT is issued. How can I find out what issued the SIGQUIT?

Thanks,
Jason

On 07/10/2013 10:54 AM, Dan Bradley wrote:
Hi Jason,

The root of the problem is in this message from your starter log:

07/10/13 08:49:13 condor_read(): timeout reading 65536 bytes from daemon at <152.19.197.180:40093>.

Judging from the timestamps in the log file, the timeout was 60 seconds. This doesn't have anything to do with CCB.
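For reference, the 60-second figure can be checked from the log with a short sketch like the one below. The timestamps and byte counts are copied from the starter log quoted further down; the average rate is only a crude estimate over that interval and says nothing about how the data arrived within it.

```python
from datetime import datetime

# Timestamp format used in the HTCondor log lines quoted in this thread.
FMT = "%m/%d/%y %H:%M:%S"

start = datetime.strptime("07/10/13 08:48:13", FMT)   # get_file: Receiving ...
failed = datetime.strptime("07/10/13 08:49:13", FMT)  # condor_read(): timeout ...

elapsed = (failed - start).total_seconds()
bytes_written = 58589184      # get_file: wrote 58589184 bytes to file
bytes_expected = 3267697542   # get_file: Receiving 3267697542 bytes

avg_rate = bytes_written / elapsed  # bytes per second, averaged over the interval

print(f"elapsed: {elapsed:.0f} s")
print(f"average rate: {avg_rate / 1e6:.2f} MB/s")
print(f"fraction received: {bytes_written / bytes_expected:.1%}")
```

This gives 60 seconds elapsed, an average just under 1 MB/s, and under 2% of the file received before the read timed out.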

You could increase the timeout by setting something like

STARTER_TIMEOUT_MULTIPLIER=5

However, 60 seconds is a long time to transmit 65536 bytes. Is your submit node maxing out its network or disk bandwidth? In HTCondor 8.0, there are some attributes in the schedd ClassAd that monitor bandwidth usage by file transfer:

condor_status -schedd -l | grep BytesPerSecond
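If you want to collect those values from a script instead of grep, a minimal sketch might look like this. It assumes the long-format output is one "Name = value" pair per line; the attribute names in the sample text are hypothetical placeholders, not real schedd attributes, and `read_schedd_ads` is only a thin wrapper around the command above.

```python
import subprocess


def bytes_per_second_attrs(classad_text: str) -> dict:
    """Collect 'Name = value' lines whose name contains BytesPerSecond."""
    found = {}
    for line in classad_text.splitlines():
        name, sep, value = line.partition("=")
        if sep and "BytesPerSecond" in name:
            # Values are kept as strings; they may not always be plain numbers.
            found[name.strip()] = value.strip()
    return found


def read_schedd_ads() -> str:
    """Run the command from this message; needs the HTCondor tools on PATH."""
    result = subprocess.run(
        ["condor_status", "-schedd", "-l"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# Made-up sample with hypothetical attribute names, for illustration only.
sample = """Name = "submit.example.org"
HypotheticalUploadBytesPerSecond_1m = 1048576.0
HypotheticalDownloadBytesPerSecond_1m = 0.0
TotalRunningJobs = 12
"""
rates = bytes_per_second_attrs(sample)
print(rates)
```

On a real submit node you would pass `read_schedd_ads()` to `bytes_per_second_attrs` instead of the sample string.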

If things other than HTCondor file transfer are using bandwidth on the submit machine, you will need to look at general system statistics to see the effect of those.

Of course, the submit node isn't the only place where a bottleneck might appear. The site where the glideins are running could also be maxed out.

--Dan

On 7/10/13 8:27 AM, Jason wrote:

Hi all,

I am using Condor Glideins with CCB and am experiencing a problem where a file transfer starts but then fails partway through, with the following on the central-manager side:

07/10/13 09:04:11 DaemonCore: command socket at <152.19.197.180:40872?noUDP>
07/10/13 09:04:11 DaemonCore: private command socket at <152.19.197.180:40872>
07/10/13 09:04:11 Setting maximum accepts per cycle 4.
07/10/13 09:04:11 Initializing a VANILLA shadow for job 598057.0
07/10/13 09:04:11 (598042.0) (14010): condor_write() failed: send() 65536 bytes to <152.54.2.30:40808> returned -1, timeout=0, errno=32 Broken pipe.
07/10/13 09:04:11 (598042.0) (14010): ReliSock::put_bytes_nobuffer: Send failed.
07/10/13 09:04:11 (598042.0) (14010): ReliSock::put_file: failed to put 65536 bytes (put_bytes_nobuffer() returned -1)
07/10/13 09:04:11 (598042.0) (14010): DoUpload: SHADOW at 152.19.197.180 failed to send file(s) to <152.54.2.30:40808>: error sending /proj/seq/mapseq/RENCI/130508_UNC16-SN851_0242_BC241KACXX/NIDAUCSF/061210Sm/130508_UNC16-SN851_0242_BC241KACXX_GTGGCC_L004.fixed-rg.deduped.realign.fix.pr.bam; STARTER at 10.9.15.247 failed to receive file /projects/mapseq/jlrm/glideins/2013-07-09/6e0fc74f-9ba2-4a95-80fc-8c0276cad418/execute/dir_29411/130508_UNC16-SN851_0242_BC241KACXX_GTGGCC_L004.fixed-rg.deduped.realign.fix.pr.bam
07/10/13 09:04:11 (598042.0) (14010): ERROR "Error from slot1@xxxxxxxxxxxxxxxxx: Failed to transfer files" at line 676 in file /home/condor/execute/dir_15857/userdir/src/condor_shadow.V6.1/pseudo_ops.cpp

Here is what I see on the compute node side:

07/10/13 08:48:12 entering FileTransfer::DoDownload sync=0
07/10/13 08:48:13 REMAP: begin with rules:
07/10/13 08:48:13 REMAP: 0: 130508_UNC16-SN851_0242_BC241KACXX_GTGGCC_L004.fixed-rg.deduped.realign.fix.pr.bam
07/10/13 08:48:13 REMAP: res is 0 -> !
07/10/13 08:48:13 Sending GoAhead for 152.19.197.180 to send /projects/mapseq/jlrm/glideins/2013-07-09/6e0fc74f-9ba2-4a95-80fc-8c0276cad418/execute/dir_28735/130508_UNC16-SN851_0242_BC241KACXX_GTGGCC_L004.fixed-rg.deduped.realign.fix.pr.bam and all further files.
07/10/13 08:48:13 Received GoAhead from peer to receive /projects/mapseq/jlrm/glideins/2013-07-09/6e0fc74f-9ba2-4a95-80fc-8c0276cad418/execute/dir_28735/130508_UNC16-SN851_0242_BC241KACXX_GTGGCC_L004.fixed-rg.deduped.realign.fix.pr.bam.
07/10/13 08:48:13 get_file(): going to write to filename /projects/mapseq/jlrm/glideins/2013-07-09/6e0fc74f-9ba2-4a95-80fc-8c0276cad418/execute/dir_28735/130508_UNC16-SN851_0242_BC241KACXX_GTGGCC_L004.fixed-rg.deduped.realign.fix.pr.bam
07/10/13 08:48:13 get_file: Receiving 3267697542 bytes
07/10/13 08:49:13 condor_read(): timeout reading 65536 bytes from daemon at <152.19.197.180:40093>.
07/10/13 08:49:13 ReliSock::get_bytes_nobuffer: Failed to receive file.
07/10/13 08:49:13 get_file: wrote 58589184 bytes to file
07/10/13 08:49:13 get_file(): ERROR: received 58589184 bytes, expected 3267697542!
07/10/13 08:49:13 DoDownload: STARTER at 10.9.15.247 failed to receive file /projects/mapseq/jlrm/glideins/2013-07-09/6e0fc74f-9ba2-4a95-80fc-8c0276cad418/execute/dir_28735/130508_UNC16-SN851_0242_BC241KACXX_GTGGCC_L004.fixed-rg.deduped.realign.fix.pr.bam
07/10/13 08:49:13 DoDownload: exiting at 2213

On the compute node side, I have the following in the condor_config.local:

HIGHPORT=41000
LOWPORT=40000

WANT_UDP_COMMAND_SOCKET=False
UPDATE_COLLECTOR_WITH_TCP=True

USE_CCB="True"
CCB_ADDRESS=$(COLLECTOR_HOST)
PRIVATE_NETWORK_NAME = $(FULL_HOSTNAME)

I am assuming that I have the configuration set up correctly as I am getting a partial download, but something is causing the socket connection to hang/timeout/fail. Any suggestions as to how I can find what is causing the "Broken pipe"?

Thanks,
Jason

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/