Dan,
First off, thanks for your help. The submit node's bandwidth is a bottleneck (1GigE), and adding "STARTER_TIMEOUT_MULTIPLIER=5" seems to help, but I am not out of the woods yet. I am now getting the following on the execute node:

07/10/13 12:22:03 Sending GoAhead for 152.19.197.180 to send /projects/mapseq/jlrm/glideins/2013-07-09/6e0fc74f-9ba2-4a95-80fc-8c0276cad418/execute/dir_5586/130503_UNC11-SN627_0296_BC228JACXX_AGTTCC_L001.fixed-rg.deduped.realign.fix.pr.bam and all further files.
07/10/13 12:22:03 Received GoAhead from peer to receive /projects/mapseq/jlrm/glideins/2013-07-09/6e0fc74f-9ba2-4a95-80fc-8c0276cad418/execute/dir_5586/130503_UNC11-SN627_0296_BC228JACXX_AGTTCC_L001.fixed-rg.deduped.realign.fix.pr.bam.
07/10/13 12:22:03 get_file(): going to write to filename /projects/mapseq/jlrm/glideins/2013-07-09/6e0fc74f-9ba2-4a95-80fc-8c0276cad418/execute/dir_5586/130503_UNC11-SN627_0296_BC228JACXX_AGTTCC_L001.fixed-rg.deduped.realign.fix.pr.bam
07/10/13 12:22:03 get_file: Receiving 2628616372 bytes
07/10/13 12:41:44 DaemonCore: in SendAliveToParent()
07/10/13 12:41:44 File descriptor limits: max 10000, safe 8000
07/10/13 12:41:44 DaemonCore: Leaving SendAliveToParent() - pending
07/10/13 12:41:44 Completed DC_CHILDALIVE to daemon at <10.9.15.247:40306>
07/10/13 12:41:51 CCBListener: sent heartbeat to server.
07/10/13 12:42:02 CCBListener: received heartbeat from server.
07/10/13 13:01:25 DaemonCore: in SendAliveToParent()
07/10/13 13:01:25 DaemonCore: Leaving SendAliveToParent() - pending
07/10/13 13:01:25 Completed DC_CHILDALIVE to daemon at <10.9.15.247:40306>
07/10/13 13:02:02 CCBListener: sent heartbeat to server.
07/10/13 13:02:03 CCBListener: received heartbeat from server.
07/10/13 13:21:07 DaemonCore: in SendAliveToParent()
07/10/13 13:21:07 DaemonCore: Leaving SendAliveToParent() - pending
07/10/13 13:21:07 Completed DC_CHILDALIVE to daemon at <10.9.15.247:40306>
07/10/13 13:21:52 Got SIGQUIT. Performing fast shutdown.
07/10/13 13:21:52 ShutdownFast all jobs.
07/10/13 13:21:52 Got ShutdownFast when no jobs running.
07/10/13 13:21:52 HOOK_JOB_EXIT not configured.
07/10/13 13:21:52 Initializing Directory: curr_dir = /projects/mapseq/jlrm/glideins/2013-07-09/6e0fc74f-9ba2-4a95-80fc-8c0276cad418/execute/dir_5586
07/10/13 13:21:52 Entering JICShadow::updateShadow()
07/10/13 13:21:52 condor_write(): Socket closed when trying to write 146 bytes to <152.19.197.180:40797>, fd is 11
07/10/13 13:21:52 Buf::write(): condor_write() failed
07/10/13 13:21:52 Sent job ClassAd update to startd.
07/10/13 13:21:52 JICShadow::updateShadow(): failed to send update

And in the submit node's ShadowLog:

07/10/13 13:21:51 Initializing a VANILLA shadow for job 598364.0
07/10/13 13:21:52 (598364.0) (23868): Request to run on slot1@xxxxxxxxxxxxxxxxx <10.9.15.247:40306?CCBID=152.19.197.180:9618#171843&noUDP> was REFUSED
07/10/13 13:21:52 (598364.0) (23868): Job 598364.0 is being evicted from slot1@xxxxxxxxxxxxxxxxx
07/10/13 13:21:52 (598364.0) (23868): logEvictEvent with unknown reason (108), aborting
07/10/13 13:21:52 (598364.0) (23868): **** condor_shadow (condor_SHADOW) pid 23868 EXITING WITH STATUS 108

When I watch the execute directory, the file is never fully transferred. Before the transfer finishes, it looks like a SIGQUIT is issued, about an hour after the transfer started (12:22:03 to 13:21:52). How can I find out what issued the SIGQUIT?

Thanks,
Jason

On 07/10/2013 10:54 AM, Dan Bradley wrote:
Hi Jason,