Dan,
First off, thanks for your help.
The submit node's bandwidth (1GigE) is a bottleneck. Adding
"STARTER_TIMEOUT_MULTIPLIER=5" seems to help, but I am not out of
the woods yet. I am now getting the following on the execute
node:
07/10/13 12:22:03 Sending GoAhead for 152.19.197.180 to send /projects/mapseq/jlrm/glideins/2013-07-09/6e0fc74f-9ba2-4a95-80fc-8c0276cad418/execute/dir_5586/130503_UNC11-SN627_0296_BC228JACXX_AGTTCC_L001.fixed-rg.deduped.realign.fix.pr.bam and all further files.
07/10/13 12:22:03 Received GoAhead from peer to receive /projects/mapseq/jlrm/glideins/2013-07-09/6e0fc74f-9ba2-4a95-80fc-8c0276cad418/execute/dir_5586/130503_UNC11-SN627_0296_BC228JACXX_AGTTCC_L001.fixed-rg.deduped.realign.fix.pr.bam.
07/10/13 12:22:03 get_file(): going to write to filename /projects/mapseq/jlrm/glideins/2013-07-09/6e0fc74f-9ba2-4a95-80fc-8c0276cad418/execute/dir_5586/130503_UNC11-SN627_0296_BC228JACXX_AGTTCC_L001.fixed-rg.deduped.realign.fix.pr.bam
07/10/13 12:22:03 get_file: Receiving 2628616372 bytes
07/10/13 12:41:44 DaemonCore: in SendAliveToParent()
07/10/13 12:41:44 File descriptor limits: max 10000, safe 8000
07/10/13 12:41:44 DaemonCore: Leaving SendAliveToParent() - pending
07/10/13 12:41:44 Completed DC_CHILDALIVE to daemon at <10.9.15.247:40306>
07/10/13 12:41:51 CCBListener: sent heartbeat to server.
07/10/13 12:42:02 CCBListener: received heartbeat from server.
07/10/13 13:01:25 DaemonCore: in SendAliveToParent()
07/10/13 13:01:25 DaemonCore: Leaving SendAliveToParent() - pending
07/10/13 13:01:25 Completed DC_CHILDALIVE to daemon at <10.9.15.247:40306>
07/10/13 13:02:02 CCBListener: sent heartbeat to server.
07/10/13 13:02:03 CCBListener: received heartbeat from server.
07/10/13 13:21:07 DaemonCore: in SendAliveToParent()
07/10/13 13:21:07 DaemonCore: Leaving SendAliveToParent() - pending
07/10/13 13:21:07 Completed DC_CHILDALIVE to daemon at <10.9.15.247:40306>
07/10/13 13:21:52 Got SIGQUIT. Performing fast shutdown.
07/10/13 13:21:52 ShutdownFast all jobs.
07/10/13 13:21:52 Got ShutdownFast when no jobs running.
07/10/13 13:21:52 HOOK_JOB_EXIT not configured.
07/10/13 13:21:52 Initializing Directory: curr_dir = /projects/mapseq/jlrm/glideins/2013-07-09/6e0fc74f-9ba2-4a95-80fc-8c0276cad418/execute/dir_5586
07/10/13 13:21:52 Entering JICShadow::updateShadow()
07/10/13 13:21:52 condor_write(): Socket closed when trying to write 146 bytes to <152.19.197.180:40797>, fd is 11
07/10/13 13:21:52 Buf::write(): condor_write() failed
07/10/13 13:21:52 Sent job ClassAd update to startd.
07/10/13 13:21:52 JICShadow::updateShadow(): failed to send update
And in the submit node's ShadowLog:
07/10/13 13:21:51 Initializing a VANILLA shadow for job 598364.0
07/10/13 13:21:52 (598364.0) (23868): Request to run on slot1@xxxxxxxxxxxxxxxxx <10.9.15.247:40306?CCBID=152.19.197.180:9618#171843&noUDP> was REFUSED
07/10/13 13:21:52 (598364.0) (23868): Job 598364.0 is being evicted from slot1@xxxxxxxxxxxxxxxxx
07/10/13 13:21:52 (598364.0) (23868): logEvictEvent with unknown reason (108), aborting
07/10/13 13:21:52 (598364.0) (23868): **** condor_shadow (condor_SHADOW) pid 23868 EXITING WITH STATUS 108
While watching the execute directory, I can see that the file is
never fully transferred. Before the transfer finishes, it looks
like a SIGQUIT is issued. How can I find out what issued the SIGQUIT?
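For scale, here is a rough sketch of the arithmetic from the execute node log above. It assumes the transfer began at the 12:22:03 "get_file: Receiving" line and was still incomplete at the 13:21:52 SIGQUIT; the 125 MB/s figure is only the theoretical 1GigE line rate, not something I measured, and the variable names are my own.

```python
from datetime import datetime

# Timestamp format used in the HTCondor log lines quoted above.
FMT = "%m/%d/%y %H:%M:%S"

def elapsed_seconds(start, end):
    """Seconds between two log timestamps."""
    return (datetime.strptime(end, FMT) - datetime.strptime(start, FMT)).total_seconds()

total_bytes = 2628616372  # from "get_file: Receiving 2628616372 bytes"
secs = elapsed_seconds("07/10/13 12:22:03", "07/10/13 13:21:52")

# The file never finished, so the real rate was below this upper bound.
upper_bound_rate = total_bytes / secs

# Assumption: an otherwise idle 1GigE link at its theoretical 125 MB/s.
ideal_secs = total_bytes / 125_000_000

print(f"elapsed: {secs:.0f} s")
print(f"average rate was below {upper_bound_rate / 1e6:.2f} MB/s")
print(f"ideal 1GigE transfer time: {ideal_secs:.0f} s")
```

So the transfer had almost an hour and still did not complete, which would mean the effective rate for this one job was under 1 MB/s.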
Thanks,
Jason
On 07/10/2013 10:54 AM, Dan Bradley wrote: