Yep, Found the one I saw earlier, a job from our ARC CE. I’ll have another hunt to see if the same thing is occurring on our htcondorces. Similarities I notice straight off is that we’re both using IO Proxies between the starters and shadows. Cheers, Iain 02/16/16 17:15:19 (pid:1609794) ****************************************************** 02/16/16 17:15:19 (pid:1609794) ** condor_starter (CONDOR_STARTER) STARTING UP 02/16/16 17:15:19 (pid:1609794) ** /usr/sbin/condor_starter 02/16/16 17:15:19 (pid:1609794) ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1) 02/16/16 17:15:19 (pid:1609794) ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON 02/16/16 17:15:19 (pid:1609794) ** $CondorVersion: 8.3.8 Aug 27 2015 BuildID: 338845 $ 02/16/16 17:15:19 (pid:1609794) ** $CondorPlatform: X86_64-RedHat_6.6 $ 02/16/16 17:15:19 (pid:1609794) ** PID = 1609794 02/16/16 17:15:19 (pid:1609794) ** Log last touched 2/16 17:13:59 02/16/16 17:15:19 (pid:1609794) ****************************************************** 02/16/16 17:15:19 (pid:1609794) Using config source: /etc/condor/condor_config 02/16/16 17:15:19 (pid:1609794) Using local config sources: 02/16/16 17:15:19 (pid:1609794) /etc/condor/config.d/10_security.config 02/16/16 17:15:19 (pid:1609794) /etc/condor/config.d/14_network.config 02/16/16 17:15:19 (pid:1609794) /etc/condor/config.d/20_workernode.config 02/16/16 17:15:19 (pid:1609794) /etc/condor/config.d/24_workernode_crons.config 02/16/16 17:15:19 (pid:1609794) /etc/condor/condor_config.local 02/16/16 17:15:19 (pid:1609794) config Macros = 170, Sorted = 169, StringBytes = 6318, TablesBytes = 6200 02/16/16 17:15:19 (pid:1609794) CLASSAD_CACHING is OFF 02/16/16 17:15:19 (pid:1609794) Daemon Log is logging: D_ALWAYS D_ERROR 02/16/16 17:15:19 (pid:1609794) SharedPortEndpoint: waiting for connections to named socket 15788_d2fb_365 02/16/16 17:15:19 (pid:1609794) DaemonCore: command socket at <:9618?addrs=-9618&noUDP&sock=15788_d2fb_365> 02/16/16 17:15:19 (pid:1609794) DaemonCore: private command socket at <:9618?addrs=-9618&noUDP&sock=15788_d2fb_365> 02/16/16 17:15:19 (pid:1609794) Communicating with shadow <:9618?addrs=-9618&noUDP&sock=8114_cc03_875902> 02/16/16 17:15:19 (pid:1609794) Submitting machine is “ce501.cern.ch" 02/16/16 17:15:19 (pid:1609794) setting the orig job name in starter 02/16/16 17:15:19 (pid:1609794) setting the orig job iwd in starter 02/16/16 17:15:19 (pid:1609794) Chirp config summary: IO false, Updates false, Delayed updates true. 02/16/16 17:15:19 (pid:1609794) Initialized IO Proxy. 02/16/16 17:15:19 (pid:1609794) Done setting resource limits 02/16/16 17:15:19 (pid:1609794) File transfer completed successfully. 02/16/16 17:15:20 (pid:1609794) Job 1281546.0 set to execute immediately 02/16/16 17:15:20 (pid:1609794) Starting a VANILLA universe job with ID: 1281546.0 02/16/16 17:15:20 (pid:1609794) IWD: /pool/condor/dir_1609794 02/16/16 17:15:21 (pid:1609794) Output file: /pool/condor/dir_1609794/_condor_stdout 02/16/16 17:15:21 (pid:1609794) Error file: /pool/condor/dir_1609794/_condor_stdout 02/16/16 17:15:21 (pid:1609794) Renice expr "10" evaluated to 10 02/16/16 17:15:21 (pid:1609794) Using wrapper /usr/local/condor/job_wrapper to exec /usr/libexec/condor/condor_pid_ns_init condor_exec.exe 02/16/16 17:15:21 (pid:1609794) Running job as user <blanked> 02/16/16 17:15:21 (pid:1609794) Create_Process succeeded, pid=1609798 02/16/16 17:15:21 (pid:1609794) Limitting memory usage to 2097152000 bytes 02/16/16 17:15:21 (pid:1609794) Limitting memsw usage to 35696619520 bytes 02/16/16 17:15:40 (pid:1609794) Process exited, pid=1609798, status=0 02/16/16 17:15:41 (pid:1609794) Connection to shadow may be lost, will test by sending whoami request. 02/16/16 17:15:41 (pid:1609794) condor_write(): Socket closed when trying to write 37 bytes to <:37550>, fd is 19 02/16/16 17:15:41 (pid:1609794) Buf::write(): condor_write() failed 02/16/16 17:15:41 (pid:1609794) i/o error result is 0, errno is 0 02/16/16 17:15:41 (pid:1609794) Lost connection to shadow, waiting 2400 secs for reconnect 02/16/16 17:15:41 (pid:1609794) Got SIGQUIT. Performing fast shutdown. 02/16/16 17:15:41 (pid:1609794) ShutdownFast all jobs. 02/16/16 17:15:41 (pid:1609794) **** condor_starter (condor_STARTER) pid 1609794 EXITING WITH STATUS 0 > On Feb 16, 2016, at 23:29, andrew.lahiff@xxxxxxxxxx wrote: > > Hi, > > We seem to be getting messages like this in the StarterLogs very frequently: > > 02/16/16 20:51:37 (pid:2619) Connection to shadow may be lost, will test by sending whoami request. > 02/16/16 20:51:37 (pid:2619) condor_write(): Socket closed when trying to write 37 bytes to <x.y.z.t:8847>, fd is 7 > 02/16/16 20:51:37 (pid:2619) Buf::write(): condor_write() failed > 02/16/16 20:51:37 (pid:2619) i/o error result is 0, errno is 0 > 02/16/16 20:51:37 (pid:2619) Lost connection to shadow, waiting 2400 secs for reconnect > > The complete StarterLog for this job is shown below [1], as well as the associated ShadowLog [2]. > > Has anyone else encountered this problem or know what could be causing it? I think it only happens with 8.4.x schedds (with 8.2.7 at least I can't reproduce it). > > Many Thanks, > Andrew. > > [1] > 02/16/16 20:51:34 (pid:2619) ****************************************************** > 02/16/16 20:51:34 (pid:2619) ** condor_starter (CONDOR_STARTER) STARTING UP > 02/16/16 20:51:34 (pid:2619) ** /usr/sbin/condor_starter > 02/16/16 20:51:34 (pid:2619) ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1) > 02/16/16 20:51:34 (pid:2619) ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON > 02/16/16 20:51:34 (pid:2619) ** $CondorVersion: 8.3.7 Jul 23 2015 BuildID: 331383 $ > 02/16/16 20:51:34 (pid:2619) ** $CondorPlatform: X86_64-RedHat_6.6 $ > 02/16/16 20:51:34 (pid:2619) ** PID = 2619 > 02/16/16 20:51:34 (pid:2619) ** Log last touched 2/16 20:49:32 > 02/16/16 20:51:34 (pid:2619) ****************************************************** > 02/16/16 20:51:34 (pid:2619) Using config source: /etc/condor/condor_config > 02/16/16 20:51:34 (pid:2619) Using local config sources: > 02/16/16 20:51:34 (pid:2619) /etc/condor/config.d/00-general > 02/16/16 20:51:34 (pid:2619) /etc/condor/config.d/01-security > 02/16/16 20:51:34 (pid:2619) /etc/condor/config.d/10-worker-node > 02/16/16 20:51:34 (pid:2619) /etc/condor/config.d/11-cgroups > 02/16/16 20:51:34 (pid:2619) /etc/condor/config.d/96-hibernation > 02/16/16 20:51:34 (pid:2619) /etc/condor/config.d/97-preemptable > 02/16/16 20:51:34 (pid:2619) /etc/condor/config.d/98-reduce-power-now > 02/16/16 20:51:34 (pid:2619) /etc/condor/condor_config.local > 02/16/16 20:51:34 (pid:2619) config Macros = 148, Sorted = 147, StringBytes = 6526, TablesBytes = 5440 > 02/16/16 20:51:34 (pid:2619) CLASSAD_CACHING is OFF > 02/16/16 20:51:34 (pid:2619) Daemon Log is logging: D_ALWAYS D_ERROR > 02/16/16 20:51:34 (pid:2619) Daemoncore: Listening at <0.0.0.0:42560> on TCP (ReliSock) and UDP (SafeSock). > 02/16/16 20:51:34 (pid:2619) DaemonCore: command socket at <a.b.c.d:42560?addrs=a.b.c.d-42560> > 02/16/16 20:51:34 (pid:2619) DaemonCore: private command socket at <a.b.c.d:42560?addrs=a.b.c.d-42560> > 02/16/16 20:51:34 (pid:2619) Communicating with shadow <x.y.z.t:29414?addrs=x.y.z.t-29414&noUDP> > 02/16/16 20:51:34 (pid:2619) Submitting machine is "arc-ce05.domain” > 02/16/16 20:51:34 (pid:2619) setting the orig job name in starter > 02/16/16 20:51:34 (pid:2619) setting the orig job iwd in starter > 02/16/16 20:51:34 (pid:2619) Chirp config summary: IO false, Updates false, Delayed updates true. > 02/16/16 20:51:34 (pid:2619) Initialized IO Proxy. > 02/16/16 20:51:34 (pid:2619) Done setting resource limits > 02/16/16 20:51:35 (pid:2619) File transfer completed successfully. > 02/16/16 20:51:36 (pid:2619) Job 3714.0 set to execute immediately > 02/16/16 20:51:36 (pid:2619) Starting a VANILLA universe job with ID: 3714.0 > 02/16/16 20:51:36 (pid:2619) IWD: /pool/condor/dir_2619 > 02/16/16 20:51:36 (pid:2619) Renice expr "10" evaluated to 10 > 02/16/16 20:51:36 (pid:2619) About to exec /usr/libexec/condor/condor_pid_ns_init condor_exec.exe 1 > 02/16/16 20:51:36 (pid:2619) Running job as user nagios > 02/16/16 20:51:36 (pid:2619) Create_Process succeeded, pid=4471 > 02/16/16 20:51:37 (pid:2619) Process exited, pid=4471, status=0 > 02/16/16 20:51:37 (pid:2619) Connection to shadow may be lost, will test by sending whoami request. > 02/16/16 20:51:37 (pid:2619) condor_write(): Socket closed when trying to write 37 bytes to <x.y.z.t:8847>, fd is 7 > 02/16/16 20:51:37 (pid:2619) Buf::write(): condor_write() failed > 02/16/16 20:51:37 (pid:2619) i/o error result is 0, errno is 0 > 02/16/16 20:51:37 (pid:2619) Lost connection to shadow, waiting 2400 secs for reconnect > 02/16/16 20:51:37 (pid:2619) Got SIGQUIT. Performing fast shutdown. > 02/16/16 20:51:37 (pid:2619) ShutdownFast all jobs. > 02/16/16 20:51:37 (pid:2619) **** condor_starter (condor_STARTER) pid 2619 EXITING WITH STATUS 0 > > > [2] > 02/16/16 20:51:33 ****************************************************** > 02/16/16 20:51:33 ** condor_shadow (CONDOR_SHADOW) STARTING UP > 02/16/16 20:51:33 ** /usr/sbin/condor_shadow > 02/16/16 20:51:33 ** SubsystemInfo: name=SHADOW type=SHADOW(6) class=DAEMON(1) > 02/16/16 20:51:33 ** Configuration: subsystem:SHADOW local:<NONE> class:DAEMON > 02/16/16 20:51:33 ** $CondorVersion: 8.4.4 Feb 03 2016 BuildID: 355883 $ > 02/16/16 20:51:33 ** $CondorPlatform: x86_RedHat6 $ > 02/16/16 20:51:33 ** PID = 115754 > 02/16/16 20:51:33 ** Log last touched 2/16 20:48:10 > 02/16/16 20:51:33 ****************************************************** > 02/16/16 20:51:33 Using config source: /etc/condor/condor_config > 02/16/16 20:51:33 Using local config sources: > 02/16/16 20:51:33 /etc/condor/config.d/10security.config > 02/16/16 20:51:33 /etc/condor/config.d/12resourcelimits.config > 02/16/16 20:51:33 /etc/condor/config.d/13users-blacklists.config > 02/16/16 20:51:33 /etc/condor/config.d/14accounting-groups.config > 02/16/16 20:51:33 /etc/condor/config.d/21schedd.config > 02/16/16 20:51:33 /etc/condor/config.d/24per-job-history-files.config > 02/16/16 20:51:33 /etc/condor/condor_config.local > 02/16/16 20:51:33 config Macros = 234, Sorted = 234, StringBytes = 13208, TablesBytes = 1928 > 02/16/16 20:51:33 CLASSAD_CACHING is OFF > 02/16/16 20:51:33 Daemon Log is logging: D_ALWAYS D_ERROR > 02/16/16 20:51:33 Daemoncore: Listening at <0.0.0.0:29414> on TCP (ReliSock). > 02/16/16 20:51:33 DaemonCore: command socket at <x.y.z.t:29414?addrs=x.y.z.t-29414&noUDP> > 02/16/16 20:51:33 DaemonCore: private command socket at <x.y.z.t:29414?addrs=x.y.z.t-29414> > 02/16/16 20:51:33 Initializing a VANILLA shadow for job 3714.0 > 02/16/16 20:51:33 (3714.0) (115754): Request to run on slot1@xxxxxxxxxxxxxx <a.b.c.d:44704?addrs=a.b.c.d-44704> was ACCEPTED > 02/16/16 20:51:35 (3714.0) (115754): File transfer completed successfully. > 02/16/16 20:51:37 (3714.0) (115754): Job 3714.0 terminated: exited with status 0 > 02/16/16 20:51:37 (3714.0) (115754): Reporting job exit reason 100 and attempting to fetch new job. > 02/16/16 20:51:37 (3714.0) (115754): **** condor_shadow (condor_SHADOW) pid 115754 EXITING WITH STATUS 100 > > > > > _______________________________________________ > HTCondor-users mailing list > To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a > subject: Unsubscribe > You can also unsubscribe by visiting > https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users > > The archives can be found at: > https://lists.cs.wisc.edu/archive/htcondor-users/
Attachment:
smime.p7s
Description: S/MIME cryptographic signature