Hi,
Over the past month or so we’ve been getting large numbers of held
jobs, whereas previously we normally had close to zero. What seems to
be happening is:
(1) a job is running successfully
(2) the user’s process finishes
(3) in the ShadowLog we see
10/08/14 14:26:26 (860002.0) (30272): ERROR "Error from slot1@xxxxxxxxxxxxxx: Close_Pipe error" at line 558 in file /slots/03/dir_2866/userdir/src/condor_shadow.V6.1/pseudo_ops.cpp
and at the same time in the StarterLog we see:
10/08/14 14:26:26 (pid:18932) Close_Pipe on invalid pipe end: 0
10/08/14 14:26:26 (pid:18932) ERROR "Close_Pipe error" at line 2089 in file /slots/04/dir_38687/userdir/src/condor_daemon_core.V6/daemon_core.cpp
(4) the job is restarted on a different worker node; the Shadow then, for
some reason, hits permission problems writing to the output files, and
the job goes into the held state.
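In case it's useful, here is a quick sketch of how the affected jobs can
be listed. The sample line in the here-doc is taken from the ShadowLog
extract below and just stands in for the real log; in practice you would
point the grep at your actual ShadowLog (the path is site-specific).

```shell
# Sketch: pull the job IDs that hit the Close_Pipe error out of a ShadowLog.
# The here-doc is a stand-in sample; replace /tmp/ShadowLog.sample with the
# real ShadowLog path on your submit host.
cat > /tmp/ShadowLog.sample <<'EOF'
10/08/14 14:26:26 (860002.0) (30272): ERROR "Error from slot1@xxxxxxxxxxxxxx: Close_Pipe error" at line 558 in file pseudo_ops.cpp
EOF
# Extract the "(cluster.proc)" job ID from each matching line.
grep 'Close_Pipe error' /tmp/ShadowLog.sample \
  | sed 's/.*(\([0-9.]*\)) (.*/\1/' | sort -u
```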
Below I show an extract from the ShadowLog [1] and StarterLog [2].
We are currently using 8.2.2. I think the problems started around the
time we upgraded the worker nodes from 8.0.6 to 8.2.2 (although I'm not
100% sure).
Has anyone seen this before or know what could be causing these
Close_Pipe errors?
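For reference, this is roughly how the hold reason for one of these jobs
can be checked. In practice it would be `condor_q -l 860002.0`; the
here-doc here just fakes the relevant attributes of that output so the
snippet is self-contained, with values taken from the "going into Hold
state (code 12,13)" line in the ShadowLog extract.

```shell
# Sketch: inspect the HoldReason* attributes of a held job.  In practice:
#   condor_q -l 860002.0 | grep '^HoldReason'
# The here-doc stands in for that output; values are from the ShadowLog
# extract (code 12 = output transfer failure, subcode 13 = errno EACCES).
cat > /tmp/job_ad.sample <<'EOF'
HoldReasonCode = 12
HoldReasonSubCode = 13
HoldReason = "Error from slot1@xxxxxxxxxxxxxx: STARTER at 130.a.b.c failed to send file(s)"
EOF
grep '^HoldReasonCode' /tmp/job_ad.sample
```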
Many Thanks,
Andrew.
[1]
10/07/14 12:43:39 Initializing a VANILLA shadow for job 860002.0
10/07/14 12:43:39 (860002.0) (30272): Request to run on slot1@xxxxxxxxxxxxxx <130.a.b.c:35760> was ACCEPTED
10/07/14 12:43:40 (860002.0) (30272): File transfer completed successfully.
10/08/14 14:26:26 (860002.0) (30272): File transfer completed successfully.
10/08/14 14:26:26 (860002.0) (30272): ERROR "Error from slot1@xxxxxxxxxxxxxx: Close_Pipe error" at line 558 in file /slots/03/dir_2866/userdir/src/condor_shadow.V6.1/pseudo_ops.cpp
10/08/14 14:27:33 Initializing a VANILLA shadow for job 860002.0
10/08/14 14:27:33 (860002.0) (49167): Request to run on slot1@xxxxxxxxxxxxxx <130.a.b.c:45410> was ACCEPTED
10/08/14 14:27:33 (860002.0) (49167): File transfer completed successfully.
10/08/14 14:37:13 (860002.0) (49167): get_file(): Failed to open file /var/spool/arc/grid02/U7JMDmQKivknE6QDjqmt6UqoABFKDmABFKDmAVMKDmABFKDmQRxEln/Rq5Ib2.out, errno = 13: Permission denied.
10/08/14 14:37:13 (860002.0) (49167): get_file(): consumed 57681 bytes of file transmission
10/08/14 14:37:13 (860002.0) (49167): DoDownload: consuming rest of transfer and failing after encountering the following error: SHADOW at 130.a.b.c failed to write to file /var/spool/arc/grid02/U7JMDmQKivknE6QDjqmt6UqoABFKDmABFKDmAVMKDmABFKDmQRxEln/Rq5Ib2.out: (errno 13) Permission denied
10/08/14 14:37:13 (860002.0) (49167): get_file(): Failed to open file /var/spool/arc/grid02/U7JMDmQKivknE6QDjqmt6UqoABFKDmABFKDmAVMKDmABFKDmQRxEln/Rq5Ib2.err, errno = 13: Permission denied.
10/08/14 14:37:13 (860002.0) (49167): get_file(): consumed 0 bytes of file transmission
10/08/14 14:37:13 (860002.0) (49167): DoDownload: consuming rest of transfer and failing after encountering the following error: SHADOW at 130.a.b.c failed to write to file /var/spool/arc/grid02/U7JMDmQKivknE6QDjqmt6UqoABFKDmABFKDmAVMKDmABFKDmQRxEln/Rq5Ib2.err: (errno 13) Permission denied
10/08/14 14:37:13 (860002.0) (49167): Mock terminating job 860002.0: exited_by_signal=FALSE, exit_code=0 OR exit_signal=0, core_dumped=FALSE, exit_reason="Exited normally"
10/08/14 14:37:13 (860002.0) (49167): File transfer failed (status=0).
10/08/14 14:37:13 (860002.0) (49167): Job 860002.0 going into Hold state (code 12,13): Error from slot1@xxxxxxxxxxxxxx: STARTER at 130.a.b.c failed to send file(s) to <130.a.b.c:54052>; SHADOW at 130.a.b.c failed to write to file /var/spool/arc/grid02/U7JMDmQKivknE6QDjqmt6UqoABFKDmABFKDmAVMKDmABFKDmQRxEln/Rq5Ib2.err: (errno 13) Permission denied
10/08/14 14:37:13 (860002.0) (49167): **** condor_shadow (condor_SHADOW) pid 49167 EXITING WITH STATUS 112
[2]
10/07/14 12:43:40 (pid:18932) ******************************************************
10/07/14 12:43:40 (pid:18932) ** condor_starter (CONDOR_STARTER) STARTING UP
10/07/14 12:43:40 (pid:18932) ** /usr/sbin/condor_starter
10/07/14 12:43:40 (pid:18932) ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
10/07/14 12:43:40 (pid:18932) ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
10/07/14 12:43:40 (pid:18932) ** $CondorVersion: 8.2.2 Aug 07 2014 BuildID: 265643 $
10/07/14 12:43:40 (pid:18932) ** $CondorPlatform: x86_64_RedHat6 $
10/07/14 12:43:40 (pid:18932) ** PID = 18932
10/07/14 12:43:40 (pid:18932) ** Log last touched 10/7 12:39:50
10/07/14 12:43:40 (pid:18932) ******************************************************
10/07/14 12:43:40 (pid:18932) Using config source: /etc/condor/condor_config
10/07/14 12:43:40 (pid:18932) Using local config sources:
10/07/14 12:43:40 (pid:18932) /etc/condor/config.d/10security.config
10/07/14 12:43:40 (pid:18932) /etc/condor/config.d/20wn.config
10/07/14 12:43:40 (pid:18932) /etc/condor/config.d/20wn_startd_history.config
10/07/14 12:43:40 (pid:18932) /etc/condor/config.d/21slots.config
10/07/14 12:43:40 (pid:18932) /etc/condor/config.d/31cgroups.config
10/07/14 12:43:40 (pid:18932) /etc/condor/config.d/32mountunderscratch.config
10/07/14 12:43:40 (pid:18932) /etc/condor/condor_config.local
10/07/14 12:43:40 (pid:18932) config Macros = 128, Sorted = 127, StringBytes = 4666, TablesBytes = 4704
10/07/14 12:43:40 (pid:18932) CLASSAD_CACHING is OFF
10/07/14 12:43:40 (pid:18932) Daemon Log is logging: D_ALWAYS D_ERROR
10/07/14 12:43:40 (pid:18932) DaemonCore: command socket at <130.a.b.c:38881>
10/07/14 12:43:40 (pid:18932) DaemonCore: private command socket at <130.a.b.c:38881>
10/07/14 12:43:40 (pid:18932) Communicating with shadow <130.a.b.c:34061?noUDP>
10/07/14 12:43:40 (pid:18932) Submitting machine is "arc-ce04.domain"
10/07/14 12:43:40 (pid:18932) setting the orig job name in starter
10/07/14 12:43:40 (pid:18932) setting the orig job iwd in starter
10/07/14 12:43:40 (pid:18932) Chirp config summary: IO false, Updates false, Delayed updates true.
10/07/14 12:43:40 (pid:18932) Initialized IO Proxy.
10/07/14 12:43:40 (pid:18932) Done setting resource limits
10/07/14 12:43:40 (pid:18932) File transfer completed successfully.
10/07/14 12:43:41 (pid:18932) Job 860002.0 set to execute immediately
10/07/14 12:43:41 (pid:18932) Starting a VANILLA universe job with ID: 860002.0
10/07/14 12:43:41 (pid:18932) IWD: /pool/condor/dir_18932
10/07/14 12:43:41 (pid:18932) Output file: /pool/condor/dir_18932/_condor_stdout
10/07/14 12:43:41 (pid:18932) Error file: /pool/condor/dir_18932/_condor_stdout
10/07/14 12:43:41 (pid:18932) Renice expr "10" evaluated to 10
10/07/14 12:43:41 (pid:18932) About to exec /usr/libexec/condor/condor_pid_ns_init condor_exec.exe
10/07/14 12:43:41 (pid:18932) Running job as user tlhcb005
10/07/14 12:43:41 (pid:18932) Create_Process succeeded, pid=18953
10/08/14 14:26:26 (pid:18932) Process exited, pid=18953, status=0
10/08/14 14:26:26 (pid:18932) Close_Pipe on invalid pipe end: 0
10/08/14 14:26:26 (pid:18932) ERROR "Close_Pipe error" at line 2089 in file /slots/04/dir_38687/userdir/src/condor_daemon_core.V6/daemon_core.cpp
10/08/14 14:26:26 (pid:18932) ShutdownFast all jobs.