Hi,

in our condor set-up we start condor as root with the usual systemd service file; it subsequently drops privileges to become the local user 'condor'. We have also moved this user's directories to /local/condor, mimicking the directory layout usually found under /var/lib/condor.

However, we may be missing something: on a host that we are currently restarting peacefully, we see error messages like

Failed to open '.update.ad' to read update ad: Permission denied (13).
Failed to rename(/local/condor/execute/dir_17864/core,/local/condor/execute/dir_17864/core.402177.0): errno 13 (Permission denied)

and the like (full StarterLog attached for easier reading).

Is this something we should worry about, or is it "just noise"?

At first glance, permissions look OK, e.g.

ls -ld /local/condor/*
drwxr-xr-x   2 condor condor   10 Feb 13 15:48 /local/condor/ViewHist
drwxr-xr-x   2 condor condor   10 Feb 13 15:48 /local/condor/ckpt
drwxr-xr-x   2 condor condor   10 Mar 16 09:28 /local/condor/cred_dir
drwxr-xr-x 119 condor condor 4096 May 11 09:07 /local/condor/execute
drwxr-xr-x   3 condor condor   40 Mar 16 09:28 /local/condor/spool

and

ls -ld /local/condor/execute/*|head -n 5
drwx------ 4 condor condor 288 May 11 09:07 /local/condor/execute/dir_1082
drwx------ 4 condor condor 288 May 11 09:07 /local/condor/execute/dir_11017
drwx------ 4 condor condor 288 May 11 09:07 /local/condor/execute/dir_1149
drwx------ 4 condor condor 288 May 11 09:07 /local/condor/execute/dir_11880
drwx------ 4 condor condor 288 May 11 09:07 /local/condor/execute/dir_12025

Obviously, as the job is already gone, /local/condor/execute/dir_17864 does not exist anymore.

Does anyone have a hint as to what is wrong here?

Cheers

carsten

--
Dr. Carsten Aulbert, Max Planck Institute for Gravitational Physics,
Callinstraße 38, 30167 Hannover, Germany
Phone: +49 511 762 17185
05/11/20 06:09:47 (pid:17864) ******************************************************
05/11/20 06:09:47 (pid:17864) ** condor_starter (CONDOR_STARTER) STARTING UP
05/11/20 06:09:47 (pid:17864) ** /usr/sbin/condor_starter
05/11/20 06:09:47 (pid:17864) ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
05/11/20 06:09:47 (pid:17864) ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
05/11/20 06:09:47 (pid:17864) ** $CondorVersion: 8.8.8 Mar 20 2020 BuildID: Debian-8.8.8-1 PackageID: 8.8.8-1 Debian-8.8.8-1 $
05/11/20 06:09:47 (pid:17864) ** $CondorPlatform: X86_64-Debian_10 $
05/11/20 06:09:47 (pid:17864) ** PID = 17864
05/11/20 06:09:47 (pid:17864) ** Log last touched 5/11 06:09:18
05/11/20 06:09:47 (pid:17864) ******************************************************
05/11/20 06:09:47 (pid:17864) Using config source: /etc/condor/condor_config
05/11/20 06:09:47 (pid:17864) Using local config sources:
05/11/20 06:09:47 (pid:17864) /etc/condor/config.d/01_generic
05/11/20 06:09:47 (pid:17864) /etc/condor/config.d/10_EXECUTE
05/11/20 06:09:47 (pid:17864) /etc/condor/condor_config.local
05/11/20 06:09:47 (pid:17864) config Macros = 100, Sorted = 99, StringBytes = 2506, TablesBytes = 3656
05/11/20 06:09:47 (pid:17864) CLASSAD_CACHING is OFF
05/11/20 06:09:47 (pid:17864) Daemon Log is logging: D_ALWAYS D_ERROR
05/11/20 06:09:47 (pid:17864) SharedPortEndpoint: waiting for connections to named socket 50180_1972_1711
05/11/20 06:09:47 (pid:17864) DaemonCore: command socket at <10.10.37.5:9618?addrs=10.10.37.5-9618&noUDP&sock=50180_1972_1711>
05/11/20 06:09:47 (pid:17864) DaemonCore: private command socket at <10.10.37.5:9618?addrs=10.10.37.5-9618&noUDP&sock=50180_1972_1711>
05/11/20 06:09:47 (pid:17864) Communicating with shadow <10.20.30.17:9618?addrs=10.20.30.17-9618&noUDP&sock=3248571_e61f_507703>
05/11/20 06:09:47 (pid:17864) Submitting machine is "condor2.atlas.local"
05/11/20 06:09:47 (pid:17864) setting the orig job name in starter
05/11/20 06:09:47 (pid:17864) setting the orig job iwd in starter
05/11/20 06:09:47 (pid:17864) Job has WantIOProxy=true
05/11/20 06:09:47 (pid:17864) Chirp config summary: IO true, Updates true, Delayed updates true.
05/11/20 06:09:47 (pid:17864) Initialized IO Proxy.
05/11/20 06:09:47 (pid:17864) Done setting resource limits
05/11/20 06:09:47 (pid:17864) File transfer completed successfully.
05/11/20 06:09:47 (pid:17864) Job 402177.0 set to execute immediately
05/11/20 06:09:47 (pid:17864) Starting a VANILLA universe job with ID: 402177.0
05/11/20 06:09:47 (pid:17864) Current mount, /tmp, is shared.
05/11/20 06:09:47 (pid:17864) Current mount, /var, is shared.
05/11/20 06:09:47 (pid:17864) IWD: /work/USER/[......]
05/11/20 06:09:47 (pid:17864) Output file: /local/condor/execute/dir_17864/_condor_stdout
05/11/20 06:09:47 (pid:17864) Error file: /local/condor/execute/dir_17864/_condor_stderr
05/11/20 06:09:47 (pid:17864) Renice expr "0" evaluated to 0
05/11/20 06:09:47 (pid:17864) Running job as user USER
05/11/20 06:09:47 (pid:17864) About to exec /usr/bin/../bin/pegasus-kickstart [long command line deleted]
05/11/20 06:09:47 (pid:17864) Create_Process succeeded, pid=17868
05/11/20 08:50:12 (pid:17864) Failed to open '.update.ad' to read update ad: Permission denied (13).
05/11/20 08:50:12 (pid:17864) Failed to open '.update.ad' to read update ad: Permission denied (13).
05/11/20 08:53:17 (pid:17864) Process exited, pid=17868, status=0
05/11/20 08:53:17 (pid:17864) Failed to rename(/local/condor/execute/dir_17864/core.17868,/local/condor/execute/dir_17864/core.402177.0): errno 13 (Permission denied)
05/11/20 08:53:17 (pid:17864) Failed to rename(/local/condor/execute/dir_17864/core,/local/condor/execute/dir_17864/core.402177.0): errno 13 (Permission denied)
05/11/20 08:53:17 (pid:17864) Failed to open '.update.ad' to read update ad: Permission denied (13).
05/11/20 08:53:17 (pid:17864) ReliSock: put_file: Failed to open file /local/condor/execute/dir_17864/_condor_stdout, errno = 13.
05/11/20 08:53:17 (pid:17864) ReliSock: put_file: Failed to open file /local/condor/execute/dir_17864/_condor_stderr, errno = 13.
05/11/20 08:53:17 (pid:17864) DoUpload: (Condor error code 13, subcode 13) STARTER at 10.10.37.5 failed to send file(s) to <10.20.30.17:9618>: error reading from /local/condor/execute/dir_17864/_condor_stdout: (errno 13) Permission denied; SHADOW failed to receive file(s) from <10.10.37.5:43947>
05/11/20 08:53:17 (pid:17864) JICShadow::notifyJobTermination(): Sending mock terminate event.
05/11/20 08:53:17 (pid:17864) JIC::transferOutput() failed, waiting for job lease to expire or for a reconnect attempt
05/11/20 08:53:17 (pid:17864) Returning from CStarter::JobReaper()
05/11/20 08:53:17 (pid:17864) Connection to shadow may be lost, will test by sending whoami request.
05/11/20 08:53:17 (pid:17864) condor_write(): Socket closed when trying to write 21 bytes to <10.20.30.17:29033>, fd is 9
05/11/20 08:53:17 (pid:17864) Buf::write(): condor_write() failed
05/11/20 08:53:17 (pid:17864) i/o error result is 0, errno is 0
05/11/20 08:53:17 (pid:17864) Lost connection to shadow, waiting 2400 secs for reconnect
05/11/20 08:53:17 (pid:17864) Got SIGQUIT. Performing fast shutdown.
05/11/20 08:53:17 (pid:17864) ShutdownFast all jobs.
05/11/20 08:53:17 (pid:17864) Failed to open '.update.ad' to read update ad: Permission denied (13).
05/11/20 08:53:17 (pid:17864) Failed to open '.update.ad' to read update ad: Permission denied (13).
05/11/20 08:53:17 (pid:17864) Failed to send job exit status to shadow
05/11/20 08:53:17 (pid:17864) All jobs have exited... starter exiting
05/11/20 08:53:17 (pid:17864) **** condor_starter (condor_STARTER) pid 17864 EXITING WITH STATUS 0
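
To pull only the permission-related entries out of a StarterLog like the one above, something along these lines might help. It is a rough sketch, not part of condor: the names `LINE_RE`, `EACCES_RE` and `permission_errors` are made up for illustration, and the line format is assumed to be the "date time (pid:N) message" layout seen in this log.

```python
import re

# Assumed line layout, taken from the log above: "MM/DD/YY HH:MM:SS (pid:N) message"
LINE_RE = re.compile(r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \(pid:(\d+)\) (.*)$")

# errno 13 shows up in several spellings in the log above
EACCES_RE = re.compile(r"Permission denied|errno\s*=?\s*13\b|error code 13\b")


def permission_errors(lines):
    """Yield (timestamp, pid, message) for each line that reports errno 13."""
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m and EACCES_RE.search(m.group(3)):
            yield m.group(1), int(m.group(2)), m.group(3)


# Three lines copied from the log above; only the last two report errno 13
sample = [
    "05/11/20 06:09:47 (pid:17864) Running job as user USER",
    "05/11/20 08:50:12 (pid:17864) Failed to open '.update.ad' to read update ad: Permission denied (13).",
    "05/11/20 08:53:17 (pid:17864) ReliSock: put_file: Failed to open file "
    "/local/condor/execute/dir_17864/_condor_stdout, errno = 13.",
]

hits = list(permission_errors(sample))
for timestamp, pid, message in hits:
    print(timestamp, pid, message)
```

For a real log, pass an open file object instead of `sample`, for example `permission_errors(open(path))`, where `path` is wherever the StarterLog lives on the execute node.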