Hi Tim,

Thanks for the answer, but that was not it. The real cause was that condor was trying to write the job's error files directly in the / directory, where it obviously does not have write permission. Adding “initial_dir: /tmp” made it write the job log files in /tmp and fixed the issue.

Thanks,
Gaëtan
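For anyone hitting the same error, here is a rough sketch of what a submit description with an explicit initial directory could look like. It is not the exact file used here: the job values (docker universe, python:3.8.10, /bin/sleep 30) are taken from the docker create line in the StarterLog quoted further down in the thread, the output/error/log file names are hypothetical, and `initialdir` is assumed to be the submit-language spelling of the `initial_dir` setting mentioned above.

```python
# Sketch: render an HTCondor submit description with an explicit initial directory.
# Assumptions: the file names job.out / job.err / job.log are hypothetical, and the
# job values mirror the docker create line in the StarterLog from this thread.

def render_submit(settings):
    """Turn a dict of submit commands into submit-description text."""
    lines = [f"{key} = {value}" for key, value in settings.items()]
    lines.append("queue")
    return "\n".join(lines) + "\n"


job = {
    "universe": "docker",
    "docker_image": "python:3.8.10",
    "executable": "/bin/sleep",
    "arguments": "30",
    # In this thread the output files ended up under "/" until this was set.
    "initialdir": "/tmp",
    "output": "job.out",
    "error": "job.err",
    "log": "job.log",
}

submit_text = render_submit(job)
print(submit_text)
```

The printed text could be saved to a file and passed to condor_submit. Any directory that the submitting user can write to should work in place of /tmp.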
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx>
On Behalf Of Tim Theisen via HTCondor-users
CAUTION: This email originated from outside of Terma. Do not click links or open attachments unless you recognize the sender and know the content is safe.

Hello Gaetan,

It looks like condor has not been added to the docker group.

$ sudo usermod -aG docker condor

...Tim
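Before changing group membership, it may be worth checking whether the condor user is already in the docker group. The snippet below is a minimal sketch using Python's standard `grp` and `pwd` modules (Unix only); the user and group names are the ones from the suggestion above, and the helper name `user_in_group` is made up for illustration.

```python
import grp
import pwd


def user_in_group(user, group):
    """Return True if `user` belongs to `group`, as a supplementary or primary group."""
    try:
        group_entry = grp.getgrnam(group)
    except KeyError:
        return False  # the group does not exist on this host
    if user in group_entry.gr_mem:
        return True  # listed as a supplementary member
    try:
        # gr_mem does not list users whose primary group is this group.
        return pwd.getpwnam(user).pw_gid == group_entry.gr_gid
    except KeyError:
        return False  # the user does not exist on this host


print(user_in_group("condor", "docker"))
```

This only reads the group database; it does not say whether an already running condor daemon has picked up a recent change.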
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Gaetan Geffroy <gage@xxxxxxxxx>

Hi Greg,

I did not touch that part of the config file. Here is what it looks like by default:

## Where have you installed the bin, sbin and lib condor directories?
RELEASE_DIR = /usr

## Where is the local condor directory for each host? This is where the local config file(s), logs and
## spool/execute directories are located. this is the default for Linux and Unix systems.
LOCAL_DIR = /var

[…]

## Pathnames
RUN = $(LOCAL_DIR)/run/condor
LOG = $(LOCAL_DIR)/log/condor
LOCK = $(LOCAL_DIR)/lock/condor
SPOOL = $(LOCAL_DIR)/lib/condor/spool
EXECUTE = $(LOCAL_DIR)/lib/condor/execute
BIN = $(RELEASE_DIR)/bin
LIB = $(RELEASE_DIR)/lib64/condor
INCLUDE = $(RELEASE_DIR)/include/condor
SBIN = $(RELEASE_DIR)/sbin
LIBEXEC = $(RELEASE_DIR)/libexec/condor
SHARE = $(RELEASE_DIR)/share/condor

The StarterLog.slot1 looks like this:

03/02/23 16:25:52 (pid:10199) Submitting machine is "localhost.localdomain"
03/02/23 16:25:52 (pid:10199) setting the orig job name in starter
03/02/23 16:25:52 (pid:10199) setting the orig job iwd in starter
03/02/23 16:25:52 (pid:10199) Chirp config summary: IO false, Updates false, Delayed updates true.
03/02/23 16:25:52 (pid:10199) Initialized IO Proxy.
03/02/23 16:25:52 (pid:10199) Done setting resource limits
03/02/23 16:25:52 (pid:10199) Set filetransfer runtime ads to /var/lib/condor/execute/dir_10199/.job.ad and /var/lib/condor/execute/dir_10199/.machine.ad.
03/02/23 16:25:52 (pid:10199) File transfer completed successfully.
03/02/23 16:25:53 (pid:10199) Job 4.0 set to execute immediately
03/02/23 16:25:53 (pid:10199) Starting a VANILLA universe job with ID: 4.0
03/02/23 16:25:53 (pid:10199) Adding /var/lib/condor/execute/dir_10199/tmp/:/tmp as a docker volume to mount under scratch
03/02/23 16:25:53 (pid:10199) Adding /var/lib/condor/execute/dir_10199/var/tmp/:/var/tmp as a docker volume to mount under scratch
03/02/23 16:25:53 (pid:10199) About to exec docker:/bin/sleep
03/02/23 16:25:53 (pid:10199) Found 0 entries in docker image cache.
03/02/23 16:25:53 (pid:10199) Attempting to run: /usr/bin/docker create --cpu-shares=100 --memory=5982m --cap-drop=all --security-opt no-new-privileges --hostname condor-4.0-dedamme-pi03.darmstadt.terma.com --name HTCJob4_0_slot1_PID10199 --label=org.htcondorproject=True -e TF_NUM_THREADS=1 -e TEMP=/var/lib/condor/execute/dir_10199 -e OMP_NUM_THREADS=1 -e BATCH_SYSTEM=HTCondor -e TMPDIR=/var/lib/condor/execute/dir_10199 -e _CONDOR_JOB_PIDS= -e CUBACORES=1 -e OMP_THREAD_LIMIT=1 -e _CHIRP_DELAYED_UPDATE_PREFIX=Chirp* -e _CONDOR_JOB_IWD=/var/lib/condor/execute/dir_10199 -e TMP=/var/lib/condor/execute/dir_10199 -e OPENBLAS_NUM_THREADS=1 -e JULIA_NUM_THREADS=1 -e _CONDOR_BIN=/usr/bin -e _CONDOR_JOB_AD=/var/lib/condor/execute/dir_10199/.job.ad -e GOMAXPROCS=1 -e _CONDOR_SCRATCH_DIR=/var/lib/condor/execute/dir_10199 -e _CONDOR_SLOT=slot1 -e NUMEXPR_NUM_THREADS=1 -e _CONDOR_CHIRP_CONFIG=/var/lib/condor/execute/dir_10199/.chirp.config -e _CONDOR_MACHINE_AD=/var/lib/condor/execute/dir_10199/.machine.ad -e MKL_NUM_THREADS=1 -e TF_LOOP_PARALLEL_ITERATIONS=1 --volume /var/lib/condor/execute/dir_10199:/var/lib/condor/execute/dir_10199 --volume /var/lib/condor/execute/dir_10199/tmp/:/tmp --volume /var/lib/condor/execute/dir_10199/var/tmp/:/var/tmp --workdir /var/lib/condor/execute/dir_10199 --user 64:64 --group-add 64 python:3.8.10 /bin/sleep 30
03/02/23 16:25:54 (pid:10199) Create_Process succeeded, pid=10206
03/02/23 16:25:54 (pid:10199) Process exited, pid=10206, status=0
03/02/23 16:25:54 (pid:10199) Output file: /var/lib/condor/execute/dir_10199/_condor_stdout
03/02/23 16:25:54 (pid:10199) Error file: /var/lib/condor/execute/dir_10199/_condor_stderr
03/02/23 16:25:54 (pid:10199) Runnning: /usr/bin/docker start -a HTCJob4_0_slot1_PID10199
03/02/23 16:25:55 (pid:10199) unhandled job exit: pid=10206, status=0
03/02/23 16:26:26 (pid:10199) Process exited, pid=10217, status=0
03/02/23 16:26:27 (pid:10199) DoUpload: (Condor error code 12, subcode 13) STARTER at 127.0.0.1 failed to send file(s) to <127.0.0.1:9618>; SHADOW at 127.0.0.1 failed to write to file //docker_stderror: (errno 13) Permission denied
03/02/23 16:26:27 (pid:10199) JICShadow::notifyJobTermination(): Sending mock terminate event.
03/02/23 16:26:27 (pid:10199) JIC::transferOutput() failed, waiting for job lease to expire or for a reconnect attempt
03/02/23 16:26:27 (pid:10199) Returning from Starter::JobReaper()
03/02/23 16:26:27 (pid:10199) Got SIGQUIT. Performing fast shutdown.
03/02/23 16:26:27 (pid:10199) ShutdownFast all jobs.
03/02/23 16:26:27 (pid:10199) condor_read(): Socket closed abnormally when trying to read 5 bytes from <127.0.0.1:41778>, errno=104 Connection reset by peer
03/02/23 16:26:27 (pid:10199) i/o error result is 0, errno is 104
03/02/23 16:26:27 (pid:10199) Lost connection to shadow, waiting 2400 secs for reconnect
03/02/23 16:26:27 (pid:10199) DockerProc::JobExit() container 'HTCJob4_0_slot1_PID10199'
03/02/23 16:26:29 (pid:10199) RPC error: disconnected from shadow
03/02/23 16:26:29 (pid:10199) Failed to send job exit status to shadow
03/02/23 16:26:29 (pid:10199) All jobs have exited... starter exiting
03/02/23 16:26:29 (pid:10199) **** condor_starter (condor_STARTER) pid 10199 EXITING WITH STATUS 0

Thanks,
Gaëtan
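The line that matters in a log like the one above is the DoUpload failure, and it is easy to miss among the rest. As a minimal sketch (the regular expression is only fitted to the single failure line shown in this thread, so other HTCondor versions may word it differently), a few lines of Python can pull out the path and errno:

```python
import re

# Fitted to the "failed to write to file" message seen in the StarterLog above.
WRITE_FAILURE = re.compile(
    r"failed to write to file (?P<path>\S+?):\s*\(errno (?P<errno>\d+)\) (?P<reason>.+)$"
)


def find_write_failures(log_text):
    """Return (path, errno, reason) for each write failure found in the log text."""
    results = []
    for line in log_text.splitlines():
        match = WRITE_FAILURE.search(line)
        if match:
            results.append(
                (match["path"], int(match["errno"]), match["reason"].strip())
            )
    return results


# Two lines copied from the StarterLog in this thread.
sample = (
    "03/02/23 16:26:27 (pid:10199) DoUpload: (Condor error code 12, subcode 13) "
    "STARTER at 127.0.0.1 failed to send file(s) to <127.0.0.1:9618>; "
    "SHADOW at 127.0.0.1 failed to write to file //docker_stderror: "
    "(errno 13) Permission denied\n"
    "03/02/23 16:26:27 (pid:10199) ShutdownFast all jobs.\n"
)

failures = find_write_failures(sample)
print(failures)
```

The extracted path, //docker_stderror, sits directly under the root directory, which is the clue that the job's initial directory was / rather than a writable location.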
From: Greg Thain <gthain@xxxxxxxxxxx>
On 2/24/23 5:28 AM, Gaetan Geffroy wrote:
The docker_stderror file should be written in the job's scratch directory, which should be under the machine config parameter $(EXECUTE). Is it possible that this knob is set incorrectly? Otherwise, we'll need to see the StarterLog.slotXXX file to see what's going on.

-greg
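On a live machine, running condor_config_val EXECUTE reports the value of this knob. To illustrate how the default definitions quoted elsewhere in this thread resolve, here is a minimal Python sketch of $(NAME) macro expansion. It is a toy that handles only plain references, not the full HTCondor configuration language, and the helper name `expand` is made up for illustration.

```python
import re

# Matches simple $(NAME) references only.
MACRO = re.compile(r"\$\((\w+)\)")


def expand(name, config, depth=0):
    """Recursively expand $(NAME) references in config[name]."""
    if depth > 20:
        raise ValueError(f"macro nesting too deep while expanding {name}")
    return MACRO.sub(
        lambda m: expand(m.group(1), config, depth + 1), config[name]
    )


# Default values as quoted from the config file in this thread.
config = {
    "LOCAL_DIR": "/var",
    "LOG": "$(LOCAL_DIR)/log/condor",
    "SPOOL": "$(LOCAL_DIR)/lib/condor/spool",
    "EXECUTE": "$(LOCAL_DIR)/lib/condor/execute",
}

execute_dir = expand("EXECUTE", config)
print(execute_dir)  # /var/lib/condor/execute
```

The result matches the scratch paths in the StarterLog (/var/lib/condor/execute/dir_10199), which suggests EXECUTE was set as expected.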