Hello Gaetan,
It looks like the condor user has not been added to the docker group.
$ sudo usermod -aG docker condor
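To confirm that the group change took effect, the membership can be checked before resubmitting the job. The following is a minimal sketch, not an HTCondor tool: the helper names `in_group` and `user_in_group` are made up for illustration, and it only relies on the standard `id -nG` command.

```shell
#!/bin/sh
# Hypothetical helpers (not part of HTCondor or Docker).

# in_group GROUP "LIST": succeed if GROUP is one of the
# space-separated names in LIST (exact match, not substring).
in_group() {
    case " $2 " in
        *" $1 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# user_in_group USER GROUP: report whether USER belongs to GROUP
# according to `id -nG`. Returns 0 if yes, 1 if no, 2 if the user
# does not exist.
user_in_group() {
    user_groups=$(id -nG "$1" 2>/dev/null) || {
        echo "user '$1' not found"
        return 2
    }
    if in_group "$2" "$user_groups"; then
        echo "$1 is a member of $2"
    else
        echo "$1 is NOT a member of $2"
        return 1
    fi
}

# Example call on the execute node:
#   user_in_group condor docker
```

Already-running processes generally keep the group list they started with, so the condor service may need a restart before the new membership is used.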
...Tim
--
Tim Theisen (he, him, his)
Release Manager
HTCondor & Open Science Grid
Center for High Throughput Computing
Department of Computer Sciences
University of Wisconsin - Madison
4261 Computer Sciences and Statistics
1210 W Dayton St
Madison, WI 53706-1685
+1 608 265 5736

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Gaetan Geffroy <gage@xxxxxxxxx>
Sent: Thursday, March 2, 2023 10:54 AM
To: Greg Thain <gthain@xxxxxxxxxxx>; HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Error with the Docker universe

Hi Greg,
I did not touch that part of the config file. Here is what it looks like by default:

## Where have you installed the bin, sbin and lib condor directories?
RELEASE_DIR = /usr
## Where is the local condor directory for each host? This is where the local config file(s), logs and
## spool/execute directories are located. this is the default for Linux and Unix systems.
LOCAL_DIR = /var
[…]
## Pathnames
RUN = $(LOCAL_DIR)/run/condor
LOG = $(LOCAL_DIR)/log/condor
LOCK = $(LOCAL_DIR)/lock/condor
SPOOL = $(LOCAL_DIR)/lib/condor/spool
EXECUTE = $(LOCAL_DIR)/lib/condor/execute
BIN = $(RELEASE_DIR)/bin
LIB = $(RELEASE_DIR)/lib64/condor
INCLUDE = $(RELEASE_DIR)/include/condor
SBIN = $(RELEASE_DIR)/sbin
LIBEXEC = $(RELEASE_DIR)/libexec/condor
SHARE = $(RELEASE_DIR)/share/condor

The StarterLog.slot1 looks like this:
03/02/23 16:25:52 (pid:10199) Submitting machine is "localhost.localdomain"
03/02/23 16:25:52 (pid:10199) setting the orig job name in starter
03/02/23 16:25:52 (pid:10199) setting the orig job iwd in starter
03/02/23 16:25:52 (pid:10199) Chirp config summary: IO false, Updates false, Delayed updates true.
03/02/23 16:25:52 (pid:10199) Initialized IO Proxy.
03/02/23 16:25:52 (pid:10199) Done setting resource limits
03/02/23 16:25:52 (pid:10199) Set filetransfer runtime ads to /var/lib/condor/execute/dir_10199/.job.ad and /var/lib/condor/execute/dir_10199/.machine.ad.
03/02/23 16:25:52 (pid:10199) File transfer completed successfully.
03/02/23 16:25:53 (pid:10199) Job 4.0 set to execute immediately
03/02/23 16:25:53 (pid:10199) Starting a VANILLA universe job with ID: 4.0
03/02/23 16:25:53 (pid:10199) Adding /var/lib/condor/execute/dir_10199/tmp/:/tmp as a docker volume to mount under scratch
03/02/23 16:25:53 (pid:10199) Adding /var/lib/condor/execute/dir_10199/var/tmp/:/var/tmp as a docker volume to mount under scratch
03/02/23 16:25:53 (pid:10199) About to exec docker:/bin/sleep
03/02/23 16:25:53 (pid:10199) Found 0 entries in docker image cache.
03/02/23 16:25:53 (pid:10199) Attempting to run: /usr/bin/docker create --cpu-shares=100 --memory=5982m --cap-drop=all --security-opt no-new-privileges --hostname condor-4.0-dedamme-pi03.darmstadt.terma.com --name HTCJob4_0_slot1_PID10199 --label=org.htcondorproject=True -e TF_NUM_THREADS=1 -e TEMP=/var/lib/condor/execute/dir_10199 -e OMP_NUM_THREADS=1 -e BATCH_SYSTEM=HTCondor -e TMPDIR=/var/lib/condor/execute/dir_10199 -e _CONDOR_JOB_PIDS= -e CUBACORES=1 -e OMP_THREAD_LIMIT=1 -e _CHIRP_DELAYED_UPDATE_PREFIX=Chirp* -e _CONDOR_JOB_IWD=/var/lib/condor/execute/dir_10199 -e TMP=/var/lib/condor/execute/dir_10199 -e OPENBLAS_NUM_THREADS=1 -e JULIA_NUM_THREADS=1 -e _CONDOR_BIN=/usr/bin -e _CONDOR_JOB_AD=/var/lib/condor/execute/dir_10199/.job.ad -e GOMAXPROCS=1 -e _CONDOR_SCRATCH_DIR=/var/lib/condor/execute/dir_10199 -e _CONDOR_SLOT=slot1 -e NUMEXPR_NUM_THREADS=1 -e _CONDOR_CHIRP_CONFIG=/var/lib/condor/execute/dir_10199/.chirp.config -e _CONDOR_MACHINE_AD=/var/lib/condor/execute/dir_10199/.machine.ad -e MKL_NUM_THREADS=1 -e TF_LOOP_PARALLEL_ITERATIONS=1 --volume /var/lib/condor/execute/dir_10199:/var/lib/condor/execute/dir_10199 --volume /var/lib/condor/execute/dir_10199/tmp/:/tmp --volume /var/lib/condor/execute/dir_10199/var/tmp/:/var/tmp --workdir /var/lib/condor/execute/dir_10199 --user 64:64 --group-add 64 python:3.8.10 /bin/sleep 30
03/02/23 16:25:54 (pid:10199) Create_Process succeeded, pid=10206
03/02/23 16:25:54 (pid:10199) Process exited, pid=10206, status=0
03/02/23 16:25:54 (pid:10199) Output file: /var/lib/condor/execute/dir_10199/_condor_stdout
03/02/23 16:25:54 (pid:10199) Error file: /var/lib/condor/execute/dir_10199/_condor_stderr
03/02/23 16:25:54 (pid:10199) Runnning: /usr/bin/docker start -a HTCJob4_0_slot1_PID10199
03/02/23 16:25:55 (pid:10199) unhandled job exit: pid=10206, status=0
03/02/23 16:26:26 (pid:10199) Process exited, pid=10217, status=0
03/02/23 16:26:27 (pid:10199) DoUpload: (Condor error code 12, subcode 13) STARTER at 127.0.0.1 failed to send file(s) to <127.0.0.1:9618>; SHADOW at 127.0.0.1 failed to write to file //docker_stderror: (errno 13) Permission denied
03/02/23 16:26:27 (pid:10199) JICShadow::notifyJobTermination(): Sending mock terminate event.
03/02/23 16:26:27 (pid:10199) JIC::transferOutput() failed, waiting for job lease to expire or for a reconnect attempt
03/02/23 16:26:27 (pid:10199) Returning from Starter::JobReaper()
03/02/23 16:26:27 (pid:10199) Got SIGQUIT. Performing fast shutdown.
03/02/23 16:26:27 (pid:10199) ShutdownFast all jobs.
03/02/23 16:26:27 (pid:10199) condor_read(): Socket closed abnormally when trying to read 5 bytes from <127.0.0.1:41778>, errno=104 Connection reset by peer
03/02/23 16:26:27 (pid:10199) i/o error result is 0, errno is 104
03/02/23 16:26:27 (pid:10199) Lost connection to shadow, waiting 2400 secs for reconnect
03/02/23 16:26:27 (pid:10199) DockerProc::JobExit() container 'HTCJob4_0_slot1_PID10199'
03/02/23 16:26:29 (pid:10199) RPC error: disconnected from shadow
03/02/23 16:26:29 (pid:10199) Failed to send job exit status to shadow
03/02/23 16:26:29 (pid:10199) All jobs have exited... starter exiting
03/02/23 16:26:29 (pid:10199) **** condor_starter (condor_STARTER) pid 10199 EXITING WITH STATUS 0

Thanks, Gaëtan
From: Greg Thain <gthain@xxxxxxxxxxx>
On 2/24/23 5:28 AM, Gaetan Geffroy wrote:
The docker_stderror file should be written in the job's scratch directory, which should be under the machine config parameter $(EXECUTE). Is it possible that this knob is set incorrectly? Otherwise, we'll need to see the StarterLog.slotXXX file to see what's going on.
-greg