
Re: [HTCondor-users] Error with the Docker universe



Hi Tim,

 

Thanks for the answer, but that was not it.

The actual cause was that condor was trying to write the error files directly into the / directory, where it does not have write permission.

Adding “initial_dir: /tmp” made it write the job log files in /tmp, which fixed the issue.
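
For anyone who finds this thread later, a minimal submit description along these lines shows where the setting goes. It is only a sketch under assumptions: the image, executable, arguments and the docker_stderror file name are taken from the StarterLog quoted further down, the output and log file names are hypothetical, and the directory is spelled initialdir as in the condor_submit manual.

```
# Sketch of a Docker universe submit description (not the exact file used here).
universe                = docker
docker_image            = python:3.8.10
executable              = /bin/sleep
arguments               = 30

should_transfer_files   = YES
when_to_transfer_output = ON_EXIT

# Relative file names below are resolved against this directory on the
# submit side, so it has to be writable by the user who owns the job.
initialdir              = /tmp

# docker_stderror is the name from the hold reason; the other two are made up.
error                   = docker_stderror
output                  = docker_stdout
log                     = docker_job.log

queue
```

Without initialdir, relative names resolve against the directory the job was submitted from. Here that was apparently /, which would explain the //docker_stderror path in the hold reason.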

 

Thanks,

Gaëtan


Gaetan Geffroy
Junior Software Engineer
Terma GmbH
T +49 6151 86005 43 (direct)
 


 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Tim Theisen via HTCondor-users
Sent: Friday, March 3, 2023 22:46
To: Greg Thain <gthain@xxxxxxxxxxx>; HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: Tim Theisen <tim@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Error with the Docker universe

 

Hello Gaetan,

 

It looks like condor has not been added to the docker group.

 

    $ sudo usermod -aG docker condor

 

...Tim

--
Tim Theisen (he, him, his)
Release Manager
HTCondor & Open Science Grid
Center for High Throughput Computing
Department of Computer Sciences
University of Wisconsin - Madison
4261 Computer Sciences and Statistics
1210 W Dayton St
Madison, WI 53706-1685
+1 608 265 5736


From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Gaetan Geffroy <gage@xxxxxxxxx>
Sent: Thursday, March 2, 2023 10:54 AM
To: Greg Thain <gthain@xxxxxxxxxxx>; HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Error with the Docker universe

 

Hi Greg,

 

I did not touch that part of the config file. Here is what it looks like by default:

##  Where have you installed the bin, sbin and lib condor directories?
RELEASE_DIR = /usr

##  Where is the local condor directory for each host?  This is where the local config file(s), logs and
##  spool/execute directories are located. this is the default for Linux and Unix systems.
LOCAL_DIR = /var

[…]

##  Pathnames
RUN     = $(LOCAL_DIR)/run/condor
LOG     = $(LOCAL_DIR)/log/condor
LOCK    = $(LOCAL_DIR)/lock/condor
SPOOL   = $(LOCAL_DIR)/lib/condor/spool
EXECUTE = $(LOCAL_DIR)/lib/condor/execute
BIN     = $(RELEASE_DIR)/bin
LIB     = $(RELEASE_DIR)/lib64/condor
INCLUDE = $(RELEASE_DIR)/include/condor
SBIN    = $(RELEASE_DIR)/sbin
LIBEXEC = $(RELEASE_DIR)/libexec/condor
SHARE   = $(RELEASE_DIR)/share/condor

 

The StarterLog.slot1 looks like this:

 

03/02/23 16:25:52 (pid:10199) Submitting machine is "localhost.localdomain"

03/02/23 16:25:52 (pid:10199) setting the orig job name in starter

03/02/23 16:25:52 (pid:10199) setting the orig job iwd in starter

03/02/23 16:25:52 (pid:10199) Chirp config summary: IO false, Updates false, Delayed updates true.

03/02/23 16:25:52 (pid:10199) Initialized IO Proxy.

03/02/23 16:25:52 (pid:10199) Done setting resource limits

03/02/23 16:25:52 (pid:10199) Set filetransfer runtime ads to /var/lib/condor/execute/dir_10199/.job.ad and /var/lib/condor/execute/dir_10199/.machine.ad.

03/02/23 16:25:52 (pid:10199) File transfer completed successfully.

03/02/23 16:25:53 (pid:10199) Job 4.0 set to execute immediately

03/02/23 16:25:53 (pid:10199) Starting a VANILLA universe job with ID: 4.0

03/02/23 16:25:53 (pid:10199) Adding /var/lib/condor/execute/dir_10199/tmp/:/tmp as a docker volume to mount under scratch

03/02/23 16:25:53 (pid:10199) Adding /var/lib/condor/execute/dir_10199/var/tmp/:/var/tmp as a docker volume to mount under scratch

03/02/23 16:25:53 (pid:10199) About to exec docker:/bin/sleep

03/02/23 16:25:53 (pid:10199) Found 0 entries in docker image cache.

03/02/23 16:25:53 (pid:10199) Attempting to run: /usr/bin/docker create --cpu-shares=100 --memory=5982m --cap-drop=all --security-opt no-new-privileges --hostname condor-4.0-dedamme-pi03.darmstadt.terma.com --name HTCJob4_0_slot1_PID10199 --label=org.htcondorproject=True -e TF_NUM_THREADS=1 -e TEMP=/var/lib/condor/execute/dir_10199 -e OMP_NUM_THREADS=1 -e BATCH_SYSTEM=HTCondor -e TMPDIR=/var/lib/condor/execute/dir_10199 -e _CONDOR_JOB_PIDS= -e CUBACORES=1 -e OMP_THREAD_LIMIT=1 -e _CHIRP_DELAYED_UPDATE_PREFIX=Chirp* -e _CONDOR_JOB_IWD=/var/lib/condor/execute/dir_10199 -e TMP=/var/lib/condor/execute/dir_10199 -e OPENBLAS_NUM_THREADS=1 -e JULIA_NUM_THREADS=1 -e _CONDOR_BIN=/usr/bin -e _CONDOR_JOB_AD=/var/lib/condor/execute/dir_10199/.job.ad -e GOMAXPROCS=1 -e _CONDOR_SCRATCH_DIR=/var/lib/condor/execute/dir_10199 -e _CONDOR_SLOT=slot1 -e NUMEXPR_NUM_THREADS=1 -e _CONDOR_CHIRP_CONFIG=/var/lib/condor/execute/dir_10199/.chirp.config -e _CONDOR_MACHINE_AD=/var/lib/condor/execute/dir_10199/.machine.ad -e MKL_NUM_THREADS=1 -e TF_LOOP_PARALLEL_ITERATIONS=1 --volume /var/lib/condor/execute/dir_10199:/var/lib/condor/execute/dir_10199 --volume /var/lib/condor/execute/dir_10199/tmp/:/tmp --volume /var/lib/condor/execute/dir_10199/var/tmp/:/var/tmp --workdir /var/lib/condor/execute/dir_10199 --user 64:64 --group-add 64 python:3.8.10 /bin/sleep 30

03/02/23 16:25:54 (pid:10199) Create_Process succeeded, pid=10206

03/02/23 16:25:54 (pid:10199) Process exited, pid=10206, status=0

03/02/23 16:25:54 (pid:10199) Output file: /var/lib/condor/execute/dir_10199/_condor_stdout

03/02/23 16:25:54 (pid:10199) Error file: /var/lib/condor/execute/dir_10199/_condor_stderr

03/02/23 16:25:54 (pid:10199) Runnning: /usr/bin/docker start -a HTCJob4_0_slot1_PID10199

03/02/23 16:25:55 (pid:10199) unhandled job exit: pid=10206, status=0

03/02/23 16:26:26 (pid:10199) Process exited, pid=10217, status=0

03/02/23 16:26:27 (pid:10199) DoUpload: (Condor error code 12, subcode 13) STARTER at 127.0.0.1 failed to send file(s) to <127.0.0.1:9618>; SHADOW at 127.0.0.1 failed to write to file //docker_stderror: (errno 13) Permission denied

03/02/23 16:26:27 (pid:10199) JICShadow::notifyJobTermination(): Sending mock terminate event.

03/02/23 16:26:27 (pid:10199) JIC::transferOutput() failed, waiting for job lease to expire or for a reconnect attempt

03/02/23 16:26:27 (pid:10199) Returning from Starter::JobReaper()

03/02/23 16:26:27 (pid:10199) Got SIGQUIT.  Performing fast shutdown.

03/02/23 16:26:27 (pid:10199) ShutdownFast all jobs.

03/02/23 16:26:27 (pid:10199) condor_read(): Socket closed abnormally when trying to read 5 bytes from <127.0.0.1:41778>, errno=104 Connection reset by peer

03/02/23 16:26:27 (pid:10199) i/o error result is 0, errno is 104

03/02/23 16:26:27 (pid:10199) Lost connection to shadow, waiting 2400 secs for reconnect

03/02/23 16:26:27 (pid:10199) DockerProc::JobExit() container 'HTCJob4_0_slot1_PID10199'

03/02/23 16:26:29 (pid:10199) RPC error: disconnected from shadow

03/02/23 16:26:29 (pid:10199) Failed to send job exit status to shadow

03/02/23 16:26:29 (pid:10199) All jobs have exited... starter exiting

03/02/23 16:26:29 (pid:10199) **** condor_starter (condor_STARTER) pid 10199 EXITING WITH STATUS 0

 

Thanks,

Gaëtan

 


Gaetan Geffroy
Junior Software Engineer
Terma GmbH
T +49 6151 86005 43 (direct)
 


 

From: Greg Thain <gthain@xxxxxxxxxxx>
Sent: Thursday, March 2, 2023 04:27
To: Gaetan Geffroy <gage@xxxxxxxxx>; HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Error with the Docker universe

 

 

On 2/24/23 5:28 AM, Gaetan Geffroy wrote:

 

Thanks for the answer. Adding should_transfer_files worked!

But now I am left with an even more obscure error…

Hold reason: Error from slot1@mymachine: STARTER at 127.0.0.1 failed to send file(s) to <127.0.0.1:9618>; SHADOW at 127.0.0.1 failed to write to file //docker_stderror: (errno 13) Permission denied

 

 

The docker_stderror file should be written in the job's scratch directory, which should be under the machine config parameter $(EXECUTE).  Is it possible that this knob is set incorrectly?  Otherwise, we'll need to see the StarterLog.slotXXX file to see what's going on.

 

-greg