
Re: [HTCondor-users] Error with the Docker universe



Hi Greg,

 

I did not touch that part of the config file. Here is what it looks like by default:

##  Where have you installed the bin, sbin and lib condor directories?

RELEASE_DIR = /usr

 

##  Where is the local condor directory for each host?  This is where the local config file(s), logs and

##  spool/execute directories are located. this is the default for Linux and Unix systems.

LOCAL_DIR = /var

[…]

##  Pathnames

RUN     = $(LOCAL_DIR)/run/condor

LOG     = $(LOCAL_DIR)/log/condor

LOCK    = $(LOCAL_DIR)/lock/condor

SPOOL   = $(LOCAL_DIR)/lib/condor/spool

EXECUTE = $(LOCAL_DIR)/lib/condor/execute

BIN     = $(RELEASE_DIR)/bin

LIB     = $(RELEASE_DIR)/lib64/condor

INCLUDE = $(RELEASE_DIR)/include/condor

SBIN    = $(RELEASE_DIR)/sbin

LIBEXEC = $(RELEASE_DIR)/libexec/condor

SHARE   = $(RELEASE_DIR)/share/condor
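
To make the resolved value of EXECUTE explicit, here is a minimal Python sketch that expands the `$(NAME)` references from the excerpt above. The `expand` helper and the `config` dictionary are hypothetical illustrations, not HTCondor's own parser, and only a subset of the knobs is copied. On a real machine, `condor_config_val EXECUTE` reports the value the daemons actually use.

```python
import re

# Values copied from the config excerpt above (a subset of the knobs).
config = {
    "RELEASE_DIR": "/usr",
    "LOCAL_DIR": "/var",
    "SPOOL": "$(LOCAL_DIR)/lib/condor/spool",
    "EXECUTE": "$(LOCAL_DIR)/lib/condor/execute",
    "LIBEXEC": "$(RELEASE_DIR)/libexec/condor",
}

# Matches a macro reference of the form $(NAME).
MACRO = re.compile(r"\$\((\w+)\)")


def expand(name, table, depth=0):
    """Hypothetical helper: recursively substitute $(NAME) references."""
    if depth > 20:
        raise ValueError("macro nesting too deep (possible cycle)")
    return MACRO.sub(lambda m: expand(m.group(1), table, depth + 1), table[name])


execute_dir = expand("EXECUTE", config)
print(execute_dir)  # /var/lib/condor/execute
```

With these defaults EXECUTE resolves to /var/lib/condor/execute, which matches the dir_10199 scratch paths in the StarterLog.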

 

The StarterLog.slot1 looks like this:

 

03/02/23 16:25:52 (pid:10199) Submitting machine is "localhost.localdomain"

03/02/23 16:25:52 (pid:10199) setting the orig job name in starter

03/02/23 16:25:52 (pid:10199) setting the orig job iwd in starter

03/02/23 16:25:52 (pid:10199) Chirp config summary: IO false, Updates false, Delayed updates true.

03/02/23 16:25:52 (pid:10199) Initialized IO Proxy.

03/02/23 16:25:52 (pid:10199) Done setting resource limits

03/02/23 16:25:52 (pid:10199) Set filetransfer runtime ads to /var/lib/condor/execute/dir_10199/.job.ad and /var/lib/condor/execute/dir_10199/.machine.ad.

03/02/23 16:25:52 (pid:10199) File transfer completed successfully.

03/02/23 16:25:53 (pid:10199) Job 4.0 set to execute immediately

03/02/23 16:25:53 (pid:10199) Starting a VANILLA universe job with ID: 4.0

03/02/23 16:25:53 (pid:10199) Adding /var/lib/condor/execute/dir_10199/tmp/:/tmp as a docker volume to mount under scratch

03/02/23 16:25:53 (pid:10199) Adding /var/lib/condor/execute/dir_10199/var/tmp/:/var/tmp as a docker volume to mount under scratch

03/02/23 16:25:53 (pid:10199) About to exec docker:/bin/sleep

03/02/23 16:25:53 (pid:10199) Found 0 entries in docker image cache.

03/02/23 16:25:53 (pid:10199) Attempting to run: /usr/bin/docker create --cpu-shares=100 --memory=5982m --cap-drop=all --security-opt no-new-privileges --hostname condor-4.0-dedamme-pi03.darmstadt.terma.com --name HTCJob4_0_slot1_PID10199 --label=org.htcondorproject=True -e TF_NUM_THREADS=1 -e TEMP=/var/lib/condor/execute/dir_10199 -e OMP_NUM_THREADS=1 -e BATCH_SYSTEM=HTCondor -e TMPDIR=/var/lib/condor/execute/dir_10199 -e _CONDOR_JOB_PIDS= -e CUBACORES=1 -e OMP_THREAD_LIMIT=1 -e _CHIRP_DELAYED_UPDATE_PREFIX=Chirp* -e _CONDOR_JOB_IWD=/var/lib/condor/execute/dir_10199 -e TMP=/var/lib/condor/execute/dir_10199 -e OPENBLAS_NUM_THREADS=1 -e JULIA_NUM_THREADS=1 -e _CONDOR_BIN=/usr/bin -e _CONDOR_JOB_AD=/var/lib/condor/execute/dir_10199/.job.ad -e GOMAXPROCS=1 -e _CONDOR_SCRATCH_DIR=/var/lib/condor/execute/dir_10199 -e _CONDOR_SLOT=slot1 -e NUMEXPR_NUM_THREADS=1 -e _CONDOR_CHIRP_CONFIG=/var/lib/condor/execute/dir_10199/.chirp.config -e _CONDOR_MACHINE_AD=/var/lib/condor/execute/dir_10199/.machine.ad -e MKL_NUM_THREADS=1 -e TF_LOOP_PARALLEL_ITERATIONS=1 --volume /var/lib/condor/execute/dir_10199:/var/lib/condor/execute/dir_10199 --volume /var/lib/condor/execute/dir_10199/tmp/:/tmp --volume /var/lib/condor/execute/dir_10199/var/tmp/:/var/tmp --workdir /var/lib/condor/execute/dir_10199 --user 64:64 --group-add 64 python:3.8.10 /bin/sleep 30

03/02/23 16:25:54 (pid:10199) Create_Process succeeded, pid=10206

03/02/23 16:25:54 (pid:10199) Process exited, pid=10206, status=0

03/02/23 16:25:54 (pid:10199) Output file: /var/lib/condor/execute/dir_10199/_condor_stdout

03/02/23 16:25:54 (pid:10199) Error file: /var/lib/condor/execute/dir_10199/_condor_stderr

03/02/23 16:25:54 (pid:10199) Runnning: /usr/bin/docker start -a HTCJob4_0_slot1_PID10199

03/02/23 16:25:55 (pid:10199) unhandled job exit: pid=10206, status=0

03/02/23 16:26:26 (pid:10199) Process exited, pid=10217, status=0

03/02/23 16:26:27 (pid:10199) DoUpload: (Condor error code 12, subcode 13) STARTER at 127.0.0.1 failed to send file(s) to <127.0.0.1:9618>; SHADOW at 127.0.0.1 failed to write to file //docker_stderror: (errno 13) Permission denied

03/02/23 16:26:27 (pid:10199) JICShadow::notifyJobTermination(): Sending mock terminate event.

03/02/23 16:26:27 (pid:10199) JIC::transferOutput() failed, waiting for job lease to expire or for a reconnect attempt

03/02/23 16:26:27 (pid:10199) Returning from Starter::JobReaper()

03/02/23 16:26:27 (pid:10199) Got SIGQUIT.  Performing fast shutdown.

03/02/23 16:26:27 (pid:10199) ShutdownFast all jobs.

03/02/23 16:26:27 (pid:10199) condor_read(): Socket closed abnormally when trying to read 5 bytes from <127.0.0.1:41778>, errno=104 Connection reset by peer

03/02/23 16:26:27 (pid:10199) i/o error result is 0, errno is 104

03/02/23 16:26:27 (pid:10199) Lost connection to shadow, waiting 2400 secs for reconnect

03/02/23 16:26:27 (pid:10199) DockerProc::JobExit() container 'HTCJob4_0_slot1_PID10199'

03/02/23 16:26:29 (pid:10199) RPC error: disconnected from shadow

03/02/23 16:26:29 (pid:10199) Failed to send job exit status to shadow

03/02/23 16:26:29 (pid:10199) All jobs have exited... starter exiting

03/02/23 16:26:29 (pid:10199) **** condor_starter (condor_STARTER) pid 10199 EXITING WITH STATUS 0
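
To pick the failing write out of a log like this, here is a small Python sketch that extracts which side reported the failure, the target path, and the errno. The regular expression and the `parse_write_failure` helper are assumptions based only on the wording of the DoUpload line above, not on any documented HTCondor log format.

```python
import re

# Line copied from the StarterLog above.
line = (
    "03/02/23 16:26:27 (pid:10199) DoUpload: (Condor error code 12, subcode 13) "
    "STARTER at 127.0.0.1 failed to send file(s) to <127.0.0.1:9618>; "
    "SHADOW at 127.0.0.1 failed to write to file //docker_stderror: "
    "(errno 13) Permission denied"
)

# Assumed message shape: "<SIDE> at <host> failed to write to file <path>: (errno N) <text>"
PATTERN = re.compile(
    r"(?P<side>STARTER|SHADOW) at \S+ failed to write to file "
    r"(?P<path>\S+): \(errno (?P<errno>\d+)\) (?P<msg>.+)$"
)


def parse_write_failure(text):
    """Hypothetical helper: return details of a failed write, or None."""
    m = PATTERN.search(text)
    if m is None:
        return None
    return {
        "side": m.group("side"),
        "path": m.group("path"),
        "errno": int(m.group("errno")),
        "msg": m.group("msg"),
    }


failure = parse_write_failure(line)
print(failure)
```

For the line above, the side is SHADOW and the path is //docker_stderror, so the write that failed was attempted by the shadow, not by the starter inside the execute directory.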

 

Thanks,

Gaëtan

 


Gaetan Geffroy
Junior Software Engineer
Terma GmbH
T +49 6151 86005 43 (direct)
 


 

From: Greg Thain <gthain@xxxxxxxxxxx>
Sent: Thursday, March 2, 2023 04:27
To: Gaetan Geffroy <gage@xxxxxxxxx>; HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Error with the Docker universe

 


 

On 2/24/23 5:28 AM, Gaetan Geffroy wrote:

 

Thanks for the answer. Adding should_transfer_files worked!

But now I am left with an even more obscure error…

Hold reason: Error from slot1@mymachine: STARTER at 127.0.0.1 failed to send file(s) to <127.0.0.1:9618>; SHADOW at 127.0.0.1 failed to write to file //docker_stderror: (errno 13) Permission denied

 

 

The docker_stderror file should be written in the job's scratch directory, which should be under the machine config parameter $(EXECUTE).  Is it possible that this knob is set incorrectly?  Otherwise, we'll need to see the StarterLog.slotXXX file to see what's going on.

 

-greg