Hi all,
On some of our worker nodes, docker jobs did not start. In
fulldebug mode, the starter log showed:
...
04/04/18 03:49:11 (pid:14497) (D_ALWAYS:2) Completed
DC_CHILDALIVE to daemon at <129.13.101.177:9618>
04/04/18 03:49:11 (pid:14500) (D_ALWAYS:2) Sending GoAhead for
129.13.101.141 to send
/var/lib/condor/execute/dir_14497/condor_exec.exe and all
further files.
04/04/18 03:49:11 (pid:14497) (D_ALWAYS:2) DaemonCore: Leaving
SendAliveToParent() - success
04/04/18 03:49:11 (pid:14500) (D_ALWAYS:2) Received GoAhead from
peer to receive
/var/lib/condor/execute/dir_14497/condor_exec.exe.
04/04/18 03:49:11 (pid:14500) (D_ALWAYS:2) get_file(): going to
write to filename
/var/lib/condor/execute/dir_14497/condor_exec.exe
04/04/18 03:49:11 (pid:14500) (D_ALWAYS:2) get_file: Receiving
2127 bytes
04/04/18 03:49:11 (pid:14500) (D_ALWAYS:2) get_file: wrote 2127
bytes to file
04/04/18 03:49:11 (pid:14500) (D_ALWAYS:2)
ReliSock::get_file_with_permissions(): going to set permissions
777
04/04/18 03:49:11 (pid:14497) (D_ALWAYS:2) DaemonCore: No more
children processes to reap.
04/04/18 03:49:11 (pid:14497) (D_ALWAYS) File transfer completed
successfully.
04/04/18 03:49:11 (pid:14497) (D_ALWAYS:2) Calling client
FileTransfer handler function.
04/04/18 03:49:11 (pid:14497) (D_ALWAYS:2) HOOK_PREPARE_JOB not
configured.
04/04/18 03:49:11 (pid:14497) (D_ALWAYS) Job 568739.0 set to
execute immediately
04/04/18 03:49:11 (pid:14497) (D_ALWAYS) Starting a VANILLA
universe job with ID: 568739.0
04/04/18 03:49:11 (pid:14497) (D_ALWAYS:2) In OsProc::OsProc()
04/04/18 03:49:11 (pid:14497) (D_ALWAYS:2) Main job KillSignal:
15 (SIGTERM)
04/04/18 03:49:11 (pid:14497) (D_ALWAYS:2) Main job
RmKillSignal: 15 (SIGTERM)
04/04/18 03:49:11 (pid:14497) (D_ALWAYS:2) Main job
HoldKillSignal: 15 (SIGTERM)
04/04/18 03:49:11 (pid:14497) (D_ALWAYS:2) Cmd:
'condor_exec.exe'
04/04/18 03:49:11 (pid:14497) (D_ALWAYS:2) Input file: /dev/null
04/04/18 03:49:11 (pid:14497) (D_ALWAYS) Output file:
/var/lib/condor/execute/dir_14497/_condor_stdout
04/04/18 03:49:11 (pid:14497) (D_ALWAYS) Error file:
/var/lib/condor/execute/dir_14497/_condor_stderr
04/04/18 03:49:11 (pid:14497) (D_ALWAYS) Adding /cvmfs:/cvmfs as
a docker volume to mount
04/04/18 03:49:11 (pid:14497) (D_ALWAYS) About to exec
docker:./condor_exec.exe
04/04/18 03:49:11 (pid:14497) (D_ALWAYS:2) FileLock object is
updating timestamp on: /var/log/condor/.startd_docker_images
04/04/18 03:49:11 (pid:14497) (D_ALWAYS:2) FileLock::obtain(1) -
@1522828151.058259 lock on /var/log/condor/.startd_docker_images
now WRITE
04/04/18 03:49:11 (pid:14497) (D_ALWAYS) Found 32 entries in
docker image cache.
04/04/18 03:49:11 (pid:14497) (D_ALWAYS:2) Attempting to run:
/usr/bin/docker rmi \n
04/04/18 03:49:11 (pid:14497) (D_ALWAYS:2) Attempting to run:
'/usr/bin/docker images -q \n'.
04/04/18 03:49:11 (pid:14497) (D_ALWAYS:2) Attempting to run:
/usr/bin/docker rmi 0
04/04/18 03:49:11 (pid:14497) (D_ALWAYS:2) Attempting to run:
'/usr/bin/docker images -q <80>'.
Stack dump for process 14497 at timestamp 1522828152 (16 frames)
/lib64/libcondor_utils_8_6_5.so(dprintf_dump_stack+0x72)[0x7f1c7ce6ee32]
/lib64/libcondor_utils_8_6_5.so(_Z18linux_sig_coredumpi+0x24)[0x7f1c7cff9434]
/lib64/libpthread.so.0(+0xf5e0)[0x7f1c7b5485e0]
/lib64/libc.so.6(gsignal+0x37)[0x7f1c7b1ab1f7]
/lib64/libc.so.6(abort+0x148)[0x7f1c7b1ac8e8]
/lib64/libc.so.6(+0x74f47)[0x7f1c7b1eaf47]
/lib64/libc.so.6(+0x7c619)[0x7f1c7b1f2619]
/lib64/libcondor_utils_8_6_5.so(_ZN9DockerAPI3runERN14compat_classad7ClassAdES2_RKSsS4_S4_RK7ArgListRK3EnvS4_St4listISsSaISsEERiPiR11CondorError+0xeac)[0x7f1c7ce3334c]
condor_starter(_ZN10DockerProc8StartJobEv+0xb66)[0x4547b6]
condor_starter(_ZN8CStarter8SpawnJobEv+0xc3)[0x45b8c3]
condor_starter(_ZN8CStarter14SpawnPreScriptEv+0x197)[0x4598c7]
/lib64/libcondor_utils_8_6_5.so(_ZN12TimerManager7TimeoutEPiPd+0x182)[0x7f1c7cff8712]
/lib64/libcondor_utils_8_6_5.so(_ZN10DaemonCore6DriverEv+0x9cb)[0x7f1c7cfdc7fb]
/lib64/libcondor_utils_8_6_5.so(_Z7dc_mainiPPc+0x13a4)[0x7f1c7cffcaa4]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f1c7b197c05]
condor_starter[0x422840]
I removed the file /var/log/condor/.startd_docker_images on the
affected worker nodes, and docker jobs now run correctly again. I
have attached one of these files. On the affected nodes, the file
has some line breaks at the beginning.
I had removed some old docker images on those machines. Could this
have corrupted the .startd_docker_images file?
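
In case it helps anyone check their own nodes, here is a minimal
Python sketch that flags suspicious lines in the cache file. It
assumes the file holds one image name per line, which is only my
guess from the "docker rmi" calls in the log above; the function
names are made up for this example and are not part of HTCondor.

```python
from pathlib import Path

# Path taken from the starter log above.
CACHE_FILE = Path("/var/log/condor/.startd_docker_images")


def suspicious_lines(data: bytes):
    """Return (line_number, raw_line) pairs that look corrupted.

    Assumption (not confirmed): each line is one image name, so an
    empty line, or a line containing anything other than visible
    ASCII characters (0x21..0x7e), is treated as suspicious.
    """
    bad = []
    for number, line in enumerate(data.splitlines(), start=1):
        if not line or any(b < 0x21 or b > 0x7E for b in line):
            bad.append((number, line))
    return bad


def check_file(path: Path = CACHE_FILE) -> bool:
    """Print suspicious lines in the file; return True if it looks clean."""
    bad = suspicious_lines(path.read_bytes())
    for number, line in bad:
        print(f"{path}: line {number} looks corrupted: {line!r}")
    return not bad
```

Calling check_file() on a worker node reads the file only; it does
not modify or remove it.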
Cheers,
Matthias