Hi all,
We have two HTCondor pools and flock jobs from one cluster to the
other. The submit node runs 9.1.2, while the worker nodes we flock
to run 9.0.13. I am trying to use condor_ssh_to_job on a running
flocked job in the other pool. The jobs run inside a Docker
container as user nobody.
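The jobs are submitted roughly like this (a minimal sketch; the
image and executable names are placeholders, not our real ones):

    # submit file sketch, placeholder names
    universe     = docker
    docker_image = placeholder/image:latest
    executable   = job.sh
    queue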
When I use condor_ssh_to_job as the root user on the submit
machine, it works fine and I end up inside the Docker container,
independent of who submitted the job.
When an ordinary user tries to ssh into a flocked job, they
eventually get "Failed to connect to starter". condor_ssh_to_job
works fine within the cluster where the job was submitted.
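For concreteness, both cases are just the plain invocation (the job
ID here is taken from the starter log below):

    # as root on the submit node: works, I end up in the container
    condor_ssh_to_job 1540489.0

    # as ordinary user mschnepf: hangs, then "Failed to connect to starter"
    condor_ssh_to_job 1540489.0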
I looked at the StarterLog (see below), and it seems the starter
gets stuck for ordinary users: after "Created security session for
job owner", it only queries docker periodically and nothing else
happens. When root runs condor_ssh_to_job, the starter runs a
"docker exec -it ..." after "Created security session for job
owner".
Could this be an authentication problem? I did not find any
security message in the logs that looks problematic.
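To get more detail on the next attempt, I plan to raise the starter
debug level on the execute node; a config sketch (D_SECURITY should
log the security session handling):

    # appended to the execute node's condor configuration
    STARTER_DEBUG = $(STARTER_DEBUG) D_FULLDEBUG D_SECURITY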
Best regards,
Matthias
StarterLog excerpt, ordinary user (mschnepf):
08/09/22 15:53:03 (pid:3549512) Created security
session for job owner (mschnepf@xxxxxxxxxxx).
08/09/22 15:53:06 (pid:3549512) condor_read(): Socket closed
when trying to read 1 bytes from Docker Socket
08/09/22 15:53:06 (pid:3549512) sendDockerAPIRequest(GET
/containers/HTCJob1540489_0_slot1_2_PID3549512/stats?stream=0
HTTP/1.0
) = HTTP/1.0 200 OK
Api-Version: 1.41
Content-Type: application/json
Docker-Experimental: false
Ostype: linux
Server: Docker/20.10.12 (linux)
Date: Tue, 09 Aug 2022 13:53:06 GMT
{"read":"2022-08-09T13:53:06.189708447Z","preread":"2022-08-09T13:53:05.184780604Z","pids_stats":{"current":2},"blkio_stats":{"io_service_bytes_recursive":[{"major":259,"minor":2,"op":"Read","value":3321856},{"major":259,"minor":2,"op":"Write","value":0},{"major":259,"minor":2,"op":"Sync","value":0},{"major":259,"m
inor":2,"op":"Async","value":3321856},{"major":259,"minor":2,"op":"Total","value":3321856},{"major":253,"minor":2,"op":"Read","value":3321856},{"major":253,"minor":2,"op":"Write","value":0},{"major":253,"minor":2,"op":"Sync","value":0},{"major":253,"minor":2,"op":"Async","value":3321856},{"major":253,"minor":2,"op"
:"Total","value":3321856}],"io_serviced_recursive":[{"major":259,"minor":2,"op":"Read","value":70},{"major":259,"minor":2,"op":"Write","value":0},{"major":259,"minor":2,"op":"Sync","value":0},{"major":259,"minor":2,"op":"Async","value":70},{"major":259,"minor":2,"op":"Total","value":70},{"major":253,"minor":2,"op":
"Read","value":70},{"major":253,"minor":2,"op":"Write","value":0},{"major":253,"minor":2,"op":"Sync","value":0},{"major":253,"minor":2,"op":"Async","value":70},{"major":253,"minor":2,"op":"Total","value":70}],"io_queue_recursive":[],"io_service_time_recursive":[],"io_wait_time_recursive":[],"io_merged_recursive":[]
,"io_time_recursive":[],"sectors_recursive":[]},"num_procs":0,"storage_stats":{},"cpu_stats":{"cpu_usage":{"total_usage":95595265,"percpu_usage":[0,1167765,636512,2089669,0,603600,0,0,0,0,0,0,120723,3280308,2509666,885261,114234,2095969,223605,108435,153023,0,145928,3664141,0,68348505,1848270,0,0,0,0,0,0,0,0,0,0,58
24522,475131,1218407,81591,0,0,0,0,0,0,0],"usage_in_kernelmode":40000000,"usage_in_usermode":50000000},"system_cpu_usage":54830782350000000,"online_cpus":48,"throttling_data":{"periods":0,"throttled_periods":0,"throttled_time":0}},"precpu_stats":{"cpu_usage":{"total_usage":95595265,"percpu_usage":[0,1167765,636512,
2089669,0,603600,0,0,0,0,0,0,120723,3280308,2509666,885261,114234,2095969,223605,108435,153023,0,145928,3664141,0,68348505,1848270,0,0,0,0,0,0,0,0,0,0,5824522,475131,1218407,81591,0,0,0,0,0,0,0],"usage_in_kernelmode":40000000,"usage_in_usermode":50000000},"system_cpu_usage":54830734180000000,"online_cpus":48,"throt
tling_data":{"periods":0,"throttled_periods":0,"throttled_time":0}},"memory_stats":{"usage":3321856,"max_usage":10293248,"stats":{"active_anon":286720,"active_file":823296,"cache":3035136,"dirty":0,"hierarchical_memory_limit":3145728000,"hierarchical_memsw_limit":6291456000,"inactive_anon":0,"inactive_file":2211840
,"mapped_file":1232896,"pgfault":3968,"pgmajfault":28,"pgpgin":1868,"pgpgout":1057,"rss":286720,"rss_huge":0,"total_active_anon":286720,"total_active_file":823296,"total_cache":3035136,"total_dirty":0,"total_inactive_anon":0,"total_inactive_file":2211840,"total_mapped_file":1232896,"total_pgfault":0,"total_pgmajfau
lt":0,"total_pgpgin":0,"total_pgpgout":0,"total_rss":286720,"total_rss_huge":0,"total_unevictable":0,"total_writeback":0,"unevictable":0,"writeback":0},"limit":3145728000},"name":"/HTCJob1540489_0_slot1_2_PID3549512","id":"386bfb25118e13fe30ae1e629705cb64903e866138a8fc6e756b063e388cf183","networks":{"eth0":{"rx_byt
es":746,"rx_packets":7,"rx_errors":0,"rx_dropped":0,"tx_bytes":656,"tx_packets":8,"tx_errors":0,"tx_dropped":0}}}
08/09/22 15:53:06 (pid:3549512) docker stats reports max_usage
is 286720 rx_bytes is 746 tx_bytes is 656 usage_in_usermode is
50000000 usage_in-sysmode is 40000000
08/09/22 15:53:12 (pid:3549512) condor_read(): Socket closed
when trying to read 1 bytes from Docker Socket
08/09/22 15:53:12 (pid:3549512) sendDockerAPIRequest(GET
/containers/HTCJob1540489_0_slot1_2_PID3549512/stats?stream=0
HTTP/1.0

StarterLog excerpt, user condor:
08/09/22 15:42:16 (pid:3545932) Created security
session for job owner (condor@xxxxxxxxxxx).
08/09/22 15:42:16 (pid:3545932) DockerProc::PublishToEnv()
08/09/22 15:42:16 (pid:3545932) AssignedGPUs environment proto
'GPU_DEVICE_ORDINAL=/(CUDA|OCL)// CUDA_VISIBLE_DEVICES=/CUDA//'
08/09/22 15:42:16 (pid:3545932) AssignedGPUs environment
'GPU_DEVICE_ORDINAL' pattern: /(CUDA|OCL)//
CUDA_VISIBLE_DEVICES=/CUDA//
08/09/22 15:42:16 (pid:3545932) AssignedGPUs environment
'GPU_DEVICE_ORDINAL' no-match of pattern: (CUDA|OCL)
08/09/22 15:42:16 (pid:3545932) AssignedGPUs environment proto
'CUDA_VISIBLE_DEVICES=/CUDA//'
08/09/22 15:42:16 (pid:3545932) AssignedGPUs environment
'CUDA_VISIBLE_DEVICES' pattern: /CUDA//
08/09/22 15:42:16 (pid:3545932) AssignedGPUs environment
'CUDA_VISIBLE_DEVICES' no-match of pattern: CUDA
08/09/22 15:42:16 (pid:3545932) Checking preferred shells:
/bin/bash
08/09/22 15:42:16 (pid:3545932) Will use shell /bin/bash
08/09/22 15:42:16 (pid:3545932) StartSSHD:
session_dir='/var/lib/condor/execute/dir_3545932/.condor_ssh_to_job_1'
08/09/22 15:42:16 (pid:3545932) Setting
LD_PRELOAD=/usr/lib64/condor/libgetpwnam.so for sshd
08/09/22 15:42:16 (pid:3545932) In OsProc::OsProc()
08/09/22 15:42:16 (pid:3545932) Main job KillSignal: 15
(SIGTERM)
08/09/22 15:42:16 (pid:3545932) Main job RmKillSignal: 15
(SIGTERM)
08/09/22 15:42:16 (pid:3545932) Main job HoldKillSignal: 15
(SIGTERM)
08/09/22 15:42:16 (pid:3545932) in SSHDProc::StartJob()
08/09/22 15:42:16 (pid:3545932) in VanillaProc::StartJob()
08/09/22 15:42:16 (pid:3545932) Requesting cgroup
htcondor/condor_var_lib_condor_execute_slot1_1@xxxxxxxxxxxxxxxxxxxxxxx/sshd
for job.
08/09/22 15:42:16 (pid:3545932) Value of RequestedChroot is
unset.
08/09/22 15:42:16 (pid:3545932) Adding mapping:
/var/lib/condor/execute/dir_3545932/tmp/ -> /tmp.
08/09/22 15:42:16 (pid:3545932) Checking the mapping of mount
point /tmp.
08/09/22 15:42:16 (pid:3545932) Current mount, /, is shared.
08/09/22 15:42:16 (pid:3545932) Adding mapping:
/var/lib/condor/execute/dir_3545932/var/tmp/ -> /var/tmp.
08/09/22 15:42:16 (pid:3545932) Checking the mapping of mount
point /var/tmp.
08/09/22 15:42:16 (pid:3545932) Current mount, /var, is shared.
08/09/22 15:42:16 (pid:3545932) PID namespace option: false
08/09/22 15:42:16 (pid:3545932) in OsProc::StartJob()
08/09/22 15:42:16 (pid:3545932) IWD:
/var/lib/condor/execute/dir_3545932
08/09/22 15:42:16 (pid:3545932) DockerProc::PublishToEnv()
08/09/22 15:42:16 (pid:3545932) AssignedGPUs environment proto
'GPU_DEVICE_ORDINAL=/(CUDA|OCL)// CUDA_VISIBLE_DEVICES=/CUDA//'
08/09/22 15:42:16 (pid:3545932) AssignedGPUs environment
'GPU_DEVICE_ORDINAL' pattern: /(CUDA|OCL)//
CUDA_VISIBLE_DEVICES=/CUDA//
08/09/22 15:42:16 (pid:3545932) AssignedGPUs environment
'GPU_DEVICE_ORDINAL' no-match of pattern: (CUDA|OCL)
08/09/22 15:42:16 (pid:3545932) AssignedGPUs environment proto
'CUDA_VISIBLE_DEVICES=/CUDA//'
08/09/22 15:42:16 (pid:3545932) AssignedGPUs environment
'CUDA_VISIBLE_DEVICES' pattern: /CUDA//
08/09/22 15:42:16 (pid:3545932) AssignedGPUs environment
'CUDA_VISIBLE_DEVICES' no-match of pattern: CUDA
08/09/22 15:42:16 (pid:3545932) Error file:
/var/lib/condor/execute/dir_3545932/.condor_ssh_to_job_1/sshd.log
08/09/22 15:42:16 (pid:3545932) Renice expr "10" evaluated to 10
08/09/22 15:42:16 (pid:3545932) Env =
_CONDOR_JOB_IWD=/var/lib/condor/execute/dir_3545932
CUDA_VISIBLE_DEVICES=10000 _CONDOR_SHELL=/bin/bash
_CONDOR_SLOT=slot1_1 OPENBLAS_NUM_THREADS=1
TF_LOOP_PARALLEL_ITERATIONS=1 NUMEXPR_NUM_THREADS=1 TMPDIR=/tmp
TEMP=/tmp GPU_DEVICE_ORDINAL=10000
_CHIRP_DELAYED_UPDATE_PREFIX=Chi
rp* _CONDOR_SCRATCH_DIR=/var/lib/condor/execute/dir_3545932
CUBACORES=1 BATCH_SYSTEM=HTCondor _CONDOR_AssignedGPUs=10000
GOMAXPROCS=1 OMP_THREAD_LIMIT=1 TMP=/tmp
_CONDOR_WRAPPER_ERROR_FILE=/var/lib/condor/execute/dir_3545932/.job_wrapper_failure
_CONDOR_SLOT_NAME=slot1@xxxxxxxxxxxxxxxxxxxxxxx
JULIA_NUM_THREADS=1 _C
MKL_NUM_THREADS=1 OMP_NUM_THREADS=1
TF_NUM_THREADS=1 _CONDOR_JOB_PIDS=3545960
_CONDOR_CHIRP_CONFIG=/var/lib/condor/execute/dir_3545932/.chirp.config
LD_PRELOAD=/usr/lib64/condor/libgetpwnam.so
_CONDOR_JOB_AD=/var/lib/condor/execute/dir_3545932/.job.ad
_CONDOR_MACHINE_AD=/var/lib/condor/execute/di
r_3545932/.machine.ad
08/09/22 15:42:16 (pid:3545932) ENFORCE_CPU_AFFINITY not true,
not setting affinity
08/09/22 15:42:16 (pid:3545932) Running job as user nobody
08/09/22 15:42:16 (pid:3545932) Using wrapper
/usr/libexec/condor/jobwrapper.sh to exec /usr/sbin/sshd -i -e
-f
/var/lib/condor/execute/dir_3545932/.condor_ssh_to_job_1/sshd_config
08/09/22 15:42:16 (pid:3546902) track_family_via_cgroup:
Tracking PID 3546902 via cgroup
htcondor/condor_var_lib_condor_execute_slot1_1@xxxxxxxxxxxxxxxxxxxxxxx/sshd.
08/09/22 15:42:16 (pid:3546902) About to tell ProcD to track
family with root 3546902 via cgroup
htcondor/condor_var_lib_condor_execute_slot1_1@xxxxxxxxxxxxxxxxxxxxxxx/sshd
08/09/22 15:42:16 (pid:3546902) Mounting /dev/shm as a private
mount successful.
08/09/22 15:42:17 (pid:3545932) Create_Process succeeded,
pid=3546902
08/09/22 15:42:17 (pid:3545932) Initializing cgroup library.
08/09/22 15:42:17 (pid:3545932) Limiting (soft) memory usage to
0 bytes
08/09/22 15:42:17 (pid:3545932) Limiting memsw usage to
9223372036854775807 bytes
08/09/22 15:42:17 (pid:3545932) Limiting (hard) memory usage to
404154744832 bytes
08/09/22 15:42:17 (pid:3545932) Limiting (soft) memory usage to
3145728000 bytes
08/09/22 15:42:17 (pid:3545932) Subscribed the starter to OOM
notification for this cgroup; jobs triggering an OOM will be put
on hold.
08/09/22 15:42:17 (pid:3545932) Process exited, pid=3546890,
status=0
08/09/22 15:42:17 (pid:3545932) Reaper: all=2 handled=0
ShuttingDown=0
08/09/22 15:42:17 (pid:3545932) unhandled job exit: pid=3546890,
status=0
08/09/22 15:42:17 (pid:3545932) Accepted new connection from ssh
client for docker job
08/09/22 15:42:17 (pid:3545932) DockerProc::PublishToEnv()
08/09/22 15:42:17 (pid:3545932) AssignedGPUs environment proto
'GPU_DEVICE_ORDINAL=/(CUDA|OCL)// CUDA_VISIBLE_DEVICES=/CUDA//'
08/09/22 15:42:17 (pid:3545932) AssignedGPUs environment
'GPU_DEVICE_ORDINAL' pattern: /(CUDA|OCL)//
CUDA_VISIBLE_DEVICES=/CUDA//
08/09/22 15:42:17 (pid:3545932) AssignedGPUs environment
'GPU_DEVICE_ORDINAL' no-match of pattern: (CUDA|OCL)
08/09/22 15:42:17 (pid:3545932) AssignedGPUs environment proto
'CUDA_VISIBLE_DEVICES=/CUDA//'
08/09/22 15:42:17 (pid:3545932) AssignedGPUs environment
'CUDA_VISIBLE_DEVICES' pattern: /CUDA//
08/09/22 15:42:17 (pid:3545932) AssignedGPUs environment
'CUDA_VISIBLE_DEVICES' no-match of pattern: CUDA
08/09/22 15:42:17 (pid:3545932) adding 27 environment vars to
docker args
08/09/22 15:42:17 (pid:3545932) execing:
/etc/condor/scripts/docker_wrapper.py exec -ti -e
_CONDOR_JOB_IWD=/var/lib/condor/execute/dir_3545932 -e
CUDA_VISIBLE_DEVICES=10000 -e _CONDOR_SLOT=slot1_1 -e
OPENBLAS_NUM_THREADS=1 -e TF_LOOP_PARALLEL_ITERATIONS=1 -e
NUMEXPR_NUM_THREADS=1 -e TMPDIR=/tmp -e TEMP=/tmp -e GPU_
DEVICE_ORDINAL=10000 -e _CHIRP_DELAYED_UPDATE_PREFIX=Chirp* -e
_CONDOR_SCRATCH_DIR=/var/lib/condor/execute/dir_3545932 -e
CUBACORES=1 -e BATCH_SYSTEM=HTCondor -e
_CONDOR_AssignedGPUs=10000 -e GOMAXPROCS=1 -e OMP_THREAD_LIMIT=1
-e TMP=/tmp -e
_CONDOR_WRAPPER_ERROR_FILE=/var/lib/condor/execute/dir_3545932/.job_wrappe
r_failure -e JULIA_NUM_THREADS=1 -e _CONDOR_BIN=/usr/bin -e
MKL_NUM_THREADS=1 -e OMP_NUM_THREADS=1 -e TF_NUM_THREADS=1 -e
_CONDOR_JOB_PIDS=3545960\ 3546902 -e
_CONDOR_CHIRP_CONFIG=/var/lib/condor/execute/dir_3545932/.chirp.config
-e _CONDOR_JOB_AD=/var/lib/condor/execute/dir_3545932/.job.ad -e
_CONDOR_MACHINE_AD=/v
ar/lib/condor/execute/dir_3545932/.machine.ad
HTCJob1540488_0_slot1_1_PID3545932 /bin/bash -i
08/09/22 15:42:17 (pid:3545932) docker exec returned 0 for pid
3546935