Hi
Hi all,
We have two HTCondor pools and flock jobs from one cluster to the other. The submit node runs with 9.1.2, while the worker nodes we flock to run 9.0.13. I'll try condor_ssh_to_job to a running flocked job at the other pool. The jobs run inside a docker container as user nobody.
When I use condor_ssh_to_job as root user on the submit machine, it works fine, and I'm inside the docker container. Independent of whom submitted the job.
When an ordinary user tries to ssh into a flocked job, it gets after a while, "Failed to connect to starter". condor_ssh_to_job works fine within the cluster the job was submitted.
I looked at the StarterLog (see below), and it seems that it gets stuck by ordinary users. After "Created security session for job owner", the starter queries docker regularly but nothing else. After "Created security session for job owner" condor runs a "docker exec -it ..." when the user root runs condor_ssh_to_job.
Could this be a problem with authentication? I did not find any security message in the logs that looks problematic.
Best regards,
Matthias