On 10/25/2022 9:39 AM, Benoit Roland wrote:
Dear all,
I am having an issue using condor_ssh_to_job, which fails with
"ssh_exchange_identification: Connection closed by remote host".
I am using HTCondor version 9.0.14 for x86_64_CentOS7.
The situation is the following.
I am running HTCondor in an Apptainer container, and run a test
job in an Apptainer container via this HTCondor setup.
Hi Benoit,
Some questions and initial thoughts --
So you placed the HTCondor Execution Point (EP) daemons
(condor_startd etc.) inside of an Apptainer container. Question:
how did you create this container? Did you set up your
container image by using the get.htcondor.org tool, or by using
the official htcondor image, or by some other means? Does your
container include sshd? Could you (easily) try using HTCondor
v9.12.0+ and see if you have better results?
Next, you said you ran a test job in an Apptainer container via
this setup.... does this mean you submitted a vanilla universe
job against the EP running inside a container, or were there two
containers involved: the container running the EP daemons, and
then a second container that was your job which was running nested
inside the EP container? If the latter, how was the job container
launched -- via HTCondor's Singularity support or via a shell
script included within the job itself?
Also note in the release notes at
https://htcondor.readthedocs.io/en/latest/version-history/stable-release-series-90.html
the following entry for HTCondor v9.0.17: 'If “Singularity” is
really the “Apptainer” runtime, HTCondor now sets environment
variables to be passed to the job appropriately, which prevents
Apptainer from displaying ugly warnings about how this won’t work
in the future.' So support for HTCondor launching container
jobs correctly via Apptainer was not added until v9.0.17; setting
up ssh_to_job requires the ability to set environment variables,
which did not work correctly with Apptainer in v9.0.14, because
Apptainer is not backwards compatible with Singularity in this respect.
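As a rough illustration of that version cutoff, here is a minimal Python
sketch that parses a version banner and checks it against v9.0.17. The
banner format ("$CondorVersion: 9.0.14 ..."), the sample string, and the
helper names are assumptions for illustration only, and the comparison is
only meaningful within the 9.0.x stable series.

```python
import re

# First 9.0.x release whose notes mention the Apptainer environment fix.
APPTAINER_ENV_FIX = (9, 0, 17)

# Made-up sample banner (date and build ID are placeholders).
SAMPLE_BANNER = "$CondorVersion: 9.0.14 Jul 01 2022 BuildID: 000000 $"


def parse_condor_version(text):
    """Extract (major, minor, patch) from a '$CondorVersion: X.Y.Z' banner."""
    match = re.search(r"\$CondorVersion:\s*(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError("no $CondorVersion banner found")
    return tuple(int(part) for part in match.groups())


def has_apptainer_env_fix(version):
    """Return True/False for 9.0.x versions, None for any other series.

    Other release series are numbered independently, so a plain tuple
    comparison against 9.0.17 would not be meaningful for them.
    """
    if version[:2] != (9, 0):
        return None
    return version >= APPTAINER_ENV_FIX


version = parse_condor_version(SAMPLE_BANNER)
verdict = has_apptainer_env_fix(version)
```

To apply this to a real installation, pass the output of condor_version
in place of the sample string.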
You mentioned that the ".condor_ssh_to_job_1" dir does not exist -- if I
recall correctly, it only exists during the time a ssh_to_job is
in progress, so it may be difficult to "catch in the act" if it is
failing quickly. Perhaps in a test container you could wrap
/usr/sbin/sshd with a script that sleeps long enough for you to
inspect that directory and/or copies out the contents of that
directory. In any event, the sshd_config file is generated from the
template located at
/usr/lib64/condor/condor_ssh_to_job_sshd_config_template;
perhaps looking at that file will help you out.
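For the sshd wrapper idea, something along these lines might do as a
starting point in a test container. It is only a sketch under several
assumptions: that the real binary has been moved to /usr/sbin/sshd.real,
that the ".condor_ssh_to_job_*" directory lives under the job scratch
directory (taken from _CONDOR_SCRATCH_DIR if set, otherwise the current
directory), and that /tmp/ssh_to_job_debug is writable. The helper names
and the 120 second delay are made up for illustration.

```python
#!/usr/bin/env python3
import glob
import os
import shutil
import sys
import time

REAL_SSHD = "/usr/sbin/sshd.real"        # assumption: renamed real binary
SNAPSHOT_ROOT = "/tmp/ssh_to_job_debug"  # assumption: writable debug area
HOLD_SECONDS = 120                       # arbitrary pause for inspection


def find_ssh_to_job_dirs(scratch):
    """Return the .condor_ssh_to_job_* directories found under scratch."""
    pattern = os.path.join(scratch, ".condor_ssh_to_job_*")
    return sorted(p for p in glob.glob(pattern) if os.path.isdir(p))


def snapshot(src, dest_root):
    """Copy the regular files under src into a timestamped directory."""
    name = "{}.{}".format(os.path.basename(src), int(time.time()))
    dest = os.path.join(dest_root, name)
    for root, _dirs, files in os.walk(src):
        rel = os.path.relpath(root, src)
        target = os.path.normpath(os.path.join(dest, rel))
        os.makedirs(target, exist_ok=True)
        for entry in files:
            path = os.path.join(root, entry)
            # Skip sockets, FIFOs and symlinks; copy plain files only.
            if os.path.isfile(path) and not os.path.islink(path):
                shutil.copy2(path, os.path.join(target, entry))
    return dest


def main(argv):
    scratch = os.environ.get("_CONDOR_SCRATCH_DIR", os.getcwd())
    try:
        for directory in find_ssh_to_job_dirs(scratch):
            snapshot(directory, SNAPSHOT_ROOT)
    except OSError:
        pass  # debugging aid only; never prevent sshd from starting
    # Stay silent: stdout/stderr may be wired to the client connection.
    time.sleep(HOLD_SECONDS)
    os.execv(REAL_SSHD, [REAL_SSHD] + argv[1:])


# Only act as a wrapper when actually installed under the name "sshd".
if __name__ == "__main__" and os.path.basename(sys.argv[0]) == "sshd":
    main(sys.argv)
```

The snapshot is taken before the sleep so that a copy survives even if
the starter cleans up the directory while the wrapper is waiting.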
Hope the above is helpful,
regards,
Todd