
Re: [HTCondor-users] question related to condor_ssh_to_job and container



On 10/25/2022 9:39 AM, Benoit Roland wrote:
Dear all,

I am having an issue using condor_ssh_to_job: it fails with "ssh_exchange_identification: Connection closed by remote host".

I am using the condor version 9.0.14 for x86_64_CentOS7.

The situation is the following.

I am running HTCondor in an Apptainer container, and running a test job in an Apptainer container via this HTCondor setup.


Hi Benoit,

Some questions and initial thoughts --

So you placed the HTCondor Execution Point (EP) daemons (condor_startd, etc.) inside an Apptainer container. How did you create this container? Did you set up your container image using the get.htcondor.org tool, the official htcondor image, or some other means? Does your container include sshd? Could you (easily) try HTCondor v9.12.0+ and see if you have better results?

Next, you said you ran a test job in an Apptainer container via this setup. Does this mean you submitted a vanilla universe job against the EP running inside a container, or were two containers involved: the container running the EP daemons, and a second container holding your job, nested inside the EP container? If the latter, how was the job container launched: via HTCondor's Singularity support, or via a shell script included within the job itself?

Also note the following entry for HTCondor v9.0.17 in the release notes at https://htcondor.readthedocs.io/en/latest/version-history/stable-release-series-90.html : 'If "Singularity" is really the "Apptainer" runtime, HTCondor now sets environment variables to be passed to the job appropriately, which prevents Apptainer from displaying ugly warnings about how this won't work in the future.' So support for HTCondor launching container jobs correctly via Apptainer was not added until v9.0.17. Setting up ssh_to_job requires the ability to set environment variables, which did not work correctly with Apptainer in v9.0.14, since Apptainer was not backwards compatible with Singularity in this respect.
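As a quick sanity check of the version point above, here is a minimal Python sketch that compares a reported version against 9.0.17. The function names are made up for illustration, and the sample string is only modelled on the wording in your message, not on real tool output. The comparison is only meaningful within one release series (9.0.x here); it says nothing about where the 9.x feature releases picked up the same change.

```python
import re


def parse_version(text):
    """Extract the first MAJOR.MINOR.PATCH triple found in text as a tuple of ints."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.groups())


def is_at_least(version_text, minimum_text):
    """Compare numerically, so that 9.0.9 correctly sorts before 9.0.17."""
    return parse_version(version_text) >= parse_version(minimum_text)


# Hypothetical input, modelled on the wording of the original message.
reported = "condor version 9.0.14 for x86_64_CentOS7"
print(is_at_least(reported, "9.0.17"))   # False
print(is_at_least("9.0.17", "9.0.17"))   # True
```

Comparing integer tuples rather than raw strings matters here, because a plain string comparison would rank "9.0.9" above "9.0.17".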

You mentioned that the ".condor_ssh_to_job_1" directory does not exist. If I recall correctly, it only exists while an ssh_to_job session is in progress, so it may be difficult to "catch in the act" if it is failing quickly. Perhaps in a test container you could wrap /usr/sbin/sshd with a script that sleeps long enough for you to inspect that directory and/or copies out its contents. In any event, the sshd_config file is generated from the template at /usr/lib64/condor/condor_ssh_to_job_sshd_config_template, so perhaps looking at that file will help you out.
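To make the wrapper idea concrete, here is a rough Python sketch for a test container only. It rests on several assumptions that I have not verified: that the real binary has been renamed to /usr/sbin/sshd.real (a made-up name), that sshd is started with its standard -f option pointing at a config file inside the temporary directory, that /tmp/ssh_to_job_debug is writable, and that the container has python3. It copies the directory rather than sleeping, so the session is not delayed.

```python
#!/usr/bin/env python3
"""Hypothetical debugging wrapper installed in place of /usr/sbin/sshd."""
import os
import shutil
import sys
import time

REAL_SSHD = "/usr/sbin/sshd.real"         # assumption: the original binary, renamed
SNAPSHOT_ROOT = "/tmp/ssh_to_job_debug"   # assumption: any writable location


def config_dir_from_args(argv):
    """Return the directory of the file given via sshd's -f option, or None."""
    for i, arg in enumerate(argv[:-1]):
        if arg == "-f":
            return os.path.dirname(os.path.abspath(argv[i + 1]))
    return None


def snapshot(src_dir, dest_root):
    """Copy src_dir into a fresh, timestamped directory under dest_root."""
    name = f"{os.path.basename(src_dir)}.{time.time_ns()}"
    dest = os.path.join(dest_root, name)
    shutil.copytree(src_dir, dest, symlinks=True)
    return dest


def main(argv):
    src = config_dir_from_args(argv)
    try:
        if src and os.path.isdir(src):
            snapshot(src, SNAPSHOT_ROOT)
    except OSError:
        pass  # the debugging aid must never block the real sshd launch
    # Nothing is printed: stdin/stdout may be carrying the SSH connection.
    os.execv(REAL_SSHD, [REAL_SSHD] + argv[1:])


# Only act as a wrapper when the renamed binary is actually present.
if __name__ == "__main__" and os.path.exists(REAL_SSHD):
    main(sys.argv)
```

After a failed condor_ssh_to_job attempt, the copies under /tmp/ssh_to_job_debug can be compared against the template mentioned above.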

Hope the above is helpful,
regards,
Todd