
Re: [HTCondor-users] question related to condor_ssh_to_job and container



Dear Todd,

First, thanks a lot for your advice and your answer.

o) The container running the HTCondor EP daemons is created in the following way:

- CentOS 7 is used as parent image
- The following is used to import the signing key and retrieve
version 9.0.14 from the HTCondor repository:
  "RUN rpm --import http://research.cs.wisc.edu/htcondor/yum/RPM-GPG-KEY-HTCondor"

  "RUN yum install -y https://research.cs.wisc.edu/htcondor/repo/9.0/htcondor-release-current.el7.noarch.rpm"
- The configuration is generated with Ansible and retrieved by the container entrypoint, which runs condor_master.

- The container includes sshd (openssh-server-7.4p1-22.el7_9.x86_64).

- Concerning an update to the version v9.12.0 that you mentioned below, I was only able to find the following:
  "https://research.cs.wisc.edu/htcondor/repo/9.x/el7/release/condor-9.12.0-1.el7.x86_64.rpm"

- The container is run in the following way to create a worker node running the HTCondor EP daemons:

  "singularity run --env TardisDroneCores=1 --env TardisDroneMemory=128 --env TardisDroneUuid=test-condor --userns -B $PWD/config:/srv/config -B /tmp:/scratch -B /cvmfs /cvmfs/unpacked.cern.ch/gitlab-p4n.aip.de\:5005/compute4punch/container-stacks/htcondor-wn\:latest/"

[singularity is actually apptainer: ls -la /bin/singularity ----> /bin/singularity -> apptainer]

The configuration in $PWD/config contains the specific configuration for our HTCondor worker node, retrieved from a GitLab repository.

o) Our machine (c4p-login-dev) is now an HTCondor worker node (condor_status gives slot1@test-condor@c4p-login-dev).

A test job is submitted on the same machine (c4p-login-dev), specifying in the jdl file:
"requirements = Target.Machine =?= "c4p-login-dev""

The executable is a simple sleep, running in a singularity container defined in the jdl file by:
+SINGULARITY_JOB_CONTAINER = "wlcg-wn:latest"

The container wlcg-wn uses CentOS 7 as its parent image and contains the environment to run jobs on the LHC grid.
It includes sshd as well (openssh-server-7.4p1-22.el7_9.x86_64).

The job submission is done on the command line via:
"condor_submit submit-sleep.jdl"
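
For reference, a submit file along these lines could be generated as sketched below. Only the requirements expression and the +SINGULARITY_JOB_CONTAINER attribute are taken from the description above; the universe, the path to sleep, the sleep duration and the log/output/error file names are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch: write a minimal submit file for the sleep test job.
# Assumed (not from the thread): vanilla universe, /bin/sleep,
# a 600 s duration, and the log/output/error file names.
cat > submit-sleep.jdl <<'EOF'
universe     = vanilla
executable   = /bin/sleep
arguments    = 600
requirements = Target.Machine =?= "c4p-login-dev"
+SINGULARITY_JOB_CONTAINER = "wlcg-wn:latest"
log          = sleep.log
output       = sleep.out
error        = sleep.err
queue
EOF

# The job is then submitted with:
#   condor_submit submit-sleep.jdl
```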

o) Concerning the sshd configuration ".condor_ssh_to_job_1/sshd_config", I wanted to access it to check that the configuration defined
in "condor_ssh_to_job_sshd_config_template" is correctly propagated there.

Thanks a lot again for your answer!

Best wishes,
Benoit


On 25/10/2022 18:59, Todd Tannenbaum wrote:
On 10/25/2022 9:39 AM, Benoit Roland wrote:
Dear all,

I am having an issue using condor_ssh_to_job, which results in "ssh_exchange_identification: Connection closed by remote host".

I am using HTCondor version 9.0.14 for x86_64_CentOS7.

The situation is the following.

I am running HTCondor in an Apptainer container, and running a test job in an Apptainer container via this HTCondor setup.


Hi Benoit,

Some questions and initial thoughts --

So you placed the HTCondor Execution Point (EP) daemons (condor_startd etc.) inside an Apptainer container... Question: how did you create this container? Did you set up your container image by using the get.htcondor.org tool, by using the official htcondor image, or by some other means? Does your container include sshd? Could you (easily) try using HTCondor v9.12.0+ and see if you have better results?

Next, you said you ran a test job in an Apptainer container via this setup... Does this mean you submitted a vanilla universe job against the EP running inside a container, or were there two containers involved: the container running the EP daemons, and then a second container that was your job, running nested inside the EP container? If the latter, how was the job container launched -- via HTCondor's Singularity support or via a shell script included within the job itself?

Also note in the release notes at https://htcondor.readthedocs.io/en/latest/version-history/stable-release-series-90.html the following entry for HTCondor v9.0.17: 'If “Singularity” is really the “Apptainer” runtime, HTCondor now sets environment variables to be passed to the job appropriately, which prevents Apptainer from displaying ugly warnings about how this won’t work in the future.' So support for HTCondor launching container jobs correctly via Apptainer was not added until v9.0.17; setting up ssh_to_job requires the ability to set environment variables, which did not work correctly with Apptainer in v9.0.14 since this was not backwards compatible with Singularity.
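
A quick way to check an EP against that threshold is sketched below. It assumes that `condor_version` prints a line of the form `$CondorVersion: 9.0.14 ...` and that `sort -V` (GNU coreutils) is available. The comparison is purely numeric, so it is only meaningful within the 9.0 stable series; feature-series releases should be checked against their own release notes.

```shell
#!/bin/sh
# version_ge A B: succeed when dotted version A is >= dotted version B.
# Relies on `sort -V` (version sort from GNU coreutils).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Assumed parsing: the second field of the CondorVersion line is the
# version number, e.g. "$CondorVersion: 9.0.14 Jun 01 2022 ... $".
if command -v condor_version >/dev/null 2>&1; then
    installed=$(condor_version | awk '/CondorVersion/ {print $2; exit}')
    if version_ge "$installed" "9.0.17"; then
        echo "HTCondor $installed: at or above 9.0.17"
    else
        echo "HTCondor $installed: predates the 9.0.17 Apptainer change"
    fi
fi
```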

You mentioned the ".condor_ssh_to_job_1" dir does not exist -- if I recall correctly, it only exists while an ssh_to_job is in progress, so it may be difficult to "catch in the act" if it is failing quickly. Perhaps in a test container you could wrap /usr/sbin/sshd with a script that sleeps long enough for you to inspect that directory and/or copies out the contents of that directory. In any event, the sshd_config file is generated from the template located at /usr/lib64/condor/condor_ssh_to_job_sshd_config_template; perhaps looking at that file will help you out.
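
Such a wrapper might look roughly like the sketch below, intended for a test container only. Several things here are assumptions rather than facts from this thread: that the real binary has been moved to /usr/sbin/sshd.real, that sshd receives its config path through the standard `-f <file>` option, and that a 120 second pause is acceptable. The names REAL_SSHD, DEBUG_DIR and HOLD_SECONDS are hypothetical. The wrapper writes nothing to stdout, since sshd may be started with stdin/stdout attached to the connection (as with `sshd -i`).

```shell
#!/bin/sh
# Debugging wrapper sketch, to be installed as /usr/sbin/sshd in a TEST
# container after moving the real binary aside.
# Assumptions: real binary at /usr/sbin/sshd.real; config passed as "-f <file>".
REAL_SSHD=${REAL_SSHD:-/usr/sbin/sshd.real}
DEBUG_DIR=${DEBUG_DIR:-/tmp/ssh_to_job_debug}
HOLD_SECONDS=${HOLD_SECONDS:-120}

# Print the argument that follows "-f", if any; fail otherwise.
find_sshd_config() {
    while [ "$#" -gt 0 ]; do
        if [ "$1" = "-f" ] && [ "$#" -ge 2 ]; then
            printf '%s\n' "$2"
            return 0
        fi
        shift
    done
    return 1
}

# Copy the directory holding the given config file into DEBUG_DIR.
snapshot_config_dir() {
    src=$(dirname "$1")
    dest="$DEBUG_DIR/$(date +%Y%m%dT%H%M%S).$$"
    mkdir -p "$dest" && cp -Rp "$src/." "$dest/"
}

# Act as a wrapper only when invoked under the name "sshd", so the
# helper functions above can also be sourced and tried on their own.
if [ "${0##*/}" = "sshd" ]; then
    if cfg=$(find_sshd_config "$@"); then
        # Errors go to a log file, never to stdout.
        snapshot_config_dir "$cfg" 2>>"$DEBUG_DIR.log"
    fi
    sleep "$HOLD_SECONDS"
    exec "$REAL_SSHD" "$@"
fi
```

The copied directory can then be compared against the template at leisure, even if the original directory is removed when the session fails.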

Hope the above is helpful,
regards,
Todd