Dear Greg,

On 26.02.19 at 18:18, Greg Thain wrote:
> On 2/26/19 11:09 AM, Oliver Freyermuth wrote:
>> Dear HTCondor experts, dear Greg,
>> trying a dirty hack to replace "-a" with "-m -u -i -n -p -U" still makes
>> things fail miserably, since Singularity has somehow already exited when
>> nsenter is called:
>
> How has Singularity exited? It should still be running the job at that time?
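(For context, the "dirty hack" just swaps nsenter's "-a"/--all for the individual namespace flags; conceptually it boils down to something like the following, where the target PID and the shell are only placeholders and the actual command assembled by the starter may look different:)

    # Enter the mount, UTS, IPC, network, PID and user namespaces of the
    # process holding the container's namespaces, instead of using "-a"/--all.
    # 18414 is only a placeholder PID here.
    nsenter -t 18414 -m -u -i -n -p -U -- /bin/sh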
I'm also rather stupefied by this. Here's what I see with 10 millisecond process tree snapshots.

First, all is well:

    freyermu 18402 2.0 0.0 20000 832 ? SNs 18:22 0:00 \_ /usr/libexec/singularity/bin/action-suid /bin/sleep 180
    freyermu 18414 0.0 0.0 27288 856 ? SN 18:22 0:00     \_ shim-init /bin/sleep 180
    freyermu 18415 0.0 0.0 4116 312 ? SN 18:22 0:00         \_ /bin/sleep 180

Then, condor_ssh_to_job is prepared:

    freyermu 18402 2.0 0.0 20000 832 ? SNs 18:22 0:00 \_ /usr/libexec/singularity/bin/action-suid /bin/sleep 180
    freyermu 18414 0.0 0.0 27288 856 ? SN 18:22 0:00 |   \_ shim-init /bin/sleep 180
    freyermu 18415 0.0 0.0 4116 312 ? SN 18:22 0:00 |       \_ /bin/sleep 180
    freyermu 18503 0.0 0.0 21980 1536 ? S 18:23 0:00 \_ /bin/sh /usr/libexec/condor/condor_ssh_to_job_sshd_setup /pool/condor/dir_18316 /usr/libexec/condor/condor_ssh_to_job_shell_setup /etc/condor/condor_ssh_to_job_sshd_config_template "/usr/bin/ssh-keygen" "-N" "" "-C" "" "-q" "-f" "%f" "-t" "rsa"

Finally, SSH is started outside of the container:

    freyermu 18402 1.0 0.0 20000 832 ? SNs 18:22 0:00 \_ /usr/libexec/singularity/bin/action-suid /bin/sleep 180
    freyermu 18414 0.0 0.0 27288 856 ? SN 18:22 0:00 |   \_ shim-init /bin/sleep 180
    freyermu 18415 0.0 0.0 4116 312 ? SN 18:22 0:00 |       \_ /bin/sleep 180
    freyermu 18518 22.0 0.0 125228 4616 ? SNs 18:23 0:00 \_ sshd: freyermu [priv]

And then, I see this:

    root     18544 0.0 0.0 112728 976 pts/0 S+ 18:23 0:00 \_ grep --color=auto freyermu
    freyermu 18402 1.0 0.0 0 0 ? ZNs 18:22 0:00 \_ [action-suid] <defunct>
    freyermu 18518 23.0 0.0 125228 4676 ? SNs 18:23 0:00 \_ sshd: freyermu [priv]
    freyermu 18539 0.0 0.0 125228 1796 ? SN 18:23 0:00     \_ sshd: freyermu@pts/2
    freyermu 18540 0.0 0.0 56000 4584 pts/2 SNs+ 18:23 0:00     \_ /usr/bin/condor_docker_enter

In the logs, I only find:

    Feb 26 18:23:01 wn022 condor_starter[18316]: Create_Process succeeded, pid=18518
    Feb 26 18:23:01 wn022 condor_starter[18316]: Limiting (soft) memory usage to 0 bytes
    Feb 26 18:23:01 wn022 condor_starter[18316]: Limiting memsw usage to 9223372036854775807 bytes
    Feb 26 18:23:01 wn022 condor_starter[18316]: Limiting (hard) memory usage to 104857600 bytes
    Feb 26 18:23:01 wn022 condor_starter[18316]: Limiting memsw usage to 267144892416 bytes
    Feb 26 18:23:01 wn022 condor_starter[18316]: Process exited, pid=18503, status=0
    Feb 26 18:23:01 wn022 condor_starter[18316]: unhandled job exit: pid=18503, status=0
    Feb 26 18:23:01 wn022 condor_starter[18316]: Accepted new connection from ssh client for container job
    Feb 26 18:23:01 wn022 condor_starter[18316]: singularity enter_ns returned pid 18546
    Feb 26 18:23:01 wn022 condor_starter[18316]: Process exited, pid=18402, status=255

Checking /usr/libexec/condor/condor_ssh_to_job_shell_setup, though, I find this code:

    # kill the dummy sleep job if this is an interactive job
    if grep -q '^InteractiveJob = true' "${_CONDOR_SCRATCH_DIR}/.job.ad"; then
        if [ "${_CONDOR_JOB_PIDS}" != "" ]; then
            kill "${_CONDOR_JOB_PIDS}" 2>/dev/null
            _CONDOR_JOB_PIDS=""
        fi
    fi

So probably this only fails for interactive jobs, since the sleep is reaped before we attach? I can't test with a batch job right now, since I am already in the middle of the downgrade (and we still lack a proper test setup), but I'll try.

Cheers,
Oliver
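P.S.: In case somebody wants to reproduce the snapshots above, a plain shell loop along these lines should do (interval, user name and output file are of course only examples, and the grep will also show up in the output, as it does above):

    #!/bin/sh
    # Dump a process-tree snapshot of all processes of the given user
    # roughly every 10 ms, with a timestamp in front of each snapshot.
    WATCH_USER=freyermu
    while true; do
        date '+%H:%M:%S.%N'
        ps faux | grep "$WATCH_USER"
        sleep 0.01
    done > /tmp/pstree-snapshots.log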
> -greg
--
Oliver Freyermuth
Universität Bonn
Physikalisches Institut, Raum 1.047
Nußallee 12
53115 Bonn
--
Tel.: +49 228 73 2367
Fax:  +49 228 73 7869
--