Dear Greg,

On 26.02.19 at 18:18, Greg Thain wrote:
> On 2/26/19 11:09 AM, Oliver Freyermuth wrote:
>> Dear HTCondor experts, dear Greg,
>> trying a dirty hack to replace "-a" with "-m -u -i -n -p -U" still makes things
>> fail miserably, since Singularity has somehow already exited when nsenter is called:
> How has Singularity exited? It should still be running the job at that time?
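For context, the "dirty hack" boils down to the namespace flags handed to nsenter, i.e. roughly the difference between

    nsenter -t 18415 -a <command>

and

    nsenter -t 18415 -m -u -i -n -p -U <command>

(target PID and command are only illustrative here), entering the mount, UTS, IPC, network, PID and user namespaces explicitly instead of all of them via "-a".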
I'm also rather stupefied by this.
Here's what I see with 10 millisecond process tree snapshots.
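These were grabbed with something like the following loop (just a sketch, the exact invocation is not the point):

    # dump the process tree every 10 ms and keep the lines belonging to the job owner
    while true; do
        date +%T.%N >> /tmp/pstree.log
        ps auxf | grep freyermu >> /tmp/pstree.log
        sleep 0.01
    done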
First, all is well:
freyermu 18402 2.0 0.0 20000 832 ? SNs 18:22 0:00 \_ /usr/libexec/singularity/bin/action-suid /bin/sleep 180
freyermu 18414 0.0 0.0 27288 856 ? SN 18:22 0:00 \_ shim-init /bin/sleep 180
freyermu 18415 0.0 0.0 4116 312 ? SN 18:22 0:00 \_ /bin/sleep 180
Then, condor_ssh_to_job is prepared:
freyermu 18402 2.0 0.0 20000 832 ? SNs 18:22 0:00 \_ /usr/libexec/singularity/bin/action-suid /bin/sleep 180
freyermu 18414 0.0 0.0 27288 856 ? SN 18:22 0:00 | \_ shim-init /bin/sleep 180
freyermu 18415 0.0 0.0 4116 312 ? SN 18:22 0:00 | \_ /bin/sleep 180
freyermu 18503 0.0 0.0 21980 1536 ? S 18:23 0:00 \_ /bin/sh /usr/libexec/condor/condor_ssh_to_job_sshd_setup /pool/condor/dir_18316 /usr/libexec/condor/condor_ssh_to_job_shell_setup /etc/condor/condor_ssh_to_job_sshd_config_template "/usr/bin/ssh-keygen" "-N" "" "-C" "" "-q" "-f" "%f" "-t" "rsa"
Finally, SSH is started outside of the container:
freyermu 18402 1.0 0.0 20000 832 ? SNs 18:22 0:00 \_ /usr/libexec/singularity/bin/action-suid /bin/sleep 180
freyermu 18414 0.0 0.0 27288 856 ? SN 18:22 0:00 | \_ shim-init /bin/sleep 180
freyermu 18415 0.0 0.0 4116 312 ? SN 18:22 0:00 | \_ /bin/sleep 180
freyermu 18518 22.0 0.0 125228 4616 ? SNs 18:23 0:00 \_ sshd: freyermu [priv]
And then, I see this:
root 18544 0.0 0.0 112728 976 pts/0 S+ 18:23 0:00 \_ grep --color=auto freyermu
freyermu 18402 1.0 0.0 0 0 ? ZNs 18:22 0:00 \_ [action-suid] <defunct>
freyermu 18518 23.0 0.0 125228 4676 ? SNs 18:23 0:00 \_ sshd: freyermu [priv]
freyermu 18539 0.0 0.0 125228 1796 ? SN 18:23 0:00 \_ sshd: freyermu@pts/2
freyermu 18540 0.0 0.0 56000 4584 pts/2 SNs+ 18:23 0:00 \_ /usr/bin/condor_docker_enter
In the logs, I only find:
Feb 26 18:23:01 wn022 condor_starter[18316]: Create_Process succeeded, pid=18518
Feb 26 18:23:01 wn022 condor_starter[18316]: Limiting (soft) memory usage to 0 bytes
Feb 26 18:23:01 wn022 condor_starter[18316]: Limiting memsw usage to 9223372036854775807 bytes
Feb 26 18:23:01 wn022 condor_starter[18316]: Limiting (hard) memory usage to 104857600 bytes
Feb 26 18:23:01 wn022 condor_starter[18316]: Limiting memsw usage to 267144892416 bytes
Feb 26 18:23:01 wn022 condor_starter[18316]: Process exited, pid=18503, status=0
Feb 26 18:23:01 wn022 condor_starter[18316]: unhandled job exit: pid=18503, status=0
Feb 26 18:23:01 wn022 condor_starter[18316]: Accepted new connection from ssh client for container job
Feb 26 18:23:01 wn022 condor_starter[18316]: singularity enter_ns returned pid 18546
Feb 26 18:23:01 wn022 condor_starter[18316]: Process exited, pid=18402, status=255
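As an aside, a generic way to find out which process actually sends the fatal signal would be to trace kill-type syscalls below the starter, e.g. (just a sketch, I have not done this yet):

    # attach to the condor_starter from the log above, follow its children,
    # and show only the kill-family syscalls
    strace -f -e trace=kill,tgkill,tkill -p 18316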
Checking /usr/libexec/condor/condor_ssh_to_job_shell_setup, though, I find the code:
# kill the dummy sleep job if this is an interactive job
if grep -q '^InteractiveJob = true' "${_CONDOR_SCRATCH_DIR}/.job.ad"; then
    if [ "${_CONDOR_JOB_PIDS}" != "" ]; then
        kill "${_CONDOR_JOB_PIDS}" 2>/dev/null
        _CONDOR_JOB_PIDS=""
    fi
fi
So presumably this only fails for interactive jobs: the dummy sleep is killed (and with it the Singularity container exits) before we attach?
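A quick way to check that on the worker node would be (sketch; scratch directory and PIDs taken from the output above):

    # is the job flagged as interactive in its job ad?
    grep '^InteractiveJob' /pool/condor/dir_18316/.job.ad
    # and is the dummy sleep (pid 18415 above) already gone once sshd shows up?
    ps -p 18415 -o pid,ppid,stat,cmd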
I can't test with a batch job right now, since I am already in the middle of the downgrade (and we still lack a proper test setup), but I'll try.
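For the record, the batch-job cross-check I have in mind is roughly this (illustrative submit file, not our production configuration; it relies on the execute node's Singularity setup wrapping the job as before):

    cat > sleep.sub <<'EOF'
    universe   = vanilla
    executable = /bin/sleep
    arguments  = 180
    output     = sleep.out
    error      = sleep.err
    log        = sleep.log
    queue
    EOF
    condor_submit sleep.sub
    # once the job is running:
    condor_ssh_to_job <cluster>.<proc>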
Cheers,
Oliver
--
Oliver Freyermuth
Universität Bonn
Physikalisches Institut, Raum 1.047
Nußallee 12
53115 Bonn
--
Tel.: +49 228 73 2367
Fax: +49 228 73 7869
--