Hello,
We are submitting condor jobs that run inside Singularity containers. The startds pass the --nv option so that GPU support is available inside the containers for machine learning applications:
SINGULARITY_EXTRA_ARGUMENTS = --nv
SINGULARITY_JOB = !isUndefined(TARGET.SingularityImage)
SINGULARITY_IMAGE_EXPR = TARGET.SingularityImage
This works great for the jobs themselves. However, when we use condor_ssh_to_job, the session loses the libcuda environment that --nv sets up; see [1]. Could it be that condor does not pass --nv when entering the container?
Has anyone tried this?
[1]
[khurtado@camlnd ~]$ condor_ssh_to_job 60.0
Welcome to slot1_2@xxxxxxxxxxxx!
Your condor job is running with pid(s) 63160.
-sh: cannot set terminal process group (-1): Inappropriate ioctl for device
-sh: no job control in this shell
-sh: /root/.profile: Permission denied
-sh-4.2$ cat /etc/redhat-release
CentOS Linux release 7.7.1908 (Core)
-sh-4.2$ python
Python 2.7.5 (default, Aug  7 2019, 00:51:29)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-39)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
 File "/usr/lib/python2.7/site-packages/tensorflow/__init__.py", line 24, in <module>
  from tensorflow.python import pywrap_tensorflow  # pylint: disable=unused-import
 File "/usr/lib/python2.7/site-packages/tensorflow/python/__init__.py", line 49, in <module>
  from tensorflow.python import pywrap_tensorflow
 File "/usr/lib/python2.7/site-packages/tensorflow/python/pywrap_tensorflow.py", line 74, in <module>
  raise ImportError(msg)
ImportError: Traceback (most recent call last):
 File "/usr/lib/python2.7/site-packages/tensorflow/python/pywrap_tensorflow.py", line 58, in <module>
  from tensorflow.python.pywrap_tensorflow_internal import *
 File "/usr/lib/python2.7/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 28, in <module>
  _pywrap_tensorflow_internal = swig_import_helper()
 File "/usr/lib/python2.7/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 24, in swig_import_helper
  _mod = imp.load_module('_pywrap_tensorflow_internal', fp, pathname, description)
ImportError: libcuda.so.1: cannot open shared object file: No such file or directory
Failed to load the native TensorFlow runtime.
See https://www.tensorflow.org/install/errors
for some common reasons and solutions. Include the entire stack trace
above this error message when asking for help.
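
To narrow this down without importing TensorFlow, a small check like the one below could be run both from the job itself and from the condor_ssh_to_job session, and the two outputs compared. This is only a sketch: the helper names can_load and nv_report are made up for illustration, and the /.singularity.d/libs path is an assumption about where Singularity binds the host NVIDIA libraries when --nv is in effect.

```python
import ctypes
import os


def can_load(libname):
    """Return True if the dynamic loader can open libname, else False."""
    try:
        ctypes.CDLL(libname)
        return True
    except OSError:
        return False


def nv_report(libname="libcuda.so.1"):
    """Collect facts that would likely differ if the --nv setup is missing."""
    return {
        "library": libname,
        "loadable": can_load(libname),
        "LD_LIBRARY_PATH": os.environ.get("LD_LIBRARY_PATH", ""),
        # Assumption: with --nv, Singularity binds the host libraries here.
        "singularity_libs_dir_exists": os.path.isdir("/.singularity.d/libs"),
    }


if __name__ == "__main__":
    # Works on both Python 2.7 and Python 3.
    for key, value in sorted(nv_report().items()):
        print("%s: %s" % (key, value))
```

If "loadable" is True in the job but False in the ssh session, and LD_LIBRARY_PATH or the bind directory differ between the two, that would point at the --nv setup not being applied when entering the container.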