
[HTCondor-users] Python bindings abort with: terminate called after throwing an instance of 'boost::python::error_already_set'



Hi, we use the htcondor python bindings and we have been running into an issue where the bindings randomly abort the program with the above error. It happens infrequently and non-deterministically.

I did some digging and I believe I know the cause. When there are transient I/O errors (network disconnected, etc.), htcondor attempts to raise an exception to the client. However, if this happens inside a C++ destructor, the program aborts immediately: destructors are noexcept by default since C++11, so an exception that escapes one results in a call to std::terminate.

I put together the following small example just to demonstrate the issue (using htcondor and psutil both from conda-forge):

 # htcondor_bug.py
 import htcondor
 import os
 import psutil
 import sys

 disconnect_before_destructor = int(sys.argv[1])

 def bug():
  schedd = htcondor.Schedd()

  # Start a connection
  txn = schedd.transaction()

  # Force-close the socket to simulate I/O issues
  os.close(psutil.Process().connections()[-1].fd)

  if disconnect_before_destructor:
   # Here we will raise a regular exception
   with txn:
    pass
  else:
   # Program will abort with 'boost::python::error_already_set'
   # on the way out of the function
   pass

 bug()

Normal case where we get a clean exception:

 $ python ./htcondor_bug.py 1
 [...]/lib/python3.12/site-packages/htcondor/_deprecation.py:41: FutureWarning: Schedd.transaction() was deprecated in version 10.7.0 and will be removed in a future release. Use Schedd.submit() instead.
  warnings.warn(message, FutureWarning)
 Traceback (most recent call last):
  File "[...]/htcondor_bug.py", line 26, in <module>
   bug()
  File "[...]/htcondor_bug.py", line 19, in bug
   with txn:
  File "[...]/lib/python3.12/site-packages/htcondor/_lock.py", line 100, in __exit__
   return self.cm.__exit__(*args, **kwargs)
      Â^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "[...]/lib/python3.12/site-packages/htcondor/_lock.py", line 70, in wrapper
   rv = func(*args, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^
 htcondor.HTCondorIOError: Failed to commit and disconnect from queue.

Buggy case where we abort the program uncleanly:

 $ python ./htcondor_bug.py 0
 [...]/lib/python3.12/site-packages/htcondor/_deprecation.py:41: FutureWarning: Schedd.transaction() was deprecated in version 10.7.0 and will be removed in a future release. Use Schedd.submit() instead.
  warnings.warn(message, FutureWarning)
 terminate called after throwing an instance of 'boost::python::error_already_set'
 Aborted

Please note, I'm using Schedd.transaction() here just as an easy way to demonstrate the issue deterministically. In our production code we do use Schedd.submit() but the problem is that I/O errors can still happen at inopportune times and cause a C++ exception to be thrown inside of a destructor.

What can be done about this? IMO it would be best to ensure that no destructor ever lets an exception escape (e.g. by catching and logging it inside the destructor), so that the program cannot be aborted this way. From the Python user's perspective, functions like Schedd.submit() can raise exceptions if something goes wrong, but exceptions should not be raised purely at the time of object destruction, e.g. it shouldn't be possible to get a Python exception in reaction to a Schedd object's reference count reaching 0.
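To make the suggestion concrete, here is a rough sketch of the pattern in pure Python: fallible teardown lives in an explicit close() (or the context manager exit), where raising is fine, while the destructor only makes a best-effort attempt and swallows any failure. The Connection class and its fail_on_close flag are made up for illustration and are not part of the htcondor API.

```python
import logging

logger = logging.getLogger("cleanup_sketch")


class Connection:
    """Hypothetical resource whose teardown can fail."""

    def __init__(self, fail_on_close=False):
        self._fail_on_close = fail_on_close
        self.closed = False

    def close(self):
        # Explicit teardown: allowed to raise to the caller
        if self.closed:
            return
        self.closed = True  # mark first so the destructor does not retry
        if self._fail_on_close:
            raise OSError("simulated I/O failure while closing")

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
        return False

    def __del__(self):
        # Best-effort teardown: never let an exception escape
        try:
            self.close()
        except Exception:
            logger.debug("ignoring error during destruction", exc_info=True)


# Explicit use surfaces the error as a normal exception
try:
    with Connection(fail_on_close=True):
        pass
    explicit_raised = False
except OSError:
    explicit_raised = True

# Dropping the last reference does not raise
conn = Connection(fail_on_close=True)
del conn
implicit_ok = True
```

A C++ destructor could follow the same shape with a try/catch around the disconnect, but that is an assumption about a possible fix, not a description of the current code.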