[HTCondor-users] Python bindings abort with: terminate called after throwing an instance of 'boost::python::error_already_set'


Hi, we use the HTCondor Python bindings and have been running into an issue where the bindings randomly abort the program with the above error. It happens infrequently and non-deterministically.

I did some digging and I believe I know the cause of the issue. When there are transient I/O errors (network disconnected, etc.), htcondor attempts to raise exceptions to the client. However, if this happens inside a C++ destructor, it immediately crashes the program, because C++ destructors are noexcept by default and an exception escaping one calls std::terminate.
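For contrast, here is a minimal sketch of how pure Python treats the same situation. It is not part of the original report and does not use htcondor; the class name Noisy and the hook function are hypothetical. In CPython, an exception raised in __del__ is not propagated to the caller. It is handed to sys.unraisablehook, and the program keeps running instead of aborting.

```python
import sys


class Noisy:
    """Hypothetical object whose finalizer fails."""

    def __del__(self):
        raise RuntimeError("failure during finalization")


captured = []


def hook(unraisable):
    # Record only the message; keeping the exception object alive
    # could create reference cycles.
    captured.append(str(unraisable.exc_value))


old_hook = sys.unraisablehook
sys.unraisablehook = hook
try:
    obj = Noisy()
    del obj  # In CPython the refcount reaches 0 and __del__ runs here
    survived = True  # Reached because the exception was not propagated
finally:
    sys.unraisablehook = old_hook
```

The abort described above happens one layer down, in the C++ code behind the bindings, where no such hook applies.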

I put together the following small example just to demonstrate the issue (using htcondor and psutil both from conda-forge):

  # htcondor_bug.py

  import htcondor
  import os
  import psutil
  import sys

  disconnect_before_destructor = int(sys.argv[1])

  def bug():
      schedd = htcondor.Schedd()

      # Start a connection
      txn = schedd.transaction()

      # Force-close the socket to simulate I/O issues
      os.close(psutil.Process().connections()[-1].fd)

      if disconnect_before_destructor:
          # Here we will raise a regular exception
          with txn:
              pass
      else:
          # Program will abort with 'boost::python::error_already_set'
          # on the way out of the function
          pass

  bug()

Normal case where we get a clean exception:

  $ python ./htcondor_bug.py 1
  [...]/lib/python3.12/site-packages/htcondor/_deprecation.py:41: FutureWarning: Schedd.transaction() was deprecated in version 10.7.0 and will be removed in a future release. Use Schedd.submit() instead.
    warnings.warn(message, FutureWarning)
  Traceback (most recent call last):
    File "[...]/htcondor_bug.py", line 26, in <module>
      bug()
    File "[...]/htcondor_bug.py", line 19, in bug
      with txn:
    File "[...]/lib/python3.12/site-packages/htcondor/_lock.py", line 100, in __exit__
      return self.cm.__exit__(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "[...]/lib/python3.12/site-packages/htcondor/_lock.py", line 70, in wrapper
      rv = func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  htcondor.HTCondorIOError: Failed to commit and disconnect from queue.

Buggy case where we abort the program uncleanly:

  $ python ./htcondor_bug.py 0
  [...]/lib/python3.12/site-packages/htcondor/_deprecation.py:41: FutureWarning: Schedd.transaction() was deprecated in version 10.7.0 and will be removed in a future release. Use Schedd.submit() instead.
    warnings.warn(message, FutureWarning)
  terminate called after throwing an instance of 'boost::python::error_already_set'
  Aborted

Please note, I'm using Schedd.transaction() here just as an easy way to demonstrate the issue deterministically. In our production code we do use Schedd.submit(), but I/O errors can still happen at inopportune times and cause a C++ exception to be thrown inside a destructor.

What can be done about this? IMO it would be best to enforce that all destructors are noexcept and never let an exception escape, to avoid aborting the program. From the Python user's perspective, functions like Schedd.submit() can raise exceptions if something goes wrong, but exceptions should not be raised purely at the time of object destruction, e.g. it shouldn't be possible to get a Python exception in reaction to a Schedd object's reference count reaching 0.
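The proposal above can be sketched in pure Python. This is only an illustration of the pattern, assuming a hypothetical FakeConnection class. It is not HTCondor's API or implementation. Explicit operations (close(), or leaving a with block) report errors to the caller, while cleanup triggered by destruction is best effort and never raises.

```python
class FakeConnection:
    """Hypothetical stand-in for a resource whose cleanup can fail."""

    def __init__(self, fail_on_close=False):
        self.closed = False
        self.fail_on_close = fail_on_close

    def close(self):
        """Explicit cleanup: errors are reported to the caller."""
        if self.closed:
            return
        # Mark as closed first so a later close() or __del__ is a no-op.
        self.closed = True
        if self.fail_on_close:
            raise OSError("simulated I/O error during close")

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
        return False  # Do not suppress exceptions from the with body

    def __del__(self):
        # Implicit cleanup: best effort only, never propagate an error.
        try:
            self.close()
        except OSError:
            pass
```

With this shape, a caller who wants to know about a failed disconnect calls close() or uses a with block and gets a regular exception. A caller who simply drops the last reference gets silent best-effort cleanup, and nothing is raised when the reference count reaches 0.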
