HTCondor Project List Archives




Re: [Condor-devel] condor and a threaded library



Todd Tannenbaum wrote:


There's always the worry that we're calling some non-threadsafe C library
function that the library has properly protected but Condor hasn't. That
should be less and less of a worry as time goes on.

You mean that the threaded library code calls a non-reentrant C lib function, but protects itself in some sane way. However, Condor calls the same function, except without the library's protection?


Correct. Typical culprits include functions that look up host and user information. Also, even if the library uses the thread-safe "*_r()" version of function X, it is not clear whether the entire process must use the *_r functions, or whether it is OK for Condor to call the thread-unsafe versions while the library calls the _r versions.
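
To make the distinction concrete, here is a minimal sketch in C of a user lookup done with the reentrant POSIX call getpwuid_r(), which writes into a caller-supplied buffer, instead of getpwuid(), which returns a pointer to static storage that a call from another thread can overwrite. The helper name lookup_user_name and the 16 KiB buffer size are assumptions for illustration; this is not Condor code.

```c
#include <assert.h>
#include <pwd.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/* Hypothetical helper: copy the login name for uid into out.
 * Returns 0 on success, 1 if no such user exists, -1 on error. */
int lookup_user_name(uid_t uid, char *out, size_t outlen)
{
    assert(out != NULL && outlen > 0);

    struct passwd pwd;
    struct passwd *result = NULL;
    size_t buflen = 16384;            /* assumed large enough for one entry */
    char *buf = malloc(buflen);
    int status;

    if (buf == NULL) {
        return -1;
    }

    /* All string fields of pwd point into buf, not into static storage. */
    int rc = getpwuid_r(uid, &pwd, buf, buflen, &result);
    if (rc != 0) {
        status = -1;                  /* lookup failed (rc is an errno value) */
    } else if (result == NULL) {
        status = 1;                   /* no matching entry */
    } else {
        snprintf(out, outlen, "%s", pwd.pw_name);
        status = 0;
    }

    free(buf);                        /* safe: the name was copied into out */
    return status;
}
```

A caller would pass its own buffer, for example char name[256]; lookup_user_name(getuid(), name, sizeof name);. This sketch only shows what the _r variant looks like on one side; it does not settle the open question above of whether mixing the two variants in one process is safe.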

The same concerns apply beyond the C library, for example to OpenSSL.

Signal handling could become a pain as well. Condor daemons use signals to communicate with each other, and combining signals and threads in one process may require work. Post-mortem debugging could become painful too. (How well will the Google core dump library continue to work, if at all? Which thread context gets dumped?)

Noted.


It's also usually the case that, in order to do anything useful, somewhere in that library you have to call back into the Condor code, and then all bets are
off. Non-reentrant Condor utility functions are all over the place, and
there are many places that hold state between calls to the same function.

That's not a concern because the library will not call back to Condor.

If it will not call back into Condor, why link it into the Condor daemons? Just to send outbound notifications? Depending upon what sort of notifications and their frequency, perhaps a better/simpler/safer/more modular design could be found....

Initially the concern is only for outbound notifications. However, there will be inbound notifications. For inbound notifications, the library can provide an fd for DC to select() on, so that any execution of Condor code on behalf of the library would come from a Condor-controlled thread of control. The library would not call into any Condor code from a thread it controls.
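
The fd idea can be sketched with an ordinary pipe: a library thread writes one byte to the write end, and the event loop select()s on the read end and handles the notification in its own thread. This is a minimal sketch; notify_post and notify_wait are hypothetical names, and neither the messaging library's API nor DaemonCore's socket registration interface is shown.

```c
#include <assert.h>
#include <errno.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* Library side (hypothetical): wake the event loop by writing one byte.
 * Returns 0 on success, -1 on error. */
int notify_post(int wfd)
{
    char byte = 1;
    ssize_t n;
    do {
        n = write(wfd, &byte, 1);
    } while (n < 0 && errno == EINTR);
    return n == 1 ? 0 : -1;
}

/* Event-loop side: wait up to timeout_ms for the fd to become readable.
 * Returns 1 if a notification was consumed, 0 on timeout, -1 on error. */
int notify_wait(int rfd, int timeout_ms)
{
    assert(rfd >= 0 && rfd < FD_SETSIZE);

    fd_set readable;
    struct timeval tv;
    FD_ZERO(&readable);
    FD_SET(rfd, &readable);
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    int ready = select(rfd + 1, &readable, NULL, NULL, &tv);
    if (ready <= 0) {
        return ready;                 /* 0 = timeout, -1 = error */
    }

    /* Consume one byte; the handler would now run in the caller's thread. */
    char byte;
    return read(rfd, &byte, 1) == 1 ? 1 : -1;
}
```

With int fds[2] from pipe(fds), the library keeps fds[1] and the daemon watches fds[0]. The only state the two threads share is the pipe itself, so no Condor function ever runs on a library-controlled thread.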


Also, the library would not call exit(), which Condor redefines. Are there other functions that Condor redefines I'm forgetting? open/close are #define'd, so they are not a concern.


Sure, lots of them.    See the util lib.

Thanks. That's a list of #define'd and reimplemented functions. I'll see if I can wade through it and figure out which are which. I know exit() is reimplemented for sure because it is done in daemon_core.
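
The reason the two kinds matter differently: a #define only rewrites calls in source files compiled after the macro is in effect, so a library compiled without the Condor headers still calls the real function. A reimplemented function such as exit() is a real symbol, so depending on how things are linked it may also capture calls made from inside the library. The sketch below illustrates only the macro half, in a single file; traced_close, library_style_close and app_style_close are hypothetical names, not Condor's.

```c
#include <assert.h>
#include <unistd.h>

static int wrapped_calls = 0;   /* counts calls routed through the wrapper */

/* Hypothetical wrapper. It is defined before the macro below, so the
 * close() it calls is the real one. */
static int traced_close(int fd)
{
    assert(fd >= 0);
    wrapped_calls++;
    return close(fd);
}

/* Stands in for code compiled without the wrapper header, such as a
 * prebuilt library: it also reaches the real close(). */
static int library_style_close(int fd)
{
    return close(fd);
}

/* From here on, the preprocessor rewrites close(...) in this file only. */
#define close(fd) traced_close(fd)

/* Stands in for application code compiled with the wrapper header. */
static int app_style_close(int fd)
{
    return close(fd);           /* expands to traced_close(fd) */
}

int wrapped_call_count(void)
{
    return wrapped_calls;
}
```

Closing one descriptor through library_style_close() leaves the counter at 0, while closing one through app_style_close() raises it to 1.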


We've been burned time and time again by threads on Windows, even though we were convinced that they don't interact with Condor code, and lo, it turns out we still crash because of race conditions. It just doesn't seem worth it.

This is because of call-backs or Condor code itself being threaded, right?


That and the items listed above and others not enumerated. Plus we don't necessarily want threading in Condor, except for very specific tasks such as overlap of CPU and I/O, where what is going on in the worker thread is small/simple to the point of nearly a single system call. We may be forced to also do OpenSSL functions in a thread pool at some point as well, but that doesn't make us happy...

See http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-1.pdf, http://www.softpanorama.org/People/Ousterhout/Threads/ to understand some of our thinking. What we have thought about down the road for DaemonCore is a more hybrid thread/event model, something similar to http://research.microsoft.com/Farsite/USENIX2002.ps but simpler.


What are you thinking about using?

It's a client messaging library, designed to manage its own threads, which would never call back to Condor.



And it would be doing what?

As above, outbound notifications about state and later inbound notifications to change state.

Best,



matt