On 09/27/2013 07:44 AM, Paul Brenner wrote:
Hello,
We have a large Linux-based Condor pool (5,000 - 15,000 slots
depending on opportunistic availability). We also have numerous
front-end server nodes (all with >32GB of RAM) from which our campus
users can submit jobs.
In terms of scale, we have hit a limitation caused by the
condor_shadow processes' need to hold open file descriptors. When we
set the maximum number of concurrent jobs per submit host to 2000, we
occasionally have users who get nearly 2000 jobs running before
crashing the submit host.
The submit host crashes because it exhausts the maximum number of
file descriptors. We have already raised this default setting in RHEL
to 65,536.
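In case it helps, here is a quick sketch of how we snapshot the limits
actually in effect on a submit host (plain Python reading /proc, so
Linux-specific; paths may differ on other distributions):

#!/usr/bin/env python
# Print the per-process open-file limit this shell's children would
# inherit, and the kernel-wide handle count from /proc/sys/fs/file-nr.
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print('per-process RLIMIT_NOFILE: soft=%d hard=%d' % (soft, hard))

with open('/proc/sys/fs/file-nr') as f:
    allocated, free, system_max = f.read().split()
print('system-wide: %s handles allocated, fs.file-max = %s'
      % (allocated, system_max))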
Paul:
The problem here is that the shadow uses a surprisingly large number
of file descriptors. It turns out that on Linux, one fd is consumed
per shared library that an application links with, for the duration of
the process. Depending on the platform, the shadow is dynamically
linked to something like 40 libraries, so 64k open fds is probably
still not enough for 2000 jobs. Can you raise the per-user fd limit by
another factor of 10?
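To put rough numbers on it: 2000 shadows at roughly 40 library fds
apiece is already about 80,000 descriptors, more than the 65,536 you
have configured, before counting anything else each shadow holds open.
Something along these lines (a rough sketch that walks /proc on the
submit host, so run it as root or the job owner) will show the actual
per-shadow count on your systems:

#!/usr/bin/env python
# Count open file descriptors per running condor_shadow by listing
# /proc/<pid>/fd for each process whose name is condor_shadow.
import os

shadows, total = 0, 0
for pid in os.listdir('/proc'):
    if not pid.isdigit():
        continue
    try:
        with open('/proc/%s/comm' % pid) as f:
            if f.read().strip() != 'condor_shadow':
                continue
        nfds = len(os.listdir('/proc/%s/fd' % pid))
    except (IOError, OSError):
        continue  # process exited or fd directory not readable
    shadows += 1
    total += nfds

if shadows:
    print('%d shadows holding %d fds (%.1f per shadow)'
          % (shadows, total, float(total) / shadows))
else:
    print('no condor_shadow processes found')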
We'll update the wiki with this information.