Re: [HTCondor-users] condor_shadow and file descriptors
- Date: Fri, 27 Sep 2013 09:02:33 -0500
- From: Greg Thain <gthain@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] condor_shadow and file descriptors
On 09/27/2013 07:44 AM, Paul Brenner wrote:
Hello,
We have a large Linux-based Condor pool (5,000-15,000 slots, depending on
opportunistic availability). We also have numerous front-end server nodes
(all with >32 GB of RAM) from which our campus users can submit jobs.
In terms of scale, we have reached a limit due to the condor_shadow
process's need to hold open file descriptors. When we set the maximum
number of concurrent jobs per submit host to 2,000, we occasionally have
users who get nearly 2,000 jobs running before the submit host crashes.
The submit host crashes because it exhausts its maximum number of file
descriptors. We have already raised this default on RHEL to 65,536.
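As a quick way to see what descriptor limit a process on the submit host is
actually running under, here is a minimal sketch using Python's standard
resource module; the 65,536 target is the figure from the message above, and
raising the soft limit only works up to the administrator-set hard limit:

    import resource

    # Current soft/hard limits on open file descriptors for this process;
    # the soft limit is the one a condor_shadow would actually hit.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print(f"soft={soft} hard={hard}")

    # Try to raise the soft limit to 65,536 (the figure from the message),
    # capped at the hard limit unless the hard limit is unlimited.
    target = 65536
    new_soft = target if hard == resource.RLIM_INFINITY else min(target, hard)
    resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))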
Paul:
The problem here is that the shadow uses a surprisingly large number of
file descriptors. It turns out that on Linux, one fd is consumed per
shared library that an application links with, for the duration of the
process. Depending on the platform, the shadow is dynamically linked
against something like 40 libraries, so 2,000 running jobs already need
roughly 80,000 fds, and 64k open fds is probably still not enough. Can
you raise the per-user fd limit by another factor of 10?
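To check that per-shadow figure on a live submit host, one diagnostic sketch
(not part of the original thread; it assumes a Linux /proc filesystem and
that the shadow processes are readable by whoever runs it) is to count each
condor_shadow's open descriptors directly:

    import os

    # Count open file descriptors for every running condor_shadow by
    # listing /proc/<pid>/fd (Linux only). With ~40 fds per shadow and
    # 2,000 running jobs, the total approaches 80,000, which is past a
    # 65,536 limit and is why a further 10x increase leaves headroom.
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{pid}/comm") as f:
                comm = f.read().strip()
            if comm == "condor_shadow":
                nfds = len(os.listdir(f"/proc/{pid}/fd"))
                print(f"pid {pid}: {nfds} open fds")
        except OSError:
            pass  # process exited or access denied; skip it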
We'll update the wiki with this information.
-Greg