
Re: [HTCondor-users] condor_shadow and file descriptors



On 9/27/13 9:02 AM, Greg Thain wrote:
On 09/27/2013 07:44 AM, Paul Brenner wrote:
Hello,

We have a large Linux based Condor pool (5,000 - 15,000 slots depending on opportunistic availability). We also have numerous front end server nodes (all with >32GB of RAM) from which our campus users can submit jobs.
In terms of scale, we have reached a limit imposed by the condor_shadow process's need to hold open file descriptors. When we set the maximum number of concurrent jobs per submit host to 2000, we occasionally have users who get nearly 2000 jobs running before the submit host crashes. The submit host crashes because it exhausts the maximum number of file descriptors. We have already raised that default limit in RH Linux to 65,536.
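
For anyone checking what their submit host is actually configured for, here is a minimal sketch (assuming Python 3 on a Linux host; the 65,536 target is just the value mentioned above) that reports the per-process descriptor limit the shadows run under and raises the soft limit toward the hard limit where possible:

import resource

def report_and_raise_nofile(target=65536):
    # RLIMIT_NOFILE is the per-process open-file-descriptor limit
    # (the same number "ulimit -n" reports in a shell).
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print(f"soft limit: {soft}, hard limit: {hard}")

    # An unprivileged process may raise its soft limit up to the hard
    # limit; raising the hard limit itself needs root (or an entry in
    # /etc/security/limits.conf followed by a fresh login).
    if soft < target <= hard:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
        print(f"soft limit raised to {target}")

if __name__ == "__main__":
    report_and_raise_nofile()

Note that processes inherit these limits from their parent, so any change has to be in place before the HTCondor daemons start.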
Paul:

The problem here is that the shadow uses a surprisingly large number of file descriptors. It turns out that on Linux, one fd is consumed per shared library that an application links with, for the duration of the process. Depending on the platform, the shadow is dynamically linked to something like 40 libraries, so 64k open fds is probably still not enough for 2000 jobs. Can you up the per-user fd limit by a factor of 10 more?
We'll update the wiki with this information.
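
To sanity-check that estimate on a live submit host, here is a rough sketch (Python 3, standard library only; a hypothetical script, and note that /proc/<pid>/fd is only readable by root or the process owner) that counts the descriptors each running condor_shadow currently holds:

import os

def shadow_fd_counts():
    counts = {}
    for pid in os.listdir("/proc"):
        if not pid.isdigit():
            continue
        try:
            with open(f"/proc/{pid}/comm") as f:
                name = f.read().strip()
            if name != "condor_shadow":
                continue
            counts[int(pid)] = len(os.listdir(f"/proc/{pid}/fd"))
        except (FileNotFoundError, PermissionError):
            # The process exited between the listing and the read, or we
            # lack permission to inspect it; skip it either way.
            continue
    return counts

if __name__ == "__main__":
    counts = shadow_fd_counts()
    total = sum(counts.values())
    per_shadow = total / len(counts) if counts else 0
    print(f"{len(counts)} shadows, {total} fds total, "
          f"~{per_shadow:.0f} fds per shadow")
    # e.g. ~40 fds per shadow times 2000 shadows is already ~80,000 fds,
    # which is why 65,536 leaves no headroom for 2000 concurrent jobs.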
Paul, are you talking about /proc/sys/fs/file-max? i.e., you have the system-wide limit set to 65536? I would expect this number to be over a million on a typical server-class system.
--Dan
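
For completeness, a quick sketch (again Python 3 on Linux) that prints the kernel-wide fs.file-max ceiling Dan is asking about next to the per-process limit Paul raised, since the two are easy to conflate:

import resource

def print_fd_limits():
    # System-wide ceiling on open file handles across all processes.
    with open("/proc/sys/fs/file-max") as f:
        file_max = int(f.read().split()[0])
    # file-nr reports: handles allocated, handles unused, and the max.
    with open("/proc/sys/fs/file-nr") as f:
        allocated, _unused, _max = (int(x) for x in f.read().split())
    # Per-process limit, i.e. what "ulimit -n" shows for this shell/user.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

    print(f"system-wide ceiling (fs.file-max): {file_max}")
    print(f"file handles currently allocated:  {allocated}")
    print(f"per-process limit (ulimit -n):     soft={soft}, hard={hard}")

if __name__ == "__main__":
    print_fd_limits()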