Hi Paul,
On Sep 27, 2013, at 2:04 PM, Paul Brenner <paul.r.brenner@xxxxxx> wrote:
> RedHat 6.X with 128 GB of RAM and 64 cores on the frontend that most recently crashed. I would certainly guess that Scientific Linux defaults are much higher for "non-enterprise workloads". I suppose if you figure a "default" RedHat/CentOS box in the enterprise world is a modest LAMP webserver, the file descriptor defaults may not be that surprising. As mentioned, we have used the current configuration for many years with Grid Engine running 10K+ concurrent jobs and never hit a file descriptor limit. In the world of Condor/HTC the configuration tuning is definitely weighted differently.
>
> Good to know that SL runs at a million plus as a "default". Our smallest head node has 32 GB of RAM, so growing 10x from 65K to 650K should be reasonably low risk.
>
Actually, the behavior I described (the initial limit being scaled from installed memory) is a kernel default, not SL-specific tuning. I'm surprised that RHEL would tune it lower! I'm kinda scratching my head on that one.
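For a ballpark: the kernel heuristic (files_init() in fs/file_table.c, if I'm remembering right) allows roughly one file handle per 10 KB of installed RAM. A quick back-of-the-envelope in Python; the 10-KB-per-handle ratio is my reading of the kernel source, so treat it as an approximation, not gospel:

# Approximation of the kernel's memory-scaled fs.file-max default
# (roughly one file handle allowed per 10 KB of installed RAM).

def approx_file_max(ram_bytes):
    """Estimate the default fs.file-max for a box with ram_bytes of RAM."""
    return ram_bytes // (10 * 1024)  # ~1 handle per 10 KB, per the kernel heuristic

for gb in (32, 128):
    ram = gb * 1024**3
    print("%4d GB RAM -> fs.file-max ~ %s" % (gb, format(approx_file_max(ram), ",")))

That lands your 128 GB frontend in the tens of millions, which is consistent with SL defaulting to "a million plus" rather than 65K.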
I didn't know that running out of file descriptors could crash the kernel. What do the tracebacks look like?
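Incidentally, if you do raise it to 650K, remember that the system-wide ceiling (fs.file-max) and the per-process limit (ulimit -n / RLIMIT_NOFILE) are tuned separately, so it's easy to bump one and forget the other. A quick sketch in Python for comparing the two; it just reads the standard procfs files, which I'm assuming look the same on RHEL 6:

import resource

# System-wide ceiling, and current usage from /proc/sys/fs/file-nr
# (three fields: allocated, allocated-but-unused, max).
with open("/proc/sys/fs/file-max") as f:
    file_max = int(f.read().strip())
with open("/proc/sys/fs/file-nr") as f:
    allocated, unused, max_files = (int(x) for x in f.read().split())

# Per-process limit (what ulimit -n reports): soft and hard RLIMIT_NOFILE.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

print("system-wide fs.file-max :", format(file_max, ","))
print("handles in use          :", format(allocated - unused, ","))
print("per-process nofile      : soft=%d hard=%d" % (soft, hard))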
Brian