Hi Stephen,
To find out why the schedd cannot connect to a target daemon that is no
longer registered with CCB, it may help to look in the target daemon's
log file, if you can locate it. If the daemon is still running at the
time when it is not registered with CCB, you should see a log message
saying that it became disconnected from CCB, followed by periodic
attempts to reconnect to CCB. The disconnect message may help explain
why this is happening. If, on the other hand, the daemon is not alive,
then we need to understand why. The log file may help with that too.
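
For what it is worth, here is a minimal Python sketch of how one might
pull the CCB-related lines out of a daemon log. It only assumes that the
relevant messages contain the string "CCB", as the rejection message on
the CCB server side does. The helper name and the log path in the usage
comment are hypothetical, so adjust them to your installation.

```python
def ccb_lines(lines):
    """Return the log lines that mention CCB, with trailing newlines removed.

    `lines` is any iterable of strings, for example an open log file.
    Matching on the literal substring "CCB" is an assumption, not a
    documented log format.
    """
    return [line.rstrip("\n") for line in lines if "CCB" in line]


# Usage (the path is a placeholder, not a real default location):
#
#     with open("/path/to/target/daemon/log", errors="replace") as log:
#         for line in ccb_lines(log):
#             print(line)
```

Reading the matches in order should show the disconnect first and the
reconnect attempts after it.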

Regarding the exhaustion of file descriptors: if condor is started as
root (the default for an RPM installation), the best way to configure
the maximum number of file descriptors available to the collector is to
use something like the following setting in the HTCondor configuration
file:

    COLLECTOR_MAX_FILE_DESCRIPTORS = 10000

When the collector starts up, you will see a line in the log file that
looks like this:

    "Setting maximum file descriptors to 10000."

If condor is started as root, it can set its limit higher than the
default hard limit. If it is not started as root, it cannot raise the
hard limit and can only lower it. I recommend using this configuration
setting rather than trying to set the per-process default, for two
reasons. First, some mechanisms for setting the per-process default
(e.g. PAM settings) are not necessarily applied to condor processes.
Second, a huge file descriptor limit for all processes can have
unwanted consequences. For example, many processes use more memory when
the file descriptor limit is high. For a process such as the
condor_shadow, this may add up to a lot of memory, since there may be
many instances of the shadow process.
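
To illustrate the root versus non-root distinction, here is a small
Python sketch using the standard `resource` module. It is not HTCondor's
own code, only a demonstration of the POSIX rule that the hard limit is
a ceiling which an unprivileged process cannot raise. The function name
is made up for this example.

```python
import resource


def set_soft_nofile(requested):
    """Set this process's soft file descriptor limit, capped at the hard limit.

    Returns the (soft, hard) pair in effect afterwards. The hard limit is
    left untouched, which is all an unprivileged process can rely on.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        # Asking for more than the hard limit would fail without privilege,
        # so clamp the request instead.
        requested = min(requested, hard)
    resource.setrlimit(resource.RLIMIT_NOFILE, (requested, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)
```

Calling `set_soft_nofile(10000)` as an ordinary user on a machine whose
hard limit is lower than 10000 leaves the process at the hard limit,
which is why the collector needs to start as root for a larger value to
take effect.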
--Dan
On 2/8/13 1:46 PM, Stephen Pietrowicz wrote:
Hi,
I'm seeing the following message a significant number of times in some
of the larger runs we've started to do:
02/08/13 12:22:26 CCB: rejecting request from SCHEDD
<www.xxx.yyy.zzz:50190> on <www.xxx.yyy.zzz:40460> for ccbid 6987
because no daemon is currently registered with that id (perhaps it
recently disconnected).
Eventually, we get:
**** PANIC -- OUT OF FILE DESCRIPTORS at line 175 in
/slots/01/dir_65060/userdir/src/condor_io/reli_sock.cpp
And in /var/log/messages, I'm seeing:
Feb 8 10:59:59 lsst-launch kernel: possible SYN flooding on port
9618. Sending cookies.
We had been running jobs of about 500 slots or so, and have started
trying to run at 1000+ slots simultaneously. The Collector machine and
the submit machine have both had the file descriptor limit raised to
over 400,000 per process.
Any ideas?
Steve
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/