Hi again,

Brian and I looked at this off-list and so far it seems we hit a TCP limit on the host, though it is not yet clear to me what the problem is.

Most telling are lines like

05/21/20 19:19:42 (2076445.0) (2492275): attempt to connect to <10.20.30.17:14867> failed: Connection timed out (connect errno = 110). Will keep trying for 300 total seconds (269 to go).

from the ShadowLog, where the shadows cannot connect to the schedd on the very same host anymore...

The current work-around for us is to use

MAX_JOBS_RUNNING = 30000

as we saw this happening when we had about 40k shadow processes on the host.

We have tried playing with the usual sysctl suspects, e.g.

net.core.somaxconn
net.core.netdev_max_backlog
net.ipv4.ip_local_port_range
net.ipv4.tcp_fin_timeout
net.ipv4.tcp_max_syn_backlog

but to no avail :(

If anyone has submit hosts on Linux beyond 40k active shadows, please let us know what we are missing ;-)

Cheers

Carsten

-- 
Dr. Carsten Aulbert, Max Planck Institute for Gravitational Physics,
Callinstraße 38, 30167 Hannover, Germany
Phone: +49 511 762 17185
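
For anyone who wants to compare their own submit host against the sysctls named above, here is a minimal sketch that prints their current values. It assumes a Linux host with /proc mounted; the helper names sysctl_path and read_sysctl are made up for illustration and are not part of HTCondor or any library.

```python
from pathlib import Path

# The sysctl names listed in the message above.
SYSCTLS = [
    "net.core.somaxconn",
    "net.core.netdev_max_backlog",
    "net.ipv4.ip_local_port_range",
    "net.ipv4.tcp_fin_timeout",
    "net.ipv4.tcp_max_syn_backlog",
]


def sysctl_path(name, root="/proc/sys"):
    """Map a dotted sysctl name to its file below /proc/sys (hypothetical helper)."""
    return Path(root) / name.replace(".", "/")


def read_sysctl(name, root="/proc/sys"):
    """Return the current value as a string, or None if it cannot be read."""
    try:
        text = sysctl_path(name, root).read_text()
    except OSError:
        return None
    # Multi-valued entries such as ip_local_port_range are tab-separated.
    return " ".join(text.split())


if __name__ == "__main__":
    for name in SYSCTLS:
        value = read_sysctl(name)
        print(f"{name} = {value if value is not None else '(not readable)'}")
```

This only reports what is currently set; it does not say which of these values, if any, is the limit being hit.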