We are seeing the following messages in the schedd log file (SchedLog). We have confirmed it is not a permission issue. The problem occurs intermittently when users submit a batch of jobs (approximately 1-4k jobs per batch): the schedd becomes unresponsive.
06/05/24 10:35:45 (pid:370014) condor_write(): Socket closed when trying to write 588 bytes to <10.xx.xx.xx:26799>, fd is 84
06/05/24 10:35:45 (pid:370014) Buf::write(): condor_write() failed
06/05/24 10:35:45 (pid:370014) Failed to send done message to job query.
06/05/24 10:35:45 (pid:31558) Number of Active Workers 3
06/05/24 10:35:45 (pid:370015) condor_write() failed: send() 101 bytes to <10.xx.xx.xx:26107> returned -1, timeout=15, errno=32 Broken pipe.
06/05/24 10:35:45 (pid:370015) Buf::write(): condor_write() failed
Resolving the IP addresses from the "Socket closed" messages to hostnames shows all the masters (the primary and the flocked ones) and the schedd itself. No worker node appears in this list.
# grep 'Socket closed when trying to write ' /var/log/condor/SchedLog | awk '{print $14}' | awk -F'[><]' '{print $2}' | awk -F':' '{if ($1 ~ /10/) {print $1}}' | sort | uniq | xargs -I{} host {}
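For anyone who wants to correlate these messages with the submission bursts, the same extraction can be sketched in Python. This is only an illustration that assumes the line layout shown in the excerpt above; the names SOCKET_CLOSED and count_closed_sockets are hypothetical helpers, not part of HTCondor. It counts "Socket closed" messages per peer IP and per minute.

```python
import re
from collections import Counter

# Assumed layout, taken from the SchedLog excerpt in this message:
# MM/DD/YY HH:MM:SS (pid:N) condor_write(): Socket closed when trying
# to write N bytes to <IP:PORT>, fd is N
SOCKET_CLOSED = re.compile(
    r"^(?P<date>\S+) (?P<time>\S+) \(pid:(?P<pid>\d+)\) "
    r"condor_write\(\): Socket closed when trying to write \d+ bytes "
    r"to <(?P<ip>[^:>]+):(?P<port>\d+)>"
)


def count_closed_sockets(lines):
    """Count 'Socket closed' messages per peer IP and per minute."""
    by_peer = Counter()
    by_minute = Counter()
    for line in lines:
        m = SOCKET_CLOSED.match(line)
        if m is None:
            continue  # any other log line is ignored
        by_peer[m["ip"]] += 1
        # Truncate HH:MM:SS to HH:MM to bucket by minute.
        by_minute[f'{m["date"]} {m["time"][:5]}'] += 1
    return by_peer, by_minute


# Sample lines copied from the excerpt above (IP masked as in the original).
sample = [
    "06/05/24 10:35:45 (pid:370014) condor_write(): Socket closed when "
    "trying to write 588 bytes to <10.xx.xx.xx:26799>, fd is 84",
    "06/05/24 10:35:45 (pid:370014) Buf::write(): condor_write() failed",
    "06/05/24 10:35:45 (pid:31558) Number of Active Workers 3",
]
peers, minutes = count_closed_sockets(sample)
for ip, n in peers.most_common():
    print(ip, n)
```

To run it against the real log, pass an open file instead of the sample list, for example count_closed_sockets(open("/var/log/condor/SchedLog")). The per-minute counts can then be compared with the times at which the large batches were submitted.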
There are no interface drops, and network reliability looks fine. What could be the reason for the "Socket closed" messages?
Version is 9.0.17 everywhere.
Thanks & Regards,
Vikrant Aggarwal