[HTCondor-users] Sched SECMAN:2007:Failed to end classad message.

Mailing List Archives Authenticated access	UW Madison Computer Sciences Department Computer Systems Lab

Hello Experts,

We are seeing the following messages in the schedd log file. We have confirmed it is not a permission issue. This happens when users submit a batch of jobs (approx. 1-4k jobs per batch); intermittently the schedd becomes unresponsive.

06/05/24 10:35:45 (pid:370014) condor_write(): Socket closed when trying to write 588 bytes to <10.xx.xx.xx:26799>, fd is 84
06/05/24 10:35:45 (pid:370014) Buf::write(): condor_write() failed
06/05/24 10:35:45 (pid:370014) Failed to send done message to job query.
06/05/24 10:35:45 (pid:31558) Number of Active Workers 3
06/05/24 10:35:45 (pid:370015) condor_write() failed: send() 101 bytes to <10.xx.xx.xx:26107> returned -1, timeout=15, errno=32 Broken pipe.
06/05/24 10:35:45 (pid:370015) Buf::write(): condor_write() failed

Resolving the IP addresses to hostnames shows that the "Socket closed" list contains all the masters (both primary and flocked) plus the schedd host itself. No worker node appears in this list.

# grep 'Socket closed when trying to write ' /var/log/condor/SchedLog | awk '{print $14}' | awk -F'[><]' '{print $2}' | awk -F':' '{if ($1 ~ /10/) {print $1}}' | sort | uniq | xargs -I{} host {}
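A possible variation on the one-liner above, in case it helps narrow things down: instead of only listing the distinct peers, it counts how many "Socket closed" failures each peer accounts for. This is a rough sketch, not something tested against a real SchedLog; the function name count_closed_peers is made up for illustration, and it assumes the peer address always appears as <ip:port> as in the excerpt above.

```shell
# Count "Socket closed" write failures per peer IP.
# Reads SchedLog-style lines on stdin and prints "<count> <ip>",
# highest count first. (count_closed_peers is a hypothetical helper name.)
count_closed_peers() {
    grep 'Socket closed when trying to write' |
        # Keep only the IP part of the <ip:port> token.
        sed -n 's/.*<\([^:>]*\):[0-9]*>.*/\1/p' |
        sort | uniq -c | sort -rn |
        # Normalize the padded output of uniq -c to "count ip".
        awk '{print $1, $2}'
}
```

It could be run as: count_closed_peers < /var/log/condor/SchedLog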

Interface drop counters and network reliability look fine and show no problems. What could be the reason for the "Socket closed" messages?

The version is 9.0.17 everywhere.

Thanks & Regards,

Vikrant Aggarwal
