
Re: [HTCondor-users] Sched SECMAN:2007:Failed to end classad message.



The usual cause for this error (socket closed unexpectedly) is that the schedd is very busy and took too long to respond to the client's request (the usual client timeout is 20 seconds). The SchedLog doesn't show when the client's request first arrived, so it's hard to confirm this without data from the client side. This particular request was a condor_q (or htcondor job status) command.
If you see a large time gap in the SchedLog just before this error, that can indicate that a single operation tied up the schedd, rather than the schedd simply having too much work to do.
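
One way to look for such gaps is a small script like the sketch below. It is only an illustration, not an HTCondor tool: the function name find_gaps is made up, it assumes the SchedLog uses the "MM/DD/YY HH:MM:SS" timestamp prefix seen in the excerpt quoted below, and the 20-second default threshold is chosen only to match the usual client timeout.

```python
from datetime import datetime

# Assumed timestamp prefix, e.g. "06/05/24 10:35:45" (17 characters).
STAMP_FORMAT = "%m/%d/%y %H:%M:%S"
STAMP_LENGTH = 17


def find_gaps(lines, threshold_seconds=20.0):
    """Return (gap_seconds, previous_line, current_line) for every pair of
    consecutive timestamped lines at least threshold_seconds apart."""
    gaps = []
    prev_time = None
    prev_line = None
    for raw in lines:
        line = raw.rstrip("\n")
        try:
            stamp = datetime.strptime(line[:STAMP_LENGTH], STAMP_FORMAT)
        except ValueError:
            # Not a timestamped line (continuation or other format): skip it.
            continue
        if prev_time is not None:
            gap = (stamp - prev_time).total_seconds()
            if gap >= threshold_seconds:
                gaps.append((gap, prev_line, line))
        prev_time, prev_line = stamp, line
    return gaps
```

Passing an open file works, for example find_gaps(open("/var/log/condor/SchedLog", errors="replace")). Each returned tuple holds the gap in seconds plus the log lines on either side of it, so you can check whether a gap ends right before a "Socket closed" message.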

 - Jaime

> On Jun 5, 2024, at 2:38 PM, Vikrant Aggarwal <ervikrant06@xxxxxxxxx> wrote:
> 
> Hello Experts,
> 
> We are seeing the following messages in the SchedLog file. We have confirmed it is not a permissions issue. It happens when users submit batches of jobs (approx. 1-4k jobs per batch): the schedd intermittently becomes unresponsive.
> 
> 06/05/24 10:35:45 (pid:370014) condor_write(): Socket closed when trying to write 588 bytes to <10.xx.xx.xx:26799>, fd is 84
> 06/05/24 10:35:45 (pid:370014) Buf::write(): condor_write() failed
> 06/05/24 10:35:45 (pid:370014) Failed to send done message to job query.
> 06/05/24 10:35:45 (pid:31558) Number of Active Workers 3
> 06/05/24 10:35:45 (pid:370015) condor_write() failed: send() 101 bytes to <10.xx.xx.xx:26107> returned -1, timeout=15, errno=32 Broken pipe.
> 06/05/24 10:35:45 (pid:370015) Buf::write(): condor_write() failed
> 
> Resolving the IP addresses to hostnames shows that every host in the "Socket closed" list is a master (primary or flocked) or the schedd itself. No worker node appears in this list.
> 
> # grep 'Socket closed when trying to write ' /var/log/condor/SchedLog | awk '{print $14}' | awk -F'[><]' '{print $2}' | awk -F':' '{if ($1 ~ /10/) {print $1}}' | sort | uniq | xargs -I{} host {}
> 
> We see no interface drops or network reliability problems. What could be the reason for the socket closed messages?
> 
> Version is 9.0.17 everywhere. 
> 
> 
> Thanks & Regards,
> Vikrant Aggarwal