
Re: [HTCondor-users] Sched SECMAN:2007:Failed to end classad message.



On Jun 6, 2024, at 12:01 PM, Greg Thain via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:

On 6/6/24 11:57, Jaime Frey via HTCondor-users wrote:
The usual cause for this error (socket closed unexpectedly) is that the schedd is very busy and took too long to respond to the client's request (usual client timeout is 20 seconds). The SchedLog doesn't show when the client's request first arrived, so it's hard to confirm this without data from the client side. This particular request was a condor_q (or htcondor job status) command.
If you see a large time gap in the SchedLog just before this error, that can indicate that a single issue tied up the schedd, rather than the schedd simply having too much work to do.
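
One way to look for such a gap is a small script along these lines. This is only a sketch: it assumes each SchedLog line starts with a timestamp in MM/DD/YY HH:MM:SS form (which can differ depending on your logging configuration), and the function name find_gaps and the 20-second threshold (chosen to match the usual client timeout mentioned above) are illustrative choices, not anything provided by HTCondor.

```python
import re
from datetime import datetime

# Assumed timestamp layout at the start of each log line: MM/DD/YY HH:MM:SS
_TIMESTAMP = re.compile(r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2})")


def find_gaps(lines, threshold_seconds=20):
    """Return (gap_seconds, line_before, line_after) for each pair of
    consecutive timestamped lines at least threshold_seconds apart."""
    gaps = []
    prev_time = None
    prev_line = None
    for line in lines:
        match = _TIMESTAMP.match(line)
        if not match:
            continue  # continuation lines or other formats are skipped
        try:
            stamp = datetime.strptime(match.group(1), "%m/%d/%y %H:%M:%S")
        except ValueError:
            continue  # digits matched but do not form a valid date
        if prev_time is not None:
            delta = (stamp - prev_time).total_seconds()
            if delta >= threshold_seconds:
                gaps.append((delta, prev_line.rstrip("\n"), line.rstrip("\n")))
        prev_time, prev_line = stamp, line
    return gaps
```

To use it, open your SchedLog and pass the file object in, e.g. `with open(path) as f: gaps = find_gaps(f)`, where path is wherever your schedd writes its log. A gap that ends right at the SECMAN:2007 line would be consistent with the schedd having been blocked on one long operation.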


Also note that there have been a large number of performance improvements to the system since HTCondor 9.0.

Also also note that submitting 1k-4k jobs in a single submit action can take a number of seconds and the schedd will ignore all other requests until the submit is done.

In most cases, submitting a large batch of jobs can be made much more efficient by using late materialization: https://htcondor.readthedocs.io/en/latest/users-manual/submitting-a-job.html#submitting-lots-of-jobs
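
As a rough illustration of what that looks like, the sketch below writes out a submit description that uses the max_materialize submit command, so that only a limited number of jobs exist in the queue at any one time. The helper name render_submit_description, the executable name my_job.sh, and the limit of 100 are placeholders made up for this example; check the linked manual page for the options that apply to your version and whether your schedd allows late materialization.

```python
def render_submit_description(executable, total_jobs, max_materialize):
    """Build the text of a submit description that asks the schedd to
    materialize at most max_materialize jobs at a time."""
    lines = [
        f"executable = {executable}",
        "arguments = $(ProcId)",
        # Cap on how many jobs of this cluster are materialized at once.
        f"max_materialize = {max_materialize}",
        f"queue {total_jobs}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder values: 4000 jobs, at most 100 materialized at once.
    print(render_submit_description("my_job.sh", 4000, 100), end="")
```

The printed text can be saved to a file and passed to condor_submit as usual.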

 - Jaime