
Re: [HTCondor-users] Sched SECMAN:2007:Failed to end classad message.



Thanks guys for the suggestions.

We are already looking into late materialization. Updating to a newer version such as 23.x will take some time.

We noticed this issue can happen with smaller batches as well.

For example, in this case we have approximately 3500 jobs in the queue while new batches are being submitted. I see a batch of 20 jobs being submitted to the queue and causing the slowness. Roughly 20-30 jobs are submitted to the queue per minute, which is not a large number. Even if we move to late materialization, we can't go below this rate; otherwise job completions will be delayed significantly.

I see the following log entries, followed by silence in the log file, when the condor_q output starts hanging.

06/07/24 13:06:53 (pid:54976) job_transforms for 147247.0: 1 considered, 1 applied (SetTestTeam)
06/07/24 13:06:53 (pid:54976) job_transforms for 147247.1: 1 considered, 1 applied (SetTestTeam)
06/07/24 13:06:53 (pid:54976) job_transforms for 147247.2: 1 considered, 1 applied (SetTestTeam)
06/07/24 13:06:53 (pid:54976) job_transforms for 147247.3: 1 considered, 1 applied (SetTestTeam)
06/07/24 13:06:53 (pid:54976) job_transforms for 147247.4: 1 considered, 1 applied (SetTestTeam)
06/07/24 13:06:53 (pid:54976) job_transforms for 147247.5: 1 considered, 1 applied (SetTestTeam)
06/07/24 13:06:53 (pid:54976) job_transforms for 147247.6: 1 considered, 1 applied (SetTestTeam)
06/07/24 13:06:53 (pid:54976) job_transforms for 147247.7: 1 considered, 1 applied (SetTestTeam)
06/07/24 13:06:53 (pid:54976) job_transforms for 147247.8: 1 considered, 1 applied (SetTestTeam)
06/07/24 13:06:53 (pid:54976) job_transforms for 147247.9: 1 considered, 1 applied (SetTestTeam)
06/07/24 13:06:53 (pid:54976) job_transforms for 147247.10: 1 considered, 1 applied (SetTestTeam)
06/07/24 13:06:53 (pid:54976) job_transforms for 147247.11: 1 considered, 1 applied (SetTestTeam)
06/07/24 13:06:53 (pid:54976) job_transforms for 147247.12: 1 considered, 1 applied (SetTestTeam)
06/07/24 13:06:53 (pid:54976) job_transforms for 147247.13: 1 considered, 1 applied (SetTestTeam)
06/07/24 13:06:53 (pid:54976) job_transforms for 147247.14: 1 considered, 1 applied (SetTestTeam)
06/07/24 13:06:53 (pid:54976) job_transforms for 147247.15: 1 considered, 1 applied (SetTestTeam)
06/07/24 13:06:53 (pid:54976) job_transforms for 147247.16: 1 considered, 1 applied (SetTestTeam)
06/07/24 13:06:53 (pid:54976) job_transforms for 147247.17: 1 considered, 1 applied (SetTestTeam)
06/07/24 13:06:53 (pid:54976) job_transforms for 147247.18: 1 considered, 1 applied (SetTestTeam)
06/07/24 13:06:53 (pid:54976) job_transforms for 147247.19: 1 considered, 1 applied (SetTestTeam)
06/07/24 13:06:53 (pid:54976) job_transforms for 147247.20: 1 considered, 1 applied (SetTestTeam)    <<<<< about 3 minutes of log entries missing from the log file
06/07/24 13:09:43 (pid:54976) condor_write(): Socket closed when trying to write 47 bytes to <10.xx.xx.xx:28479>, fd is 25

Our transform is very simple: it only adds a tag to the job.


Thanks & Regards,
Vikrant Aggarwal


On Thu, Jun 6, 2024 at 2:20 PM Jaime Frey via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:
On Jun 6, 2024, at 12:01 PM, Greg Thain via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:

On 6/6/24 11:57, Jaime Frey via HTCondor-users wrote:
The usual cause for this error (socket closed unexpectedly) is that the schedd is very busy and took too long to respond to the client's request (usual client timeout is 20 seconds). The SchedLog doesn't show when the client's request first arrived, so it's hard to confirm this without data from the client side. This particular request was a condor_q (or htcondor job status) command.
If you see a large time gap in the SchedLog just before this error, that can indicate a singular issue tying up the schedd as the cause, instead of too much work to do.
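
One way to look for such gaps is to scan the SchedLog for consecutive timestamps that are far apart. The following is only a rough sketch: it assumes the log lines begin with a "MM/DD/YY HH:MM:SS" timestamp as in the excerpt in this thread, and the function name find_log_gaps and the 20-second default threshold (chosen to match the usual client timeout) are illustrative choices, not part of HTCondor.

```python
import re
from datetime import datetime

# Assumption: each log line starts with "MM/DD/YY HH:MM:SS",
# as in the SchedLog excerpt quoted in this thread.
TIMESTAMP = re.compile(r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2})\b")


def find_log_gaps(lines, min_gap_seconds=20):
    """Return (gap_seconds, line_before, line_after) for each silence
    of at least min_gap_seconds between consecutive timestamped lines."""
    gaps = []
    prev_time = None
    prev_line = None
    for line in lines:
        match = TIMESTAMP.match(line)
        if not match:
            continue  # skip continuation lines without a timestamp
        stamp = datetime.strptime(match.group(1), "%m/%d/%y %H:%M:%S")
        if prev_time is not None:
            delta = (stamp - prev_time).total_seconds()
            if delta >= min_gap_seconds:
                gaps.append((delta, prev_line.rstrip(), line.rstrip()))
        prev_time, prev_line = stamp, line
    return gaps


if __name__ == "__main__":
    import sys

    # Usage: python find_gaps.py /path/to/SchedLog
    with open(sys.argv[1], errors="replace") as handle:
        for seconds, before, after in find_log_gaps(handle):
            print(f"{seconds:.0f}s gap")
            print(f"  before: {before}")
            print(f"  after:  {after}")
```

A gap that sits right before a condor_write() "Socket closed" line would point at a single long-running operation in the schedd rather than general overload.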


Also note that there have been a large number of performance improvements to the system since condor 9.0.

Also also note that submitting 1k-4k jobs in a single submit action can take a number of seconds and the schedd will ignore all other requests until the submit is done.

In most cases, submitting a large batch of jobs can be made much more efficient by using late materialization: https://htcondor.readthedocs.io/en/latest/users-manual/submitting-a-job.html#submitting-lots-of-jobs
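
As a rough illustration, the small helper below renders a submit description that uses the max_materialize and max_idle submit commands, which are the knobs described on that documentation page. The helper name render_submit, the executable path, and the numbers are all hypothetical; whether the schedd accepts late materialization also depends on its configuration and version, so treat this as a sketch rather than a recommended setup.

```python
def render_submit(executable, total_jobs, max_materialize, max_idle=None):
    """Build the text of a submit description that asks the schedd to
    materialize jobs lazily instead of creating them all at submit time.

    max_materialize caps how many jobs of the cluster exist in the queue
    at once; max_idle (optional) caps how many of them may sit idle.
    """
    lines = [
        f"executable = {executable}",
        "arguments = $(ProcId)",
        f"max_materialize = {max_materialize}",
    ]
    if max_idle is not None:
        lines.append(f"max_idle = {max_idle}")
    # A single queue statement describes the whole batch; the schedd
    # creates the individual jobs as the limits above allow.
    lines.append(f"queue {total_jobs}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical values; write the output to a file and pass it to condor_submit.
    print(render_submit("/path/to/run_test.sh", total_jobs=4000,
                        max_materialize=50, max_idle=20), end="")
```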

 - Jaime
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/