SCHEDD_QUERY_WORKERS applies only to queries. This is because queries can be done without modifying the state of the job queue, so it is safe to create a forked worker to handle the query.
Anything that changes the state or contents of the job queue cannot be done with a forked worker, so changing this number has no impact on the number of simultaneous submits that can be performed.
The Schedd enforces that only one submit command can run at any one time. While that command is running, the Schedd will not respond to any other command. Any forked worker that started before the submit will continue to process queries, but new commands, including queries, will wait. If they wait too long, they will time out.
So you can't really have simultaneous submits. Only one will be processed at a time, and the rest will wait with a timeout.
A failed submit will not corrupt the queue. It will either succeed, or it will fail without making any changes.
So all you really need to do is make sure that you aren't writing scripts that try to submit jobs simultaneously from multiple threads or processes, or make sure that you retry the submit if it times out.
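As a rough illustration of both suggestions, here is a minimal Python sketch that serializes submits from scripts on one host and retries a failed submit with a growing delay. Everything except the condor_submit command itself is an assumption made for the example: the function names, the lock file path, the retry counts, and the choice to treat any nonzero exit status as retryable (a real script would probably want to tell a timeout apart from, say, a syntax error in the submit file). The lock uses fcntl, so it is Unix-only and only coordinates processes that use this same wrapper.

```python
import fcntl
import subprocess
import time
from contextlib import contextmanager


@contextmanager
def submit_lock(path):
    """Hold an exclusive advisory lock so cooperating scripts submit one at a time."""
    with open(path, "w") as fh:
        fcntl.flock(fh, fcntl.LOCK_EX)  # blocks until any other holder releases
        try:
            yield
        finally:
            fcntl.flock(fh, fcntl.LOCK_UN)


def submit_with_retry(submit_file, attempts=5, base_delay=2.0,
                      lock_path="/tmp/condor_submit.lock",  # hypothetical path
                      run=subprocess.run, sleep=time.sleep):
    """Run condor_submit under the lock, retrying with exponential backoff.

    Assumption: any nonzero exit status is treated as a retryable failure.
    """
    last = None
    for attempt in range(attempts):
        # Take the lock per attempt so it is not held while backing off.
        with submit_lock(lock_path):
            last = run(["condor_submit", submit_file],
                       capture_output=True, text=True)
        if last.returncode == 0:
            return last
        if attempt < attempts - 1:
            sleep(base_delay * 2 ** attempt)  # 2s, 4s, 8s, ... by default
    raise RuntimeError(
        f"condor_submit failed after {attempts} attempts: {last.stderr.strip()}"
    )
```

A script would call something like submit_with_retry("job.sub") in place of invoking condor_submit directly. Retrying is reasonable here because, as noted above, a failed submit leaves the queue unchanged.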
-tj
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Geonmo Ryu <geonmo@xxxxxxxxxxx>
Sent: Wednesday, July 24, 2024 7:47 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] Settings for handling multiple condor_submit

Hello, everyone.
We have increased SCHEDD_QUERY_WORKERS from its original value of 8 to 50.
We suspect that this setting allows users to run condor_submit at the same time.
Multiple simultaneous condor_submits seem to cause the SharedPort error shown below.
07/24/24 17:25:17 SharedPortServer: server was busy, failed to connect schedd_630778_d213 as requested by STARTD <134.75.127.149:9618?addrs=134.75.127.149-9618&alias=bio-wn3012.sdfarm.kr&noUDP&sock=startd_266664_e6fe> on <134.75.127.149:33761>: primary (<cookie>/schedd_630778_d213): Connection refused (111); alt (/var/lock/condor/daemon_sock/schedd_630778_d213): Connection refused (111)
Is there a command that prevents condor_submit from being run simultaneously, or a way to ensure that simultaneous submits complete without the error?
Also, I'm having issues when a job submission takes a long time and runs across a negotiation cycle. Is there a workaround?
Regards,
-- Geonmo