Hi all,

we have recently increased the size of our StartDs and are seeing strange failures during job starts. The machines now have a single partitionable 192-CPU StartD instead of the 2 x 96-CPU StartD layout we were using previously. The setup is puppetized to be identical aside from merging the two partitionable StartDs into one.

What we observe is that when the large machines pull jobs after draining, there is a huge number of failures while the Shadow requests the claim from the StartD: the StartD cannot reply because the socket is closed [0], and the Shadow times out waiting for the reply [1]. There are several dozen of these failures when things go wrong; it may also be that the timeout happens before the failed write, as we cannot match the two sides accurately.

Strangely, the critical volume seems to lie between 96 and 100 jobs starting at once on the same StartD. Below that everything works fine; above that, many more jobs fail than just the surplus. So it looks like we hit some limit at which Condor is not able to handle all the jobs at once.

Is there any knob we should look at to help with many simultaneous job starts? Is this a known issue, be it in Condor itself or something we may have messed up ourselves, e.g. in the networking? Or should we just put a limit on how many jobs may start at once?

Cheers,
Max

PS: In case it's relevant, these are identical test jobs created with `queue 100` (or whatever volume we test with).

[0] StartLog

12/03/21 12:12:24 (pid:3700) (D_ALWAYS) slot1_56: Got activate_claim request from shadow (2a00:139c:3:2e5:0:61:d2:6c)
12/03/21 12:12:24 (pid:3700) (D_ALWAYS) condor_write(): Socket closed when trying to write 29 bytes to <[2a00:139c:3:2e5:0:61:d2:6c]:15444>, fd is 12
12/03/21 12:12:24 (pid:3700) (D_ALWAYS) Buf::write(): condor_write() failed
12/03/21 12:12:24 (pid:3700) (D_ALWAYS) slot1_56: Can't send eom to shadow.

[1] ShadowLog

12/03/21 12:12:37 (pid:3615484) (D_ALWAYS) (15310.259) (3615484): condor_read(): timeout reading 21 bytes from startd slot1@xxxxxxxxxxxxxxxxxxxxxx
12/03/21 12:12:37 (pid:3614255) (D_ALWAYS) (15310.701) (3614255): RemoteResource::killStarter(): Could not send command to startd
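
PPS: To make the "limit on how many jobs may start at once" idea concrete: if I read the manual correctly, the Schedd-side JOB_START_COUNT / JOB_START_DELAY pair throttles how quickly shadows are spawned. A rough sketch of what we would try is below; the numbers are placeholders we have not tested, and I may be misreading how the two knobs interact.

# condor_schedd configuration (untested sketch): start at most 20 jobs
# back-to-back, then wait 1 second before starting the next batch.
JOB_START_COUNT = 20
JOB_START_DELAY = 1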