Hello,

I am using HTCondor LTS 24.0.7 on several Windows 10/11 machines. When I submit a small number of jobs (~20), they finish without any problems. Once I increase the number of jobs (100+), the first few still complete successfully. After that, the whole process seems to get stuck in an infinite loop: jobs keep switching between the idle and running states, and none of them ever completes. All jobs are essentially the same; only the input parameters change. Each job should take roughly a minute to finish.

Here is an excerpt from one starter log (CPU Slot1) of one of the execution machines (which in this case is also the central manager and the machine the jobs were submitted from). I have marked the lines that look suspicious to me.

06/04/25 10:09:14 (pid:7344) ******************************************************
06/04/25 10:09:14 (pid:7344) ** condor_starter (CONDOR_STARTER) STARTING UP
06/04/25 10:09:14 (pid:7344) ** C:\condor\bin\condor_starter.exe
06/04/25 10:09:14 (pid:7344) ** SubsystemInfo: name=STARTER type=STARTER(7) class=DAEMON(1)
06/04/25 10:09:14 (pid:7344) ** Configuration: subsystem:STARTER local:slot_type_1 class:DAEMON
06/04/25 10:09:14 (pid:7344) ** $CondorVersion: 24.0.7 2025-04-22 BuildID: 803687 GitSHA: 51b71b5c $
06/04/25 10:09:14 (pid:7344) ** $CondorPlatform: x86_64_Windows10 $
06/04/25 10:09:14 (pid:7344) ** PID = 7344
06/04/25 10:09:14 (pid:7344) ** Log last touched 6/4 10:08:52
06/04/25 10:09:14 (pid:7344) ******************************************************
06/04/25 10:09:14 (pid:7344) Using config source: C:\condor\condor_config
06/04/25 10:09:14 (pid:7344) Using local config sources:
06/04/25 10:09:14 (pid:7344) C:\condor\condor_config.local
06/04/25 10:09:14 (pid:7344) C:\condor\condor_config.local.credd
06/04/25 10:09:14 (pid:7344) config Macros = 78, Sorted = 76, StringBytes = 1876, TablesBytes = 2864
06/04/25 10:09:14 (pid:7344) CLASSAD_CACHING is OFF
06/04/25 10:09:14 (pid:7344) Daemon Log is logging: D_ALWAYS D_ERROR D_STATUS
06/04/25 10:09:14 (pid:7344) SharedPortEndpoint: listener already created.
06/04/25 10:09:14 (pid:7344) DaemonCore: command socket at <192.168.101.45:9618?addrs=192.168.101.45-9618&alias=<machine>.<domain.ext>&noUDP&sock=slot1_17160_b616_5544>
06/04/25 10:09:14 (pid:7344) DaemonCore: private command socket at <192.168.101.45:9618?addrs=192.168.101.45-9618&alias=<machine>.<domain.ext>&noUDP&sock=slot1_17160_b616_5544>
06/04/25 10:09:14 (pid:7344) Communicating with shadow <192.168.101.45:9618?addrs=192.168.101.45-9618&alias=<machine>.<domain.ext>&noUDP&sock=shadow_13780_5fdc_40858>
06/04/25 10:09:14 (pid:7344) Submitting machine is "<machine>.<domain.ext>"
06/04/25 10:09:14 (pid:7344) setting the orig job name in starter
06/04/25 10:09:14 (pid:7344) setting the orig job iwd in starter
06/04/25 10:09:14 (pid:7344) Chirp config summary: IO false, Updates false, Delayed updates true.
06/04/25 10:09:14 (pid:7344) Initialized IO Proxy.
06/04/25 10:09:14 (pid:7344) Setting resource limits not implemented!
06/04/25 10:09:14 (pid:7344) Set filetransfer runtime ads to C:\condor\execute\dir_7344\.job.ad and C:\condor\execute\dir_7344\.machine.ad.
06/04/25 10:09:14 (pid:7344) Not entering transfer queue because sandbox (2137310) is too small (<= 104857600).
06/04/25 10:09:16 (pid:7344) File transfer completed successfully.
06/04/25 10:09:16 (pid:7344) Job 635.13 set to execute immediately
06/04/25 10:09:16 (pid:7344) Starting a VANILLA universe job with ID: 635.13
06/04/25 10:09:16 (pid:7344) IWD: C:\condor\execute\dir_7344
06/04/25 10:09:16 (pid:7344) Input file: C:\condor\execute\dir_7344\Fa0_V1600_00.inf
06/04/25 10:09:16 (pid:7344) Output file: C:\condor\execute\dir_7344\_condor_stdout
06/04/25 10:09:16 (pid:7344) Error file: C:\condor\execute\dir_7344\_condor_stderr
06/04/25 10:09:17 (pid:7344) Renice expr "10" evaluated to 10
06/04/25 10:09:17 (pid:7344) Running job as user dennis.neuhaus
06/04/25 10:09:17 (pid:7344) About to exec C:\condor\execute\dir_7344\Local_TS_Sim.01.bat Fa0_V1600_00.inf
06/04/25 10:09:17 (pid:7344) Executable is a batch file, running: "C:\WINDOWS\system32\cmd.exe" /Q /C "C:\condor\execute\dir_7344\Local_TS_Sim.01.bat" Fa0_V1600_00.inf
06/04/25 10:09:17 (pid:7344) Create_Process succeeded, pid=15824
06/04/25 10:10:20 (pid:7344) Process exited, pid=15824, status=0
06/04/25 10:10:20 (pid:7344) Not entering transfer queue because sandbox (24504674) is too small (<= 104857600).
06/04/25 10:10:21 (pid:7344) condor_read(): Socket closed abnormally when trying to read 5 bytes from daemon at <192.168.101.45:9618>, errno=10054
06/04/25 10:10:21 (pid:7344) DoUpload: STARTER at 192.168.101.45 failed to send file(s) to <192.168.101.45:9618>
06/04/25 10:10:21 (pid:7344) condor_write(): Socket closed when trying to write 163 bytes to <192.168.101.45:63778>, fd is 1996, errno=10054
06/04/25 10:10:21 (pid:7344) Buf::write(): condor_write() failed
06/04/25 10:10:21 (pid:7344) i/o error result is 0, errno is 0 (No error)
06/04/25 10:10:21 (pid:7344) Lost connection to shadow, last activity was 1 secs ago, waiting 2399 secs for reconnect
06/04/25 10:10:21 (pid:7344) File transfer failed, forcing disconnect.
06/04/25 10:10:21 (pid:7344) Returning from Starter::JobReaper()
06/04/25 10:10:21 (pid:7344) Result of "unregister_family" operation from ProcD: ERROR: No family with the given PID is registered
06/04/25 10:10:21 (pid:7344) error unregistering pid 15824 with the procd
06/04/25 10:10:21 (pid:7344) Got SIGTERM. Performing graceful shutdown.
06/04/25 10:10:21 (pid:7344) ShutdownGraceful all jobs.
06/04/25 10:10:21 (pid:7344) RPC error: disconnected from shadow
06/04/25 10:10:21 (pid:7344) RPC error: disconnected from shadow
06/04/25 10:10:21 (pid:7344) Failed to send job exit status to shadow
06/04/25 10:10:21 (pid:7344) JobExit() failed, waiting for job lease to expire or for a reconnect attempt
06/04/25 10:10:23 (pid:7344) Got SIGTERM, but we've already started graceful shutdown. Ignoring.
06/04/25 10:10:51 (pid:7344) Got SIGQUIT. Performing fast shutdown.
06/04/25 10:10:51 (pid:7344) ShutdownFast all jobs.
06/04/25 10:10:51 (pid:7344) RPC error: disconnected from shadow
06/04/25 10:10:51 (pid:7344) Failed to send job exit status to shadow
06/04/25 10:10:51 (pid:7344) All jobs have exited... starter exiting
06/04/25 10:10:51 (pid:7344) **** condor_starter (condor_STARTER) pid 7344 EXITING WITH STATUS 0

I can see that the jobs are actually running (by looking into the execute folder), and they produce the expected output files. The log shows that the job itself takes about a minute, as expected. The problem seems to be with transferring the results back. errno=10054 also points to a problem with the network connection.

Our corporate network seems to work fine from a normal user's perspective (e.g. working with Samba shares). Could there still be a performance problem, and should I get in touch with our system administrator? Or is there another explanation for this behavior? Is there a setting in the condor_config files that might at least mitigate these issues?
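
To check whether every failing job follows the same pattern, the starter logs could be summarized with a small script. What follows is only a minimal Python sketch, not an HTCondor tool. It assumes the line layout seen in the excerpt (timestamp, "(pid:N)", message); the function name summarize and the small error-code table are hypothetical names chosen for illustration. Error 10054 is the Winsock code WSAECONNRESET (connection reset by peer).

```python
import re
from datetime import datetime

# A few Winsock error codes; 10054 is the one that appears in the log excerpt.
WINSOCK_ERRORS = {
    10053: "WSAECONNABORTED",
    10054: "WSAECONNRESET",
    10060: "WSAETIMEDOUT",
}

# Assumed layout of a starter log line: "MM/DD/YY HH:MM:SS (pid:N) message"
LINE_RE = re.compile(r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \(pid:\d+\) (.*)$")
CREATED_RE = re.compile(r"Create_Process succeeded, pid=(\d+)")
EXITED_RE = re.compile(r"Process exited, pid=(\d+)")
ERRNO_RE = re.compile(r"errno=(\d+)")


def summarize(lines):
    """Return (runtimes, errors) extracted from an iterable of starter log lines.

    runtimes: list of (job_pid, seconds between Create_Process and Process exited)
    errors:   list of (timestamp, errno, symbolic name, message)
    """
    started = {}
    runtimes = []
    errors = []
    for raw in lines:
        match = LINE_RE.match(raw.strip())
        if not match:
            continue  # skip anything that is not a regular log line
        stamp = datetime.strptime(match.group(1), "%m/%d/%y %H:%M:%S")
        message = match.group(2)

        created = CREATED_RE.search(message)
        exited = EXITED_RE.search(message)
        if created:
            started[created.group(1)] = stamp
        elif exited and exited.group(1) in started:
            begin = started.pop(exited.group(1))
            runtimes.append((exited.group(1), (stamp - begin).total_seconds()))

        err = ERRNO_RE.search(message)
        if err:
            code = int(err.group(1))
            errors.append((stamp, code, WINSOCK_ERRORS.get(code, "unknown"), message))
    return runtimes, errors
```

Calling summarize(open(path)) on each starter log would show whether the connection reset always occurs right after "Process exited", that is, while the output files are being sent back to the shadow.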
Dennis