
Re: [HTCondor-users] Jobs results not transferring back - errno=10054



The errors you highlighted show that the condor_shadow daemon on the AP is closing the network connection when the starter attempts to do output transfer. The ShadowLog on the AP should have details on why it's rejecting the transfer.

 - Jaime
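
One way to follow this advice is to pull the ShadowLog lines written around the moment the starter lost its connection (10:10:21 in the excerpt quoted below). The following is only a sketch: it assumes ShadowLog lines begin with the same `MM/DD/YY HH:MM:SS` timestamp as the StarterLog, and the function name `lines_near` is made up for illustration.

```python
from datetime import datetime, timedelta

# Assumed timestamp layout, taken from the StarterLog excerpt ("06/04/25 10:10:21").
TS_FORMAT = "%m/%d/%y %H:%M:%S"
TS_LENGTH = 17  # len("06/04/25 10:10:21")


def lines_near(log_lines, when, window_seconds=5):
    """Return the log lines whose leading timestamp lies within
    window_seconds of `when` (a string in TS_FORMAT)."""
    center = datetime.strptime(when, TS_FORMAT)
    window = timedelta(seconds=window_seconds)
    hits = []
    for line in log_lines:
        try:
            stamp = datetime.strptime(line[:TS_LENGTH], TS_FORMAT)
        except ValueError:
            continue  # line does not start with a timestamp; skip it
        if abs(stamp - center) <= window:
            hits.append(line.rstrip("\n"))
    return hits
```

Passing an open ShadowLog file object as `log_lines` and `"06/04/25 10:10:21"` as `when` would return the shadow's messages from the seconds around the failed transfer.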

On Jun 4, 2025, at 4:54 AM, Dennis Neuhaus <dennis.neuhaus@xxxxxxxxxxx> wrote:

Hello,
 
I am using HTCondor LTS 24.0.7 on several Windows 10 / 11 machines.
When I submit a small number of jobs (~20), they finish without any problems. Once I increase the number of jobs (100+), the first few still complete successfully. After that, the whole process seems to get stuck in an infinite loop, with jobs switching between the idle and running states and none of them ever completing. All jobs are pretty much the same; only the input parameters change. Each job should usually take roughly a minute to finish.
 
Here is an excerpt from one starter log (CPU Slot1) of one of the execution machines (which in this case also happens to be the central manager and is also the machine that the jobs have been submitted from).
I have marked the lines that look suspicious to me.
 
06/04/25 10:09:14 (pid:7344) ******************************************************
06/04/25 10:09:14 (pid:7344) ** condor_starter (CONDOR_STARTER) STARTING UP
06/04/25 10:09:14 (pid:7344) ** C:\condor\bin\condor_starter.exe
06/04/25 10:09:14 (pid:7344) ** SubsystemInfo: name=STARTER type=STARTER(7) class=DAEMON(1)
06/04/25 10:09:14 (pid:7344) ** Configuration: subsystem:STARTER local:slot_type_1 class:DAEMON
06/04/25 10:09:14 (pid:7344) ** $CondorVersion: 24.0.7 2025-04-22 BuildID: 803687 GitSHA: 51b71b5c $
06/04/25 10:09:14 (pid:7344) ** $CondorPlatform: x86_64_Windows10 $
06/04/25 10:09:14 (pid:7344) ** PID = 7344
06/04/25 10:09:14 (pid:7344) ** Log last touched 6/4 10:08:52
06/04/25 10:09:14 (pid:7344) ******************************************************
06/04/25 10:09:14 (pid:7344) Using config source: C:\condor\condor_config
06/04/25 10:09:14 (pid:7344) Using local config sources:
06/04/25 10:09:14 (pid:7344)    C:\condor\condor_config.local
06/04/25 10:09:14 (pid:7344)    C:\condor\condor_config.local.credd
06/04/25 10:09:14 (pid:7344) config Macros = 78, Sorted = 76, StringBytes = 1876, TablesBytes = 2864
06/04/25 10:09:14 (pid:7344) CLASSAD_CACHING is OFF
06/04/25 10:09:14 (pid:7344) Daemon Log is logging: D_ALWAYS D_ERROR D_STATUS
06/04/25 10:09:14 (pid:7344) SharedPortEndpoint: listener already created.
06/04/25 10:09:14 (pid:7344) DaemonCore: command socket at <192.168.101.45:9618?addrs=192.168.101.45-9618&alias=<machine>.<domain.ext>&noUDP&sock=slot1_17160_b616_5544>
06/04/25 10:09:14 (pid:7344) DaemonCore: private command socket at <192.168.101.45:9618?addrs=192.168.101.45-9618&alias=<machine>.<domain.ext>&noUDP&sock=slot1_17160_b616_5544>
06/04/25 10:09:14 (pid:7344) Communicating with shadow <192.168.101.45:9618?addrs=192.168.101.45-9618&alias=<machine>.<domain.ext>&noUDP&sock=shadow_13780_5fdc_40858>
06/04/25 10:09:14 (pid:7344) Submitting machine is "<machine>.<domain.ext>"
06/04/25 10:09:14 (pid:7344) setting the orig job name in starter
06/04/25 10:09:14 (pid:7344) setting the orig job iwd in starter
06/04/25 10:09:14 (pid:7344) Chirp config summary: IO false, Updates false, Delayed updates true.
06/04/25 10:09:14 (pid:7344) Initialized IO Proxy.
06/04/25 10:09:14 (pid:7344) Setting resource limits not implemented!
06/04/25 10:09:14 (pid:7344) Set filetransfer runtime ads to C:\condor\execute\dir_7344\.job.ad and C:\condor\execute\dir_7344\.machine.ad.
06/04/25 10:09:14 (pid:7344) Not entering transfer queue because sandbox (2137310) is too small (<= 104857600).
06/04/25 10:09:16 (pid:7344) File transfer completed successfully.
06/04/25 10:09:16 (pid:7344) Job 635.13 set to execute immediately
06/04/25 10:09:16 (pid:7344) Starting a VANILLA universe job with ID: 635.13
06/04/25 10:09:16 (pid:7344) IWD: C:\condor\execute\dir_7344
06/04/25 10:09:16 (pid:7344) Input file: C:\condor\execute\dir_7344\Fa0_V1600_00.inf
06/04/25 10:09:16 (pid:7344) Output file: C:\condor\execute\dir_7344\_condor_stdout
06/04/25 10:09:16 (pid:7344) Error file: C:\condor\execute\dir_7344\_condor_stderr
06/04/25 10:09:17 (pid:7344) Renice expr "10" evaluated to 10
06/04/25 10:09:17 (pid:7344) Running job as user dennis.neuhaus
06/04/25 10:09:17 (pid:7344) About to exec C:\condor\execute\dir_7344\Local_TS_Sim.01.bat Fa0_V1600_00.inf
06/04/25 10:09:17 (pid:7344) Executable is a batch file, running: "C:\WINDOWS\system32\cmd.exe" /Q /C "C:\condor\execute\dir_7344\Local_TS_Sim.01.bat" Fa0_V1600_00.inf
06/04/25 10:09:17 (pid:7344) Create_Process succeeded, pid=15824
06/04/25 10:10:20 (pid:7344) Process exited, pid=15824, status=0
06/04/25 10:10:20 (pid:7344) Not entering transfer queue because sandbox (24504674) is too small (<= 104857600).
06/04/25 10:10:21 (pid:7344) condor_read(): Socket closed abnormally when trying to read 5 bytes from daemon at <192.168.101.45:9618>, errno=10054
06/04/25 10:10:21 (pid:7344) DoUpload: STARTER at 192.168.101.45 failed to send file(s) to <192.168.101.45:9618>
06/04/25 10:10:21 (pid:7344) condor_write(): Socket closed when trying to write 163 bytes to <192.168.101.45:63778>, fd is 1996, errno=10054
06/04/25 10:10:21 (pid:7344) Buf::write(): condor_write() failed
06/04/25 10:10:21 (pid:7344) i/o error result is 0, errno is 0 (No error)
06/04/25 10:10:21 (pid:7344) Lost connection to shadow, last activity was 1 secs ago, waiting 2399 secs for reconnect
06/04/25 10:10:21 (pid:7344) File transfer failed, forcing disconnect.
06/04/25 10:10:21 (pid:7344) Returning from Starter::JobReaper()
06/04/25 10:10:21 (pid:7344) Result of "unregister_family" operation from ProcD: ERROR: No family with the given PID is registered
06/04/25 10:10:21 (pid:7344) error unregistering pid 15824 with the procd
06/04/25 10:10:21 (pid:7344) Got SIGTERM. Performing graceful shutdown.
06/04/25 10:10:21 (pid:7344) ShutdownGraceful all jobs.
06/04/25 10:10:21 (pid:7344) RPC error: disconnected from shadow
06/04/25 10:10:21 (pid:7344) RPC error: disconnected from shadow
06/04/25 10:10:21 (pid:7344) Failed to send job exit status to shadow
06/04/25 10:10:21 (pid:7344) JobExit() failed, waiting for job lease to expire or for a reconnect attempt
06/04/25 10:10:23 (pid:7344) Got SIGTERM, but we've already started graceful shutdown.  Ignoring.
06/04/25 10:10:51 (pid:7344) Got SIGQUIT.  Performing fast shutdown.
06/04/25 10:10:51 (pid:7344) ShutdownFast all jobs.
06/04/25 10:10:51 (pid:7344) RPC error: disconnected from shadow
06/04/25 10:10:51 (pid:7344) Failed to send job exit status to shadow
06/04/25 10:10:51 (pid:7344) All jobs have exited... starter exiting
06/04/25 10:10:51 (pid:7344) **** condor_starter (condor_STARTER) pid 7344 EXITING WITH STATUS 0
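
The job's wall-clock time can be read off the `Create_Process succeeded` and `Process exited` lines in the excerpt above. A minimal sketch, assuming every line has the `MM/DD/YY HH:MM:SS (pid:N) message` layout shown there; the helper name `job_runtimes` is hypothetical:

```python
import re
from datetime import datetime

TS_FORMAT = "%m/%d/%y %H:%M:%S"
# Timestamp, starter pid, then the message text.
LINE_RE = re.compile(r"^(\d\d/\d\d/\d\d \d\d:\d\d:\d\d) \(pid:\d+\) (.*)$")
# The two starter messages that bracket the job process's lifetime.
EVENT_RE = re.compile(r"(Create_Process succeeded|Process exited), pid=(\d+)")


def job_runtimes(log_lines):
    """Map each job process pid to the seconds between its
    'Create_Process succeeded' and 'Process exited' lines."""
    started, runtimes = {}, {}
    for line in log_lines:
        parsed = LINE_RE.match(line)
        if parsed is None:
            continue
        event = EVENT_RE.match(parsed.group(2))
        if event is None:
            continue
        stamp = datetime.strptime(parsed.group(1), TS_FORMAT)
        kind, pid = event.group(1), int(event.group(2))
        if kind == "Create_Process succeeded":
            started[pid] = stamp
        elif pid in started:
            runtimes[pid] = (stamp - started.pop(pid)).total_seconds()
    return runtimes
```

For the two lines in the excerpt (10:09:17 and 10:10:20, pid 15824) this gives 63 seconds.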
 
I can see that the jobs are actually running (by looking into the execute folder) and they produce the expected output (files). The log reveals that the process takes about a minute, as expected. There seems to be an issue with transferring the results back.
errno=10054 also indicates that there is a problem with the network connection.
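
For reference, 10054 is the Winsock error code WSAECONNRESET: the remote side reset the connection. The sketch below counts the `errno=` values in a StarterLog and labels a few common Winsock codes. The lookup table is deliberately partial, and the function name `summarize_errnos` is made up for illustration.

```python
import re

# A few Winsock error codes; not an exhaustive list.
WINSOCK_ERRORS = {
    10053: "WSAECONNABORTED (connection aborted by local software)",
    10054: "WSAECONNRESET (connection reset by peer)",
    10060: "WSAETIMEDOUT (connection timed out)",
    10061: "WSAECONNREFUSED (connection refused)",
}
# Matches the "errno=NNNN" form used by condor_read()/condor_write() messages.
ERRNO_RE = re.compile(r"errno=(\d+)")


def summarize_errnos(log_lines):
    """Return {code: (occurrences, description)} for every errno=N in the log."""
    counts = {}
    for line in log_lines:
        for code in ERRNO_RE.findall(line):
            counts[int(code)] = counts.get(int(code), 0) + 1
    return {
        code: (n, WINSOCK_ERRORS.get(code, "unknown"))
        for code, n in counts.items()
    }
```

A reset by the peer points at the process on the other end of the socket (here the shadow) closing the connection, rather than at a slow or lossy network as such.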
 
Our corporate network does seem to work fine from a normal user's perspective (i.e. working with samba shares etc.). Could there still be a performance problem and should I get in touch with our system administrator?
 
Or is there another explanation for this behavior?
 
Is there some setting that I can make in the condor_config files that may at least mitigate these issues?

Thank you and best regards
Dennis
 
Dipl.-Ing. Dennis Neuhaus
Simulation - Lastenrechnung / Loads
windwise GmbH
Augustastraße 47/49a
48153 Münster
 
Mobile: +49 151 55113651
Internet: www.windwise.eu
 
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe

Join us in June at Throughput Computing 25: https://osg-htc.org/htc25

The archives can be found at: https://www-auth.cs.wisc.edu/lists/htcondor-users/