
[HTCondor-users] Jobs results not transferring back - errno=10054



Hello,

 

I am using HTCondor LTS 24.0.7 on several Windows 10/11 machines.

When I submit a small number of jobs (~20), they all finish without any problems. Once I increase the number of jobs (100+), the first few still complete successfully. After that, the whole process seems to get stuck in an endless loop: running jobs keep switching between the idle and running states, and none of them ever completes. All jobs are essentially the same; only the input parameters change. Each job usually takes roughly a minute to finish.

 

Here is an excerpt from the starter log for slot1 on one of the execute machines. In this case, that machine is also the central manager and the machine the jobs were submitted from.

I have marked the lines that look suspicious to me.

 

06/04/25 10:09:14 (pid:7344) ******************************************************

06/04/25 10:09:14 (pid:7344) ** condor_starter (CONDOR_STARTER) STARTING UP

06/04/25 10:09:14 (pid:7344) ** C:\condor\bin\condor_starter.exe

06/04/25 10:09:14 (pid:7344) ** SubsystemInfo: name=STARTER type=STARTER(7) class=DAEMON(1)

06/04/25 10:09:14 (pid:7344) ** Configuration: subsystem:STARTER local:slot_type_1 class:DAEMON

06/04/25 10:09:14 (pid:7344) ** $CondorVersion: 24.0.7 2025-04-22 BuildID: 803687 GitSHA: 51b71b5c $

06/04/25 10:09:14 (pid:7344) ** $CondorPlatform: x86_64_Windows10 $

06/04/25 10:09:14 (pid:7344) ** PID = 7344

06/04/25 10:09:14 (pid:7344) ** Log last touched 6/4 10:08:52

06/04/25 10:09:14 (pid:7344) ******************************************************

06/04/25 10:09:14 (pid:7344) Using config source: C:\condor\condor_config

06/04/25 10:09:14 (pid:7344) Using local config sources:

06/04/25 10:09:14 (pid:7344)    C:\condor\condor_config.local

06/04/25 10:09:14 (pid:7344)    C:\condor\condor_config.local.credd

06/04/25 10:09:14 (pid:7344) config Macros = 78, Sorted = 76, StringBytes = 1876, TablesBytes = 2864

06/04/25 10:09:14 (pid:7344) CLASSAD_CACHING is OFF

06/04/25 10:09:14 (pid:7344) Daemon Log is logging: D_ALWAYS D_ERROR D_STATUS

06/04/25 10:09:14 (pid:7344) SharedPortEndpoint: listener already created.

06/04/25 10:09:14 (pid:7344) DaemonCore: command socket at <192.168.101.45:9618?addrs=192.168.101.45-9618&alias=<machine>.<domain.ext>&noUDP&sock=slot1_17160_b616_5544>

06/04/25 10:09:14 (pid:7344) DaemonCore: private command socket at <192.168.101.45:9618?addrs=192.168.101.45-9618&alias=<machine>.<domain.ext>&noUDP&sock=slot1_17160_b616_5544>

06/04/25 10:09:14 (pid:7344) Communicating with shadow <192.168.101.45:9618?addrs=192.168.101.45-9618&alias=<machine>.<domain.ext>&noUDP&sock=shadow_13780_5fdc_40858>

06/04/25 10:09:14 (pid:7344) Submitting machine is "<machine>.<domain.ext>"

06/04/25 10:09:14 (pid:7344) setting the orig job name in starter

06/04/25 10:09:14 (pid:7344) setting the orig job iwd in starter

06/04/25 10:09:14 (pid:7344) Chirp config summary: IO false, Updates false, Delayed updates true.

06/04/25 10:09:14 (pid:7344) Initialized IO Proxy.

06/04/25 10:09:14 (pid:7344) Setting resource limits not implemented!

06/04/25 10:09:14 (pid:7344) Set filetransfer runtime ads to C:\condor\execute\dir_7344\.job.ad and C:\condor\execute\dir_7344\.machine.ad.

06/04/25 10:09:14 (pid:7344) Not entering transfer queue because sandbox (2137310) is too small (<= 104857600).

06/04/25 10:09:16 (pid:7344) File transfer completed successfully.

06/04/25 10:09:16 (pid:7344) Job 635.13 set to execute immediately

06/04/25 10:09:16 (pid:7344) Starting a VANILLA universe job with ID: 635.13

06/04/25 10:09:16 (pid:7344) IWD: C:\condor\execute\dir_7344

06/04/25 10:09:16 (pid:7344) Input file: C:\condor\execute\dir_7344\Fa0_V1600_00.inf

06/04/25 10:09:16 (pid:7344) Output file: C:\condor\execute\dir_7344\_condor_stdout

06/04/25 10:09:16 (pid:7344) Error file: C:\condor\execute\dir_7344\_condor_stderr

06/04/25 10:09:17 (pid:7344) Renice expr "10" evaluated to 10

06/04/25 10:09:17 (pid:7344) Running job as user dennis.neuhaus

06/04/25 10:09:17 (pid:7344) About to exec C:\condor\execute\dir_7344\Local_TS_Sim.01.bat Fa0_V1600_00.inf

06/04/25 10:09:17 (pid:7344) Executable is a batch file, running: "C:\WINDOWS\system32\cmd.exe" /Q /C "C:\condor\execute\dir_7344\Local_TS_Sim.01.bat" Fa0_V1600_00.inf

06/04/25 10:09:17 (pid:7344) Create_Process succeeded, pid=15824

06/04/25 10:10:20 (pid:7344) Process exited, pid=15824, status=0

06/04/25 10:10:20 (pid:7344) Not entering transfer queue because sandbox (24504674) is too small (<= 104857600).

06/04/25 10:10:21 (pid:7344) condor_read(): Socket closed abnormally when trying to read 5 bytes from daemon at <192.168.101.45:9618>, errno=10054

06/04/25 10:10:21 (pid:7344) DoUpload: STARTER at 192.168.101.45 failed to send file(s) to <192.168.101.45:9618>

06/04/25 10:10:21 (pid:7344) condor_write(): Socket closed when trying to write 163 bytes to <192.168.101.45:63778>, fd is 1996, errno=10054

06/04/25 10:10:21 (pid:7344) Buf::write(): condor_write() failed

06/04/25 10:10:21 (pid:7344) i/o error result is 0, errno is 0 (No error)

06/04/25 10:10:21 (pid:7344) Lost connection to shadow, last activity was 1 secs ago, waiting 2399 secs for reconnect

06/04/25 10:10:21 (pid:7344) File transfer failed, forcing disconnect.

06/04/25 10:10:21 (pid:7344) Returning from Starter::JobReaper()

06/04/25 10:10:21 (pid:7344) Result of "unregister_family" operation from ProcD: ERROR: No family with the given PID is registered

06/04/25 10:10:21 (pid:7344) error unregistering pid 15824 with the procd

06/04/25 10:10:21 (pid:7344) Got SIGTERM. Performing graceful shutdown.

06/04/25 10:10:21 (pid:7344) ShutdownGraceful all jobs.

06/04/25 10:10:21 (pid:7344) RPC error: disconnected from shadow

06/04/25 10:10:21 (pid:7344) RPC error: disconnected from shadow

06/04/25 10:10:21 (pid:7344) Failed to send job exit status to shadow

06/04/25 10:10:21 (pid:7344) JobExit() failed, waiting for job lease to expire or for a reconnect attempt

06/04/25 10:10:23 (pid:7344) Got SIGTERM, but we've already started graceful shutdown.  Ignoring.

06/04/25 10:10:51 (pid:7344) Got SIGQUIT.  Performing fast shutdown.

06/04/25 10:10:51 (pid:7344) ShutdownFast all jobs.

06/04/25 10:10:51 (pid:7344) RPC error: disconnected from shadow

06/04/25 10:10:51 (pid:7344) Failed to send job exit status to shadow

06/04/25 10:10:51 (pid:7344) All jobs have exited... starter exiting

06/04/25 10:10:51 (pid:7344) **** condor_starter (condor_STARTER) pid 7344 EXITING WITH STATUS 0

 

I can see that the jobs are actually running (by looking into the execute folder), and they produce the expected output files. The log shows that the job process takes about a minute, as expected (started at 10:09:17, exited at 10:10:20 with status 0). The problem seems to be in transferring the results back.
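
To double-check that timing, here is a minimal Python sketch that reads the relevant lines from the excerpt above and computes the job runtime and the first failure message. The helper names (parse, job_runtime_seconds, first_failure) are my own and not part of HTCondor, and the line format is assumed to match this excerpt only.

```python
import re
from datetime import datetime

# Four lines copied from the starter log excerpt above.
LOG_EXCERPT = """\
06/04/25 10:09:17 (pid:7344) Create_Process succeeded, pid=15824
06/04/25 10:10:20 (pid:7344) Process exited, pid=15824, status=0
06/04/25 10:10:21 (pid:7344) condor_read(): Socket closed abnormally when trying to read 5 bytes from daemon at <192.168.101.45:9618>, errno=10054
06/04/25 10:10:21 (pid:7344) File transfer failed, forcing disconnect.
"""

# Assumed line layout: "MM/DD/YY HH:MM:SS (pid:N) message"
LINE_RE = re.compile(r"^(\d\d/\d\d/\d\d \d\d:\d\d:\d\d) \(pid:\d+\) (.*)$")


def parse(log_text):
    """Yield (timestamp, message) for every line matching the assumed layout."""
    for line in log_text.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            yield datetime.strptime(m.group(1), "%m/%d/%y %H:%M:%S"), m.group(2)


def job_runtime_seconds(log_text):
    """Seconds between 'Create_Process succeeded' and 'Process exited'."""
    start = end = None
    for ts, msg in parse(log_text):
        if msg.startswith("Create_Process succeeded"):
            start = ts
        elif msg.startswith("Process exited"):
            end = ts
    if start is None or end is None:
        return None
    return (end - start).total_seconds()


def first_failure(log_text):
    """Return (timestamp, message) of the first line that reports an error."""
    for ts, msg in parse(log_text):
        if "errno=" in msg or "failed" in msg.lower():
            return ts, msg
    return None


if __name__ == "__main__":
    print("job runtime (s):", job_runtime_seconds(LOG_EXCERPT))
    print("first failure:", first_failure(LOG_EXCERPT))
```

On this excerpt the job itself runs for 63 seconds, and the first error appears one second after the process exits, during the upload of the results.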

errno=10054 (on Windows this is WSAECONNRESET, i.e. the connection was reset by the peer) also suggests that there is a problem with the network connection.
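
For reference, this is how I am reading that error code: a small Python sketch that pulls the errno value out of a log line and looks it up in a hand-written table of a few common Winsock codes. The table and the function names are my own illustration and not something HTCondor provides; the code values and names are the ones Microsoft documents for Winsock.

```python
import re

# A few Winsock error codes as documented by Microsoft (not a complete list).
WINSOCK_ERRORS = {
    10053: ("WSAECONNABORTED", "Software caused connection abort"),
    10054: ("WSAECONNRESET", "Connection reset by peer"),
    10060: ("WSAETIMEDOUT", "Connection timed out"),
    10061: ("WSAECONNREFUSED", "Connection refused"),
}


def errno_from_log_line(line):
    """Return the integer after 'errno=' in a log line, or None if absent."""
    m = re.search(r"errno=(\d+)", line)
    return int(m.group(1)) if m else None


def describe_winsock_error(code):
    """Return 'code NAME: description' for codes in the table above."""
    name, text = WINSOCK_ERRORS.get(code, ("UNKNOWN", "not in this table"))
    return f"{code} {name}: {text}"


if __name__ == "__main__":
    line = ("condor_write(): Socket closed when trying to write 163 bytes "
            "to <192.168.101.45:63778>, fd is 1996, errno=10054")
    print(describe_winsock_error(errno_from_log_line(line)))
```

So the starter is reporting that the other end of the connection closed it forcibly; the code on its own does not say why.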

 

Our corporate network seems to work fine from a normal user’s perspective (e.g. when working with Samba shares). Could there still be a performance problem, and should I get in touch with our system administrator?

 

Or is there another explanation for this behavior?

 

Is there a setting in the condor_config files that might at least mitigate these issues?


Thank you and best regards

Dennis

 

Dipl.-Ing. Dennis Neuhaus

Simulation – Lastenrechnung / Loads

windwise GmbH

Augustastraße 47/49a

48153 Münster

 

Mobile: +49 151 55113651

E-Mail: dennis.neuhaus@xxxxxxxxxxx

Internet: www.windwise.eu

 


 

windwise GmbH | Sitz/Reg. Amtsgericht Münster Aktenzeichen: HRB 15484 | Geschäftsführer: Markus Becker, Benno Sandmann

 

 

 

 

 

 

 

