
[HTCondor-users] Download called during active transfer



Hi,

 

We're running 24.0.7 on the schedds (24.0.3 on the workers), and we're seeing some user jobs restarted many times because the shadow exits on an exception like the following:

 

12/09/25 15:28:18 (pid:3543188) (13437533.0) (3543188): ERROR "FileTransfer::Download called during active transfer!" at line 2044 in file /var/lib/condor/execute/slot1/dir_2606725/userdir/build-qRBc1D/BUILD/condor-24.0.7/src/condor_utils/file_transfer.cpp

12/09/25 15:28:18 (pid:3543188) (13437533.0) (3543188): Daemon exiting before all child processes gone; killing 3543192

 

This is matched by the following on the starter:

 

12/09/25 15:28:17 (pid:2029580) File transfer failed (status=0).

12/09/25 15:28:17 (pid:2029580) Failed to transfer files:  reason unknown.

12/09/25 15:28:17 (pid:2029580) Skipping execution of Job 13437533.0 because of setup failure.

12/09/25 15:28:18 (pid:2029580) condor_read(): Socket closed abnormally when trying to read 5 bytes from daemon at <188.185.121.235:9618>, errno=104 Connection reset by peer

12/09/25 15:28:18 (pid:2029580) Failed to receive GoAhead message from 188.185.121.235.

12/09/25 15:28:18 (pid:2029580) DoUpload: exiting at 5220

 

The jobs are retried, and eventually they either succeed or hit the maximum number of retries and are put on hold.
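
To gauge how often each job hits this exception, the shadow log can be scanned for the ERROR line quoted above. The following is a minimal sketch in Python, assuming the log line layout shown in this message; the function name count_download_exceptions and the regular expression are illustrative and not part of HTCondor.

```python
import re
from collections import Counter

# Matches shadow log lines shaped like the ERROR line quoted above:
#   <date> <time> (pid:N) (cluster.proc) (N): ERROR "FileTransfer::Download called during active transfer!" ...
# Assumption: the log uses exactly this prefix layout.
LINE_RE = re.compile(
    r'^\S+ \S+ \(pid:\d+\) \((?P<job>\d+\.\d+)\) \(\d+\): '
    r'ERROR "FileTransfer::Download called during active transfer!"'
)


def count_download_exceptions(lines):
    """Return a Counter mapping job ID (cluster.proc) to the number of matching ERROR lines."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            counts[match.group("job")] += 1
    return counts
```

Passing an open shadow log file (or any iterable of lines) returns per-job counts, and calling .most_common() on the result lists the most affected jobs first.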

 

I've been seeing this recently for particular schedds and users. I don't know whether it's just caused by errors accessing the job's input files (on AFS), but the error seems to indicate that a file transfer was already active when a new one was initiated, and that this immediately causes the shadow to exit. If that's really the case, how could that happen? Has anybody seen something similar?

 

Thank you!

 

Cheers,

    Antonio