
Re: [HTCondor-users] Download called during active transfer



It looks to me like the failure of the input transfer, combined with the large number of input files, is causing the Starter to try to begin the output transfer (to report the failure info) before the Shadow has finished cleaning up the child process it was using for the input transfer.

That is what produces the message about trying to do a transfer while a transfer is already in progress, but the message is a consequence of the earlier failure, not the root cause.

This is a race condition that we need to fix. We fixed at least one similar race a few days ago, and that fix will be released in early January. I'm not sure whether the race we fixed is exactly the one you are seeing here, but it is similar.
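
[Editor's note: for illustration, here is a minimal sketch of the kind of guard that raises this exception. The class and member names are invented for the example; this is not HTCondor's actual code. The idea is that the transfer object records the pid of the child doing a transfer, refuses to start a new transfer while that pid is still set, and the refusal takes the whole daemon down:]

    #include <cstdio>
    #include <cstdlib>
    #include <sys/types.h>

    class FileTransferSketch {
    public:
        void Download() {
            if (active_transfer_pid != 0) {
                // Analogous to the EXCEPT() at file_transfer.cpp:2044:
                // the previous transfer's child has not been reaped yet.
                std::fprintf(stderr,
                    "FileTransfer::Download called during active transfer!\n");
                std::abort();   // the daemon exits, killing its children
            }
            active_transfer_pid = ForkTransferChild();
        }

        // Called from the reaper when the transfer child exits; only
        // after this may a new transfer begin.
        void ChildReaped(pid_t pid) {
            if (pid == active_transfer_pid)
                active_transfer_pid = 0;
        }

    private:
        pid_t ForkTransferChild() { return 4242; }  // stand-in for fork()
        pid_t active_transfer_pid = 0;  // nonzero while a child is live
    };

    int main() {
        FileTransferSketch ft;
        ft.Download();  // input transfer starts (forks a child)
        // ...input transfer fails, but the child is not reaped yet...
        ft.Download();  // transfer of the failure info: aborts here
    }

[That sequence would also explain the "Daemon exiting before all child processes gone; killing ..." line in the shadow log below.]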

-tj


From: Antonio Delgado Peris <Antonio.Delgado.Peris@xxxxxxx>
Sent: Tuesday, December 9, 2025 1:49 PM
To: John M Knoeller <johnkn@xxxxxxxxxxx>; HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: RE: Download called during active transfer

Thanks for the answer!

 

Well, there are many earlier messages, because this was the Nth attempt, but they all look the same; there are no different error messages. It may well be just a problem accessing the storage, but the error message seemed confusing to me.

 

Now, I just noticed that this particular job eventually succeeded not much later, and if I read the XferStatsLog info correctly, the job had about 40,000 input files:

 

12/09/25 15:29:58 (pid:3543472) (13437533.0) (3543468): File Transfer Upload: JobId: 13437533.0 files: 40542 bytes: 277927438 seconds: 43.09 dest: 188.185.196.91 rto: 201000 ato: 40000 snd_mss: 1448 rcv_mss: 755 unacked: 0 sacked: 0 lost: 0 retrans: 0 fackets: 0 pmtu: 1500 rcv_ssthresh: 31820 rtt: 324 snd_ssthresh: 136 snd_cwnd: 195 advmss: 1448 reordering: 68 rcv_rtt: 0 rcv_space: 14600 total_retrans: 55

 

[...]

 

12/09/25 15:40:07 (pid:3545504) (13437533.0) (3543468): File Transfer Download: JobId: 13437533.0 files: 9 bytes: 133301 seconds: 0.04 dest: 188.185.196.91 rto: 201000 ato: 40000 snd_mss: 1448 rcv_mss: 1448 unacked: 1 sacked: 0 lost: 0 retrans: 0 fackets: 0 pmtu: 1500 rcv_ssthresh: 66490 rtt: 225 snd_ssthresh: 2147483647 snd_cwnd: 10 advmss: 1448 reordering: 3 rcv_rtt: 300 rcv_space: 72117 total_retrans: 0
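
[Editor's note: as an aside, these stats lines are plain "key: value" pairs, so pulling numbers out of them is easy. A throwaway sketch, assuming only the layout shown above; this is not an HTCondor tool:]

    #include <iostream>
    #include <map>
    #include <sstream>
    #include <string>

    int main() {
        // The upload line from above, trimmed to its key/value part.
        std::string line =
            "JobId: 13437533.0 files: 40542 bytes: 277927438 seconds: 43.09";
        std::map<std::string, std::string> kv;
        std::istringstream in(line);
        std::string key, value;
        while (in >> key >> value)          // tokens alternate "key:" "value"
            kv[key.substr(0, key.size() - 1)] = value;

        double files = std::stod(kv["files"]);
        double bytes = std::stod(kv["bytes"]);
        double secs  = std::stod(kv["seconds"]);
        std::cout << files / secs << " files/s, "
                  << bytes / secs / 1e6 << " MB/s\n";
    }

[For the upload above this gives roughly 940 files/s and about 6.4 MB/s, with an average file size under 7 KB, which suggests the transfer time is dominated by per-file overhead rather than bandwidth.]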

 

 

There were also many such jobs from the same user running at the same time, and they probably access similar or even the same files. That volume of file transfers is probably a good candidate for one failure or another...

 

Cheers,

   Antonio

 

 

From: John M Knoeller <johnkn@xxxxxxxxxxx>
Sent: Tuesday, December 9, 2025 8:29 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: Antonio Delgado Peris <Antonio.Delgado.Peris@xxxxxxx>
Subject: Re: Download called during active transfer

 

These messages look to me like the consequence of an earlier failure.  Is there no indication of a problem in either log before 15:28:17?

 

-tj

 


From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Antonio Delgado Peris via HTCondor-users <htcondor-users@xxxxxxxxxxx>
Sent: Tuesday, December 9, 2025 12:01 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: Antonio Delgado Peris <Antonio.Delgado.Peris@xxxxxxx>
Subject: [HTCondor-users] Download called during active transfer

 

Hi,

 

We’re running 24.0.7 on the schedds (24.0.3 on the workers), and we’re seeing some user jobs being restarted many times because the shadow exits on an exception like the following:

 

12/09/25 15:28:18 (pid:3543188) (13437533.0) (3543188): ERROR "FileTransfer::Download called during active transfer!" at line 2044 in file /var/lib/condor/execute/slot1/dir_2606725/userdir/build-qRBc1D/BUILD/condor-24.0.7/src/condor_utils/file_transfer.cpp

12/09/25 15:28:18 (pid:3543188) (13437533.0) (3543188): Daemon exiting before all child processes gone; killing 3543192

 

This is matched by the following on the starter:

 

12/09/25 15:28:17 (pid:2029580) File transfer failed (status=0).

12/09/25 15:28:17 (pid:2029580) Failed to transfer files:  reason unknown.

12/09/25 15:28:17 (pid:2029580) Skipping execution of Job 13437533.0 because of setup failure.

12/09/25 15:28:18 (pid:2029580) condor_read(): Socket closed abnormally when trying to read 5 bytes from daemon at <188.185.121.235:9618>, errno=104 Connection reset by peer

12/09/25 15:28:18 (pid:2029580) Failed to receive GoAhead message from 188.185.121.235.

12/09/25 15:28:18 (pid:2029580) DoUpload: exiting at 5220

 

The jobs are retried, and eventually they either succeed or hit the maximum number of retries and are put on hold.

 

I’ve been seeing this recently for particular schedds and users. I don’t know whether it’s just caused by errors when accessing the job’s input files (on AFS), but the error seems to indicate that a file transfer was already active when a new one was initiated, and that this immediately causes the shadow to exit. If that’s really the case, how could it happen? Has anybody seen something similar?

 

Thank you!

 

Cheers,

    Antonio