Re: [HTCondor-users] condor job stays in RUN state after output transfer to remote storage
- Date: Thu, 14 Mar 2024 09:28:22 -0500
- From: Greg Thain <gthain@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] condor job stays in RUN state after output transfer to remote storage
On 3/10/24 11:38, Benoit Roland wrote:
> Dear all,
> I would like to understand why a particular user's job is staying in
> the RUN state after the job's output has been transferred to the
> remote storage.
Hi Ben:
The short (but not useful) answer is that the job stays in the "R"un 
state until the shadow exits, even when the job has called exit(2), and 
file transfer has completed. Usually, the shadow exits pretty quickly 
after file transfer is done, but if something goes wrong, it might hang 
around longer.
> I found the following error message appearing at a constant rate in
> the ShadowLog of the job, well after the output has been retrieved on
> the remote storage:
> ERROR "Error from slot1_1@gridka-2ed723aef7@c01-011-108.gridka.de:
> Repeated attempts to transfer output failed for unknown reasons"
> at line 585 in file
> /tmp/__build/build-3k7WTP/BUILD/condor-23.5.0/src/condor_shadow.V6.1/pseudo_ops.cpp
Do you have access to the StarterLog.slotXXX on the EP? I think that 
will point to the problem.
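If the StarterLog is long, a small filter can help narrow it down. The following is only a rough sketch in Python, not an HTCondor tool: the function names and the keyword patterns are made up for illustration, and it assumes nothing about the log format beyond one message per line. It keeps the lines that mention both a file transfer and some kind of error or failure.

```python
import re

# Hypothetical keyword patterns (assumptions, not HTCondor-defined strings);
# adjust them to whatever the log on the EP actually contains.
TRANSFER_PATTERN = re.compile(r"transfer|upload|download", re.IGNORECASE)
ERROR_PATTERN = re.compile(r"error|fail", re.IGNORECASE)


def find_transfer_errors(lines):
    """Return the log lines that mention both a transfer and an error."""
    return [
        line.rstrip("\n")
        for line in lines
        if TRANSFER_PATTERN.search(line) and ERROR_PATTERN.search(line)
    ]


def scan_log(path):
    """Read a log file and return its transfer-related error lines."""
    # errors="replace" avoids a crash on stray non-UTF-8 bytes in the log.
    with open(path, errors="replace") as handle:
        return find_transfer_errors(handle)
```

On the EP, running `condor_config_val LOG` prints the log directory. The file for the slot in the error above would presumably be StarterLog.slot1_1 in that directory, so something like `scan_log("<log dir>/StarterLog.slot1_1")` would be the call, with the real path filled in.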
-greg