Re: [HTCondor-users] condor job stays in RUN state after output transfer to remote storage
- Date: Thu, 14 Mar 2024 09:28:22 -0500
- From: Greg Thain <gthain@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] condor job stays in RUN state after output transfer to remote storage
On 3/10/24 11:38, Benoit Roland wrote:
> Dear all,
> I would like to understand why a particular user's job is staying in
> the RUN state after the job's output has been transferred to the
> remote storage.
Hi Ben:
The short (but not useful) answer is that the job stays in the "R"un
state until the shadow exits, even after the job has called exit(2) and
file transfer has completed. Usually, the shadow exits pretty quickly
after file transfer is done, but if something goes wrong, it might hang
around longer.
> I found the following error message appearing at a constant rate in
> the ShadowLog of the job, well after the output has been retrieved on
> the remote storage:
> ERROR "Error from slot1_1@gridka-2ed723aef7@c01-011-108.gridka.de:
> Repeated attempts to transfer output failed for unknown reasons"
> at line 585 in file
> /tmp/__build/build-3k7WTP/BUILD/condor-23.5.0/src/condor_shadow.V6.1/pseudo_ops.cpp
Do you have access to the StarterLog.slotXXX on the EP? I think that
will point to the problem.
-greg
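
For anyone following the suggestion above, here is a minimal sketch (not part of the original reply) of one way to work out which StarterLog to open on the EP. It pulls the slot and machine name out of "Error from slot...@...:" lines in a ShadowLog and counts how often each appears. The regular expression is based only on the excerpt quoted in this thread, and the helper names and the "StarterLog.<slot>" file naming are assumptions that may not match every site's configuration.

```python
import re

# Matches the 'Error from <slot>@<machine>:' prefix seen in the quoted
# ShadowLog excerpt. This layout is an assumption taken from that one
# excerpt; other HTCondor versions may word the message differently.
SLOT_RE = re.compile(r"Error from (slot\w+)@([^:\s]+):")


def slots_reporting_errors(shadow_log_lines):
    """Count 'Error from slot@machine' occurrences per (slot, machine)."""
    counts = {}
    for line in shadow_log_lines:
        match = SLOT_RE.search(line)
        if match:
            key = (match.group(1), match.group(2))
            counts[key] = counts.get(key, 0) + 1
    return counts


def starter_log_name(slot):
    """Hypothetical helper: assumes the EP names logs 'StarterLog.<slot>'."""
    return f"StarterLog.{slot}"


if __name__ == "__main__":
    # Sample line copied from the message quoted in this thread.
    sample = [
        'ERROR "Error from slot1_1@gridka-2ed723aef7@c01-011-108.gridka.de: '
        'Repeated attempts to transfer output failed for unknown reasons"',
    ]
    for (slot, machine), n in slots_reporting_errors(sample).items():
        print(f"{machine}: check {starter_log_name(slot)} ({n} error lines)")
```

To scan a real log, the same function can be given an open file object, for example `slots_reporting_errors(open(path))`, where `path` is wherever the shadow log lives on the access point.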