[HTCondor-users] condor job stays in RUN state after output transfer to remote storage
- Date: Sun, 10 Mar 2024 17:38:38 +0100
- From: Benoit Roland <benoit.roland@xxxxxxx>
- Subject: [HTCondor-users] condor job stays in RUN state after output transfer to remote storage
Dear all,
I would like to understand why a particular user's job is staying in
the RUN state after the job's output has been transferred to the
remote storage.
The part of the jdl configuration related to the output transfer is
the following:
use_oauth_services = helmholtz
should_transfer_files = YES
transfer_output_files = results
output_destination = helmholtz+https://dcache-desy-webdav.desy.de:2880//pnfs/desy.de/punch/benoit_roland/test_alexander
where the access token required to interact with the remote storage
is provided by the Helmholtz AAI provider and renewed by the Mytoken
service [1].
When retrieving the job's output locally on the submit node with the
jdl configuration:
use_oauth_services = helmholtz
should_transfer_files = YES
transfer_output_files = results
the issue does not appear, and the job disappears from the queue
after a successful transfer of the output to the submit node.
For some submissions, I used condor_ssh_to_job to access the worker
node on which the job was still running (although the output had
already been transferred to the remote storage).
The behavior was not consistent. On some attempts I got the
message:
Welcome to slot1_1@gridka-52806f20c6@c01-017-126.gridka.de!
Your condor job is running with pid(s) 147490.
and for some others, the message:
slot1_1@gridka-52806f20c6@c01-017-126.gridka.de: Rejecting request,
because the job is finished.
I found the following error message appearing at a constant rate in
the ShadowLog of the job, well after the output had arrived on
the remote storage:
ERROR "Error from slot1_1@gridka-2ed723aef7@c01-011-108.gridka.de:
Repeated attempts to transfer output failed for unknown reasons"
at line 585 in file
/tmp/__build/build-3k7WTP/BUILD/condor-23.5.0/src/condor_shadow.V6.1/pseudo_ops.cpp
This message points to the following EXCEPT call:
l584: //lame: at the time of this writing, EXCEPT does not want const:
l585: EXCEPT("%s", critical_error);
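To check whether the error really recurs at a fixed interval, the ShadowLog can be scanned for it. The following is only a minimal sketch, assuming that each log entry starts with a `MM/DD/YY HH:MM:SS` timestamp; the function name `error_intervals` and the file path in the usage note are hypothetical.

```python
import re
from datetime import datetime

# Assumption: ShadowLog entries start with a timestamp such as "03/10/24 17:38:38".
TIMESTAMP = re.compile(r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2})")
NEEDLE = "Repeated attempts to transfer output failed"


def error_intervals(lines, needle=NEEDLE):
    """Return the seconds elapsed between consecutive occurrences of needle."""
    stamps = []
    current = None  # timestamp of the most recent log entry seen so far
    for line in lines:
        match = TIMESTAMP.match(line)
        if match:
            current = datetime.strptime(match.group(1), "%m/%d/%y %H:%M:%S")
        # The message may sit on a wrapped continuation line without its own
        # timestamp, so it is attributed to the last timestamp seen.
        if needle in line and current is not None:
            stamps.append(current)
    return [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]
```

It could be used as `with open("ShadowLog") as f: print(error_intervals(f))`; a list of nearly identical values would suggest a fixed retry period rather than sporadic failures.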
Could you please give me a hint about where to look more closely
in order to solve this issue?
Thanks a lot in advance for your help!
Cheers,
ben
[1] https://github.com/benoitroland/C4P-HTCondor/tree/devel_rhel8