
[HTCondor-users] condor job stays in RUN state after output transfer to remote storage



Dear all,

I would like to understand why a particular user's job is staying in the RUN state after the job's output has been transferred to the remote storage.

The part of the jdl configuration related to the output transfer is the following:

use_oauth_services = helmholtz
should_transfer_files = YES
transfer_output_files = results
output_destination = helmholtz+https://dcache-desy-webdav.desy.de:2880//pnfs/desy.de/punch/benoit_roland/test_alexander

where the access token required to interact with the remote storage is provided by the Helmholtz AAI provider and renewed by the Mytoken service [1].

When retrieving the job's output locally on the submit node with the jdl configuration:

use_oauth_services = helmholtz
should_transfer_files = YES
transfer_output_files = results

the issue does not appear, and the job disappears from the queue after the output has been successfully transferred to the submit node.

For some submissions, I tried to access the worker node on which the job is still running (although the output has already been transferred to the remote storage) via condor_ssh_to_job.

The behavior was not consistent. For some attempts I got the message:

Welcome to slot1_1@gridka-52806f20c6@c01-017-126.gridka.de!
Your condor job is running with pid(s) 147490.

and for some others, the message:

slot1_1@gridka-52806f20c6@c01-017-126.gridka.de: Rejecting request, because the job is finished.

I found the following error message appearing at a constant rate in the ShadowLog of the job, well after the output has been transferred to the remote storage:

ERROR "Error from slot1_1@gridka-2ed723aef7@c01-011-108.gridka.de: Repeated attempts to transfer output failed for unknown reasons"
at line 585 in file /tmp/__build/build-3k7WTP/BUILD/condor-23.5.0/src/condor_shadow.V6.1/pseudo_ops.cpp

This message points to the following exception in pseudo_ops.cpp:
l584: //lame: at the time of this writing, EXCEPT does not want const:
l585: EXCEPT("%s", critical_error);
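To quantify the rate at which this error recurs in the ShadowLog, a small script along the following lines could be used. This is only a minimal sketch: it assumes that each log line starts with the default HTCondor prefix "MM/DD/YY HH:MM:SS (cluster.proc)", and the helper names error_times and intervals_seconds are hypothetical. The path to the ShadowLog is passed as the first argument.

```python
import re
import sys
from datetime import datetime

# Assumption: log lines use the default prefix
# "MM/DD/YY HH:MM:SS (cluster.proc) (pid): message".
LINE_RE = re.compile(r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \((\d+\.\d+)\)")
NEEDLE = "Repeated attempts to transfer output failed"


def error_times(lines, needle=NEEDLE):
    """Map each job id to the timestamps of the lines containing `needle`."""
    hits = {}
    for line in lines:
        if needle not in line:
            continue
        match = LINE_RE.match(line)
        if match is None:
            # Line does not carry the assumed prefix; skip it.
            continue
        stamp = datetime.strptime(match.group(1), "%m/%d/%y %H:%M:%S")
        hits.setdefault(match.group(2), []).append(stamp)
    return hits


def intervals_seconds(stamps):
    """Return the gaps, in seconds, between consecutive timestamps."""
    ordered = sorted(stamps)
    return [(b - a).total_seconds() for a, b in zip(ordered, ordered[1:])]


if __name__ == "__main__":
    # Usage: python shadowlog_rate.py /path/to/ShadowLog
    with open(sys.argv[1], errors="replace") as handle:
        per_job = error_times(handle)
    for job_id, stamps in sorted(per_job.items()):
        print(job_id, len(stamps), intervals_seconds(stamps))
```

Roughly equal intervals for a given job id would confirm that the shadow repeats the same failing step at a fixed period, rather than failing sporadically.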

Could you please give me some hints on where to look more closely in order to solve this issue?
Thanks a lot in advance for your help!

Cheers,
ben

[1] https://github.com/benoitroland/C4P-HTCondor/tree/devel_rhel8