[Condor-users] condor_transfer_data problem on major version switch
- Date: Fri, 26 Oct 2012 10:37:06 +0200
- From: Max Fischer <mfischer@xxxxxxxxxxxxxxxxxxxx>
- Subject: [Condor-users] condor_transfer_data problem on major version switch
Hi all,
we have recently begun testing the remote-submission features in our glidein/condor
pool so that people from our institute can use condor from any
authorised device (laptops, heterogeneous work pools, etc.) without
having to worry about any permanent condor infrastructure there.
Basically we want to provide a drastically cut-down condor installation
via a shared disk that supplies only the commands needed to interface
with the remote daemons. As we are still in the testing phase, we are
using the full condor suite (i.e. all bin, sbin, libraries, etc.) at the
moment, though.
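For illustration, the idea on the user side is roughly the following (host
names and paths below are placeholders, not our real values):

$ export CONDOR_CONFIG=/shared/condor-tools/etc/condor_config
$ export PATH=/shared/condor-tools/bin:$PATH

with a config that only needs to point at the remote pool, along the lines of

  CONDOR_HOST = central-manager.example.org
  SCHEDD_HOST = schedd.example.org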
Now, while submitting (condor_submit -remote <remote schedd> <jdl>) and
managing jobs (condor_rm, condor_q, ...) work fine, we see a strange
bug with file transfer when our resources/glideins are running 7.6.X
(tested with 7.6.10 and 7.6.7) while the user-side condor package is 7.8.X.
When trying to transfer the output back from our dedicated schedd,
condor_transfer_data requests transfer of the "_condor_stderr" and
"_condor_stdout" files, which do not exist, and exits with an error [1].
As a result only the first job's data is fetched (the process exits
afterwards), and the job is left alive in both the queue and the spool,
slowly polluting our schedd node with leftovers unless cleaned up manually.
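For reference, the user-side workflow is essentially (schedd name and job id
below are placeholders):

$ condor_submit -remote <remote schedd> <jdl>        # spools input, returns e.g. cluster 391
$ condor_transfer_data -name <remote schedd> 391.0   # fetch output from the schedd spool
$ condor_rm -name <remote schedd> 391.0              # clean up once the output is back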
As far as I understand, these files are stand-ins on the remote
schedd/workers for the actual Out and Err files (i.e. "_condor_stderr"
would get remapped to "path/to/$(Cluster).$(Process).err" after file
transfer to the user), yet it appears that both the worker->schedd AND
the schedd->user transfer attempt to map them back (thus failing on the
second iteration). On the schedd, the files are already stored as
"/spool/<cluster.process folder>/$(Cluster).$(Process).err".
Bottom line is, condor_transfer_data worked ONLY if both the user AND
the glideins/workers were running the same (major) version (tested
with 7.6.10 and 7.8.4). Seeing how all the other condor functions we used
worked flawlessly even across major versions, we are not certain whether
the version mismatch is the actual cause or whether there is another
reason; the condor changelog does not mention any change to the
transfer_data mechanism.
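For completeness, we compared the versions simply via (exact option names
from memory):

$ condor_version                                           # user-side tools
$ condor_status -schedd <remote schedd> -long | grep CondorVersion
$ condor_status -long | grep CondorVersion                 # glidein startds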
Our setup makes it very likely that we will have workers/resources
running different condor major versions at the same time. It would
therefore be very helpful to know whether we have to prepare remote
submit packages matching every version in use, or whether we have some
leeway there, especially with a view to keeping the workflow smooth for
the users.
Best regards,
Max
[1] $ condor_transfer_data -name <remote schedd> 391.0
DCSchedd::receiveJobSandbox:7003:File transfer failed for target job
391.0: SCHEDD at 129.13.133.37 failed to send file(s) to
<129.13.133.12:60262>: error reading from
/data/srv/condor/current/condor_local/spool/391/0/cluster391.proc0.subproc0/391.0.pin.py.stderr:
(errno 2) No such file or directory; TOOL failed to receive file(s) from
<129.13.133.37:9615>
AUTHENTICATE:1004:Failed to authenticate using FS
ERROR: Failed to spool job files.