My current setup is a remote machine submitting jobs
to a central manager, which sends the jobs to worker
nodes. Recently the remote machine was upgraded from
condor 8.0.6 to condor version 8.1.5. With version 8.1.5
the jobs submitted by the remote machine show up on the
central manager as held for a few seconds, for
example:
113423.0   apf   5/23 20:42   Spooling input data files
113423.1   apf   5/23 20:42   Spooling input data files
113423.2   apf   5/23 20:42   Spooling input data files
113423.3   apf   5/23 20:42   Spooling input data files
113423.4   apf   5/23 20:42   Spooling input data files
113423.5   apf   5/23 20:42   Spooling input data files
After a few seconds the jobs are removed. I can see
corresponding error messages on the remote submitter:
DCSchedd::spoolJobFiles:7002:File transfer failed for target job 113423.0:
Failed to receive GoAhead message from <central manager's IP>.
The central manager is running condor version
8.0.3. Is there a configuration variable hidden
somewhere that may be causing this issue? Is this
something that an upgrade to a later stable condor
version (on the side of the central manager) would
likely solve?
Best Regards,
-Frank
--
----------
Frank Berghaus
University of Victoria
Research Associate
Physics & Astronomy
UVic Phone: +1 (250) 721-7741
UVic Office: Elliot 212