I am running Condor on a private network with one machine that acts as submit/master/scheduler for a number of worker machines. All the machines are CentOS7.
I am having a problem understanding/troubleshooting some Condor permission-related issues that can be briefly described as follows:
- In the course of my Condor job - which is a parameter estimation process that runs the same simulation several times on each Condor worker - I need to periodically (i) evaluate the output of each of the Condor workers and (ii) as a result of this evaluation update an input file on each of the Condor workers.
- The way that I am executing this is (i) each of the Condor workers periodically pushes its most recent output file back to a designated folder on the submit node for collection and evaluation, and (ii) each Condor worker, before performing another simulation, pulls the updated contents of that submit-node folder to its working directory.
- I have tried these during-job file transfers with both rsync and scp (in both cases calling them using the Python subprocess module):
- The worker machines can successfully use scp to send files back to the submit machine.
- However, I cannot use scp to pull files from submit machine to the worker while the job is running; this fails with a 'permission denied' error.
- I can sometimes use rsync to pull files from the submit machine to the worker, but sometimes it fails with a connection timeout.
- Rsync seems to work more reliably on a virtual cluster of Linux machines (set up on a single desktop using VirtualBox) than it does on the identically configured network of physical machines with cable connections.
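For reference, a minimal sketch of how such a pull can be invoked from Python so that the actual error text is captured rather than lost. The function names, host name, and directory paths here are placeholders for illustration, not my real ones. BatchMode=yes makes ssh fail immediately instead of waiting for a password prompt, and ConnectTimeout bounds the connection attempt.

```python
import subprocess


def build_pull_command(remote_host, remote_dir, local_dir, connect_timeout=10):
    """Build an rsync-over-ssh command as an argument list (no shell involved).

    remote_host, remote_dir and local_dir are hypothetical placeholders.
    """
    # BatchMode=yes: never prompt for a password, fail instead.
    # ConnectTimeout: give up on the TCP connection after N seconds.
    ssh_opts = f"ssh -o BatchMode=yes -o ConnectTimeout={connect_timeout}"
    return [
        "rsync", "-a", "-e", ssh_opts,
        f"{remote_host}:{remote_dir}/",
        f"{local_dir}/",
    ]


def run_transfer(cmd, timeout=60):
    """Run one transfer command and return (ok, message).

    message holds stderr from the command, or a short description
    if the command timed out or could not be started at all.
    """
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return False, f"timed out after {timeout}s"
    except FileNotFoundError as exc:
        return False, str(exc)
    return result.returncode == 0, result.stderr.strip()
```

Usage would look something like run_transfer(build_pull_command("submit-node", "/path/to/shared", "inputs")), logging the returned message whenever ok is False.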
To clarify: each Condor job consists of several sequential simulations (i.e., NOT one Condor job = one simulation), with each simulation requiring an updated input file that depends on the simulations previously performed on all Condor workers. So I cannot package all the required input files and send them to each worker at condor_submit time. Neither can I wait for a Condor job to finish before collecting the output.
Some additional information: I have installed ssh keys across the network such that each of the worker machines can ssh to the submit machine without a password, and vice versa. In addition, I am running Condor jobs as the user and can successfully use condor_ssh_to_job and manually perform the file transfers between the worker and the submit machine using rsync or scp. But I of course need to automate this.
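Since the manual transfers work but the automated ones do not, one thing that could be compared is the environment the job actually sees versus the interactive session. Below is a small diagnostic sketch that could be called from inside the job and again from the interactive shell. The function name is made up for illustration, and the default key file name (id_rsa) is an assumption that may not match the real key.

```python
import os


def describe_job_environment(key_path=None):
    """Collect the details ssh/scp/rsync depend on, as seen by this process.

    key_path defaults to ~/.ssh/id_rsa, which is an assumed file name.
    """
    if key_path is None:
        key_path = os.path.join(os.path.expanduser("~"), ".ssh", "id_rsa")
    return {
        "uid": os.getuid(),                     # numeric user the job runs as
        "USER": os.environ.get("USER"),         # None if the variable is unset
        "HOME": os.environ.get("HOME"),         # None if the variable is unset
        "cwd": os.getcwd(),                     # job working directory
        "key_path": key_path,
        "key_readable": os.access(key_path, os.R_OK),
    }
```

Writing this dictionary to the job's output in both situations would show whether the user, HOME, or key visibility differs between them.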
Happy to provide more information. Thanks for any help.
Wes Zell