Dear all,
I am trying to run thousands of jobs on a large Condor grid with a single
network storage device. We noticed that as the number of jobs increases, the
system's performance drops. We discovered that the network drive Condor is
trying to copy the files to was overwhelmed by the number of simultaneous
connections, and when the device was busy the job was dropped and restarted
somewhere else (we are using the vanilla universe under Windows 7).
I am trying to call robocopy from my Fortran simulation (an .exe that needs to
run on Condor) via a system call, so that the files are sent to the storage
space this way instead. However, this does not appear to work on the Condor
nodes. I did various checks and it works fine on physical machines.
Any ideas?
Cheers
Antonis