Hello,
We have Condor installed on campus (around 3500 machines
available) and I am trying to submit around 3000 jobs per
instance. I have installed Condor on my office machine, which
acts as the server that manages the submissions and
orchestrates the whole thing. The problem is
that my hard disk is not fast enough to keep track of more
than 400-500 machines (I checked the disk queue length while
Condor was running and it was rather large). We have a
network storage system which is extremely fast. I was
wondering how I can store the “spool” directory that keeps the
checkpoints for every job in my network space instead of on my
local machine. I have benchmarked the network storage
location and it is fast enough to do the job. I just don’t
know how to make my machine use the network storage for
checkpoints instead of the local disk.
I have seen the “checkpoint server” option, but I am not
sure whether there is a simpler way to do this.
Any ideas?
Thanks