Hello,
We have Condor installed on campus (around 3500 machines available), and
I am trying to submit around 3000 jobs per instance. I have installed Condor on
my office machine, which acts as the server that handles job submission and
orchestrates the whole thing. The problem is that my hard disk is not fast
enough to keep track of more than 400-500 machines (I checked the disk
queue length while Condor was running and it is rather large). We have a network
storage system which is extremely fast. I was wondering how I can store the
“spool” directory that keeps the checkpoints for every job in my network space
instead of on my local machine. I have benchmarked the network storage location and
it is fast enough to do the job. The problem is that I don’t know how to make my
machine use the network storage for checkpoints instead of the local disk in my
computer.
I have seen the “checkpoint server” option, but I am not sure whether there is
a simpler way to do this.
Any ideas?
Thanks