Re: [HTCondor-users] Hardening against NFS failure
- Date: Mon, 27 Feb 2017 17:42:27 +0000
- From: Stephen Jones <sjones@xxxxxxxxxxxxxxxx>
- Subject: Re: [HTCondor-users] Hardening against NFS failure
Hi Justin,
On 02/27/2017 04:36 PM, Justin Fisher wrote:
> The input file is on an NFS share and points to thousands of files on
> the same NFS share. The output is written to another directory on the
> same NFS share.
> I find that one of my machines is flaky and the NFS keeps dropping out.
Let's see if I have this straight. You have a job running which reads 
data from file(s) on an NFS share. NFS is flaky and quits, so
a) the jobs that are running can't read or write and crash out.
b) the jobs that are queued get run, and they can't even start to read 
or write.
Is that it?
> Is there a way to divert these files onto a machine that is alive?
It doesn't sound like HTCondor is doing anything wrong; just NFS. I 
don't know what you mean by "divert these files". Do you mean the files 
read by the job, the files written by the job or both? And do you mean 
that the job should look elsewhere for its data (or to write data) if it 
fails to find (or write) data on the original NFS share? If so, this is 
a "job level" action. For a job that is just starting to run, it could 
sense whether the file it expects is available. If it is not, the job 
could look in another place to see if it is there. That would mean some 
kind of change to the logic of the job, or perhaps a wrapper around the 
job. For jobs that are already running: that's harder. If the data is 
snatched from beneath a running job when NFS fails, then the results are 
rather unpredictable.
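
To make the wrapper idea concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the directory names in CANDIDATE_DIRS and the wrapper's argument layout are hypothetical, and it assumes a second copy of the input actually exists somewhere reachable. Note also that on a hard-mounted NFS share an open() may block rather than fail, so this check is most useful with soft mounts or when the share is simply not mounted.

```python
import os
import subprocess
import sys

# Hypothetical locations of the same input data; replace with real paths.
CANDIDATE_DIRS = ["/nfs/primary/data", "/nfs/backup/data"]


def first_readable(filename, dirs):
    """Return the first path under dirs where filename can be read, else None."""
    for d in dirs:
        path = os.path.join(d, filename)
        try:
            with open(path, "rb") as fh:
                fh.read(1)  # force a real read, not just a directory lookup
            return path
        except OSError:
            continue  # missing file, stale handle, I/O error: try the next place
    return None


def main(argv):
    """argv layout (an assumption): wrapper.py INPUT_NAME COMMAND [ARGS...]"""
    if len(argv) < 3:
        print("usage: wrapper.py INPUT_NAME COMMAND [ARGS...]", file=sys.stderr)
        return 2
    input_path = first_readable(argv[1], CANDIDATE_DIRS)
    if input_path is None:
        print("no readable copy of %s found" % argv[1], file=sys.stderr)
        return 1  # non-zero exit so the failure is visible to the batch system
    # Run the real job, passing the located input path as its last argument.
    return subprocess.call(argv[2:] + [input_path])

# In a real wrapper script, the entry point would call: sys.exit(main(sys.argv))
```

The job's executable in the submit file would then be the wrapper rather than the real program. This only helps jobs that are just starting; it does nothing for a job that loses NFS midway through.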
> Obviously, I need to find out why this machine keeps crashing NFS, but
> I'm wondering if there is a workaround while I do this?
Many years ago, on another batch system, I saw that NFS "locked up" 
after running busy jobs for a long time. The answer then was to drain 
the nodes every few days, and reboot them. As long as we didn't exceed 
(say) three days uptime for a node, it would not break. We'd have to do 
that continuously for weeks to get the jobs through. Anything to keep 
the show on the road. There was a bloke called Trond Myklebust who did
a lot to try to make NFS better, but I don't know if he ever made it 
absolutely bomb proof.
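
For what it's worth, the "drain before three days" rule is easy to check mechanically. A small sketch, assuming a Linux node where /proc/uptime is available; the three-day threshold is just the "(say)" figure above, and the function names are made up for illustration. How you then drain and reboot the node is up to your batch system and is not shown here.

```python
MAX_UPTIME_DAYS = 3  # assumed threshold, taken from the "(say) three days" above
SECONDS_PER_DAY = 86400


def read_uptime_seconds(path="/proc/uptime"):
    """Return node uptime in seconds (Linux only: first field of /proc/uptime)."""
    with open(path) as fh:
        return float(fh.read().split()[0])


def due_for_drain(uptime_seconds, max_days=MAX_UPTIME_DAYS):
    """True once the node has been up for max_days or longer."""
    return uptime_seconds >= max_days * SECONDS_PER_DAY
```

A cron job on each node could call due_for_drain(read_uptime_seconds()) and flag the node for draining when it returns True.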
Cheers,
Ste
--
Steve Jones                             sjones@xxxxxxxxxxxxxxxx
Grid System Administrator               office: 220
High Energy Physics Division            tel (int): 43396
Oliver Lodge Laboratory                 tel (ext): +44 (0)151 794 3396
University of Liverpool                 http://www.liv.ac.uk/physics/hep/