Re: [Condor-users] New NFS warning with condor 6.8.1



Steve--

If memory serves, Fermilab has a better-than-average NFS setup, right? The warning applies mostly to simpler NFS setups, which we have seen fail all too frequently.

-alain

At 02:54 PM 9/28/2006 -0500, Nick LeRoy wrote:
On Thu September 28 2006 2:38 pm, Steven Timm wrote:
> I put condor 6.8.1 on my first few test nodes and submitted the same
> test vanilla universe job that I always do for testing.
>
> [timm@fnpcg ~]$ condor_submit recon1_1.run
> Submitting job(s)
> WARNING: Log file /home/timm/recon1.log.47070.0 is on NFS.
> This could cause log file corruption and is _not_ recommended.
> .
> Logging submit event(s).
> 1 job(s) submitted to cluster 47070.
>
>
> The log file in question is indeed on nfs, but it has been on nfs
> throughout the whole life of my cluster and I don't see why we
> are just now getting warnings about this.  There haven't been problems
> up until now.

This isn't a new problem, just a new warning about an old problem.

File locking on NFS is inherently unreliable.  We've seen enough cases of
NFS-based job logs getting corrupted (from multiple processes updating the log
file) that we decided to add the warning.  I suspect that the risk of such
corruption is reduced, possibly even eliminated, if all writers are on the
same machine, but I don't know for certain.  In particular, corrupted job logs
tend to make DAGMan very unhappy.

Ultimately, we'd like to implement a more advanced locking mechanism (using a
separate lock file), but we haven't had time to add this yet.
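
For illustration only, a separate-lock-file scheme could look roughly like the
sketch below.  This is not Condor's implementation, just a minimal Python
example of a commonly cited NFS-tolerant approach: create a uniquely named
temporary file, hard-link it to the lock name, and trust the link count rather
than the return value of link().  The names lock_file and append_event, the
".lock" suffix, and the timeout values are all hypothetical, and the sketch
makes no attempt to clean up stale locks left by a crashed writer.

```python
import os
import socket
import time
from contextlib import contextmanager


@contextmanager
def lock_file(log_path, timeout=10.0, poll=0.1):
    """Hold an advisory lock on log_path via a sibling '<log_path>.lock' file."""
    lock_path = log_path + ".lock"
    # Hypothetical naming scheme: unique per host and process.
    tmp_path = "%s.%s.%d" % (lock_path, socket.gethostname(), os.getpid())
    deadline = time.monotonic() + timeout

    # Private temp file that will be hard-linked to the shared lock name.
    fd = os.open(tmp_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    os.close(fd)
    try:
        while True:
            try:
                os.link(tmp_path, lock_path)
            except OSError:
                pass  # The result of link() is not trusted; check the link count.
            if os.stat(tmp_path).st_nlink == 2:
                break  # Our temp file is also the lock file: lock acquired.
            if time.monotonic() >= deadline:
                raise TimeoutError("could not lock %s" % log_path)
            time.sleep(poll)
        try:
            yield lock_path
        finally:
            os.unlink(lock_path)  # Release the lock.
    finally:
        os.unlink(tmp_path)  # Always remove the private temp file.


def append_event(log_path, text):
    """Append one event to the log while holding the lock."""
    with lock_file(log_path):
        with open(log_path, "a") as log:
            log.write(text)
            log.flush()
            os.fsync(log.fileno())
```

Every writer would have to go through the same lock file for this to help; a
writer that appends to the log directly bypasses the protection entirely.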

-Nick