Ian,
This is a well-known issue with NFS file locking -- it's consistently
unreliable, and as a result we simply can't support DAGMan's use when
your userlogs are written to NFS.
The good news is that in 99% of cases, it's easy to specify that they
be written to a local directory instead (even if all your job i/o is
being done via NFS -- the userlogs are written on the submit side), and
when you do, the problem will go away.
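As a rough sketch of what that change might look like in the submit description file: the `log` command is the standard way to name the userlog, but the local directory below is only an assumed example, so substitute any non-NFS path that exists on your submit host.

```
# Hypothetical fragment of a submit description file.
# /var/tmp/condor_logs is an assumed local (non-NFS) directory on the
# submit host; the rest of the job's files can stay on NFS.
log = /var/tmp/condor_logs/mdr.log
```

Everything else in the submit file can stay as it is; only the userlog location needs to move.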
Let us know if that doesn't solve things for you. Thanks,
-Peter
On Feb 4, 2005, at 6:20 AM, Dr Ian C. Smith wrote:
Dear All,
We've recently been using DAGMan to get long-running
jobs working on our pool using the DAG recursion idea.
The submit host is a Solaris 9 box and all of the
execution PCs are Win XP/Intel. While the majority
of jobs work fine and run to completion, occasionally
some die. This error message appears in file.dagman.out:
2/4 02:33:39 Event: ULOG_EXECUTE for Condor Job A (13506.0.0)
2/4 02:33:49 Event: ULOG_IMAGE_SIZE for Condor Job A (13506.0.0)
2/4 02:53:47 Event: ULOG_IMAGE_SIZE for Condor Job A (13506.0.0)
2/4 03:13:49 Event: ULOG_IMAGE_SIZE for Condor Job A (13506.0.0)
2/4 03:33:47 Event: ULOG_IMAGE_SIZE for Condor Job A (13506.0.0)
2/4 04:33:56 read error on log /ffs/mat_alanca/condor/jobs/cl1/cmi600/mdr.log
2/4 04:33:56 ERROR: failure to read job log
        A log event may be corrupt. DAGMan will skip the event and try to
        continue, but information may have been lost. If DAGMan exits
        unfinished, but reports no failed jobs, re-submit the rescue file
        to complete the DAG
The log files are stored on an NFS-mounted filesystem, which I suppose
could cause problems, but I can't understand why this would affect some
jobs and not others running concurrently. The actual DAGMan process
still seems to be running happily on the submit host.
As a workaround, can Condor be set up to resubmit the rescue DAG
automatically?
yours perplexed,
-ian.
-----------------------------------
Dr Ian C. Smith,
e-Science team,
University of Liverpool,
Computing Services Department.
--
Peter Couvares University of Wisconsin-Madison
Condor Project Research Department of Computer Sciences
pfc@xxxxxxxxxxx 1210 W. Dayton St. Rm #4241
(608) 265-8936 Madison, WI 53706-1685