On Wed, 28 Oct 2009, Steve Shaw wrote:
> Thanks for the quick response Kent,
>
> I tried the 7.2.4 release and sent 1000 python jobs in a single
> no-dependency dag. Each job just created a file and exited. After
> doing a condor_submit_dag on my created dag, I got 509 files back
> and then it looks like my dag job got stuck and started idling
> (with the 7.0.4 build, I could swear that it remained 'running' but
> still had the same behavior). Looking at the lib.err file for the
> dag, it had the error:
>
>   dprintf() had a fatal error in pid 8620
>   Can't open "bigjob.dag.dagman.out"
>   errno: 24 (Too many open files)

Okay, that explains your problems...
I assume that your node jobs are using a lot of different log files. In
all DAGMans prior to 7.3.2, all of the log files are open all of the
time. In 7.3.2, the log file reading code was changed to only have a log
file open when a job that logs to that file is in the queue. However,
7.3.2 has a bug in how the log file code deals with rescue DAGs. This is
fixed in 7.4.0, so a 7.4.0 DAGMan would fix all of your problems.
Unfortunately, 7.4.0 hasn't been released yet.

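As a rough way to check whether a DAG is likely to hit this, the sketch below counts the distinct log files named in a set of node submit files and compares that count to the per-process open-file limit. It assumes each submit file names its log with a plain "log = <path>" line; the function names are made up for illustration, and the comparison is only approximate, since DAGMan also holds other files open.

```python
import re
import resource  # Unix-only standard library module

# Matches a submit-file line such as "log = job_17.log" (key is case-insensitive).
LOG_LINE = re.compile(r"^\s*log\s*=\s*(\S.*?)\s*$", re.IGNORECASE)


def distinct_log_files(submit_texts):
    """Return the set of log paths named across the given submit-file contents."""
    logs = set()
    for text in submit_texts:
        for line in text.splitlines():
            match = LOG_LINE.match(line)
            if match:
                logs.add(match.group(1))
    return logs


def exceeds_fd_limit(num_logs, soft_limit=None):
    """Return True if num_logs alone reaches the open-file soft limit.

    If soft_limit is not given, it is read from RLIMIT_NOFILE for the
    current process. This is a rough check: other open files are ignored.
    """
    if soft_limit is None:
        soft_limit, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
        if soft_limit == resource.RLIM_INFINITY:
            return False
    return num_logs >= soft_limit
```

With 1000 node jobs that each write their own log, distinct_log_files() would report 1000 paths, which is the situation that keeps that many files open at once in a pre-7.3.2 DAGMan.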
So the workaround would be to change your node jobs to use a smaller
"set" of log files. (In fact, performance-wise the best thing is for all
node jobs to use the same log file.) If that's really hard on your end,
I guess we could send 7.4.0 pre-release DAGMan binaries, if you tell us
what architecture/OS you need.

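The workaround above can be sketched as a small generator that writes one submit file per node but points every node at the same log file. This is only an illustration: the executable name make_file.py, the node_N naming, and the shared log name bigjob.log are assumptions, and bigjob.dag is guessed from the error message quoted above.

```python
from pathlib import Path


def make_submit_text(index, shared_log):
    """Build the contents of one node's submit file, logging to shared_log."""
    return "\n".join([
        "universe   = vanilla",
        "executable = make_file.py",      # hypothetical node job script
        f"arguments  = {index}",
        f"output     = node_{index}.out",
        f"error      = node_{index}.err",
        f"log        = {shared_log}",     # same log file for every node
        "queue",
    ]) + "\n"


def make_dag_files(num_jobs, shared_log="bigjob.log"):
    """Return {filename: contents} for a no-dependency DAG of num_jobs nodes."""
    files = {}
    dag_lines = []
    for i in range(num_jobs):
        submit_name = f"node_{i}.sub"
        files[submit_name] = make_submit_text(i, shared_log)
        dag_lines.append(f"JOB node_{i} {submit_name}")
    files["bigjob.dag"] = "\n".join(dag_lines) + "\n"
    return files


def write_files(files, directory):
    """Write the generated files into directory, creating it if needed."""
    target = Path(directory)
    target.mkdir(parents=True, exist_ok=True)
    for name, contents in files.items():
        (target / name).write_text(contents)
```

Calling write_files(make_dag_files(1000), "bigjob") and then running condor_submit_dag on bigjob.dag in that directory would give 1000 nodes that all log to one file, so DAGMan only has a single user log to keep open.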
(One general DAGMan note here -- in 7.3.2 and later versions of DAGMan,
you don't have to specify a log file in your node job submit files. If
no log file is specified, DAGMan will automatically plug in a default
log file. We think this will probably be the preferred way to do things,
since you'll automatically get a single log file, and you won't have to
worry about "interference" if you use the same submit file in more than
one DAG.)

Kent Wenger
Condor Team