
Re: [Condor-users] Condor DAG spinning



You can also raise the open-file limit (ulimit -n) on your server, or limit the number of jobs DAGMan keeps in the queue. I do both on my system: I raised the limit and set

DAGMAN_MAX_JOBS_IDLE = # of cores in condor cluster
DAGMAN_MAX_JOBS_SUBMITTED = 2 x # of cores in condor cluster

This keeps the queue to a reasonable size and limits the number of open file descriptors.
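
As a rough illustration of those two settings, here is a minimal Python sketch that prints the current open-file limits and the throttle lines for a given core count. The function names and the 64-core figure are made up for the example; only the two DAGMAN_* macro names come from the settings above, and the resource module is Unix-only.

```python
import resource


def dagman_throttles(total_cores):
    """Return the two DAGMan throttle settings as config-file lines.

    Follows the rule of thumb above: idle jobs capped at the core count,
    submitted jobs capped at twice the core count.
    """
    if total_cores < 1:
        raise ValueError("total_cores must be positive")
    return [
        f"DAGMAN_MAX_JOBS_IDLE = {total_cores}",
        f"DAGMAN_MAX_JOBS_SUBMITTED = {2 * total_cores}",
    ]


def open_file_limits():
    """Return the (soft, hard) open-file-descriptor limits of this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    soft, hard = open_file_limits()
    print(f"open files: soft={soft} hard={hard}")
    # 64 is a hypothetical pool size; substitute your own core count.
    for line in dagman_throttles(64):
        print(line)
```

The printed lines would go into the Condor configuration on the submit machine. The limits reported are those of the shell running the script, which may differ from what the DAGMan process itself sees.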

Sam

On Oct 28, 2009, at 1:07 PM, R. Kent Wenger wrote:

On Wed, 28 Oct 2009, Steve Shaw wrote:

Thanks for the quick response Kent,

I tried the 7.2.4 release and submitted 1000 Python jobs in a single DAG with no dependencies. Each job just created a file and exited. After running condor_submit_dag on the DAG, I got 509 files back, and then the DAG job appeared to get stuck and went idle (with the 7.0.4 build, I could swear that it remained 'running', but the behavior was otherwise the same). The lib.err file for the DAG contained this error:

dprintf() had a fatal error in pid 8620
Can't open "bigjob.dag.dagman.out"
errno: 24 (Too many open files)

Okay, that explains your problems...

I assume that your node jobs are using a lot of different log files. In all DAGMan versions prior to 7.3.2, all of the log files are kept open all of the time. In 7.3.2, the log file reading code was changed to only have a log file open when a job that logs to that file is in the queue. However, 7.3.2 has a bug in how the log file code deals with rescue DAGs. This is fixed in 7.4.0, so a 7.4.0 DAGMan would fix all of your problems. Unfortunately, 7.4.0 hasn't been released yet.

So the workaround would be to change your node jobs to use a smaller "set" of log files. (In fact, performance-wise the best thing is for all node jobs to use the same log file.) If that's really hard on your end, I guess we could send 7.4.0 pre-release DAGMan binaries, if you tell us what architecture/OS you need.
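
The shared-log workaround could look roughly like the following Python sketch, which writes one submit file with a single log entry and a DAG whose nodes all point at it. The names write_dag, node.sub, make_file.py, bigjob.log and the jobnum variable are hypothetical; bigjob.dag is only inferred from the error message above, and the submit file is a bare-bones assumption about what the node jobs need.

```python
from pathlib import Path

# One submit file shared by every node, so there is only one log file.
# $(jobnum) is filled in per node by the VARS lines in the DAG file.
SUBMIT_TEMPLATE = """\
universe   = vanilla
executable = {executable}
arguments  = $(jobnum)
log        = {log}
queue
"""


def write_dag(directory, n_jobs, executable="make_file.py", log="bigjob.log"):
    """Write node.sub and bigjob.dag into `directory`; return the DAG path."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)

    submit = directory / "node.sub"
    submit.write_text(SUBMIT_TEMPLATE.format(executable=executable, log=log))

    lines = []
    for i in range(n_jobs):
        # Every node reuses the same submit file and therefore the same log.
        lines.append(f"JOB node{i} {submit.name}")
        lines.append(f'VARS node{i} jobnum="{i}"')

    dag = directory / "bigjob.dag"
    dag.write_text("\n".join(lines) + "\n")
    return dag


if __name__ == "__main__":
    print(write_dag("bigjob", 1000))
```

With this layout, 1000 nodes share one log file rather than each naming its own, so the number of log files DAGMan has to keep open no longer grows with the number of nodes.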

(One general DAGMan note here -- in 7.3.2 and later versions of DAGMan, you don't have to specify a log file in your node job submit files. If no log file is specified, DAGMan will automatically plug in a default log file. We think this will probably be the preferred way to do things, since you'll automatically get a single log file, and you won't have to worry about "interference" if you use the same submit file in more than one DAG.)

Kent Wenger
Condor Team