Re: [Condor-users] DAGMAN memory
- Date: Tue, 10 Jun 2008 10:56:12 -0500 (CDT)
- From: "R. Kent Wenger" <wenger@xxxxxxxxxxx>
- Subject: Re: [Condor-users] DAGMAN memory
On Tue, 10 Jun 2008, Aengus McCullough wrote:
> I have been running large DAGMan job collections comprised of 500 to
> 1500 individual jobs running concurrently. On initial runs of the job
> I noticed that several of these jobs were failing. I have managed to
> resolve the issue by restricting the maximum number of concurrent jobs
> to 80 and setting the maximum number of retries to 3. I understand
> that this issue is a result of DAGMan memory limitations; can anyone
> confirm this? Is this a limitation on the central manager or elsewhere?
> Is there any way to resolve this issue aside from restricting the
> maximum number of concurrent jobs?
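(For context, the throttling and retry settings described above are normally expressed on the condor_submit_dag command line and in the DAG input file. A minimal sketch, assuming a DAG file named jobs.dag and a node named NodeA, might look like this:)

    # Throttle DAGMan to at most 80 node jobs in the queue at once
    condor_submit_dag -maxjobs 80 jobs.dag

    # In jobs.dag: resubmit a failed node up to 3 times before giving up
    JOB NodeA nodeA.sub
    RETRY NodeA 3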
Hmm, I'd be really surprised if this problem was a result of memory
limitations in DAGMan itself -- other users are successfully running
DAGs with several hundred thousand nodes. It could be the result of some
other resource limitation, though.
When you say that jobs are failing, by "job" you mean an individual node
job in the DAG, right? (As opposed to DAGMan itself crashing.) If that
is the case, you need to look at the user log(s) from those jobs, and any
other info you may have (stdout, stderr, etc.). When a job is submitted
by DAGMan, there is *very* little difference between that and just
submitting the job by hand. So the real question is exactly what is
causing the jobs to fail -- once you narrow that down, you can attack
the problem.
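(As a concrete illustration of where to look: a node submit description along the lines of the sketch below, with hypothetical file names, writes the user log and captures stdout/stderr for that node. condor_submit_dag also writes its own debug log, typically <dagfile>.dagman.out, next to the DAG file.)

    # nodeA.sub -- hypothetical node submit description
    executable = nodeA.sh
    log        = nodeA.log    # user log: submit/execute/terminate/abort events
    output     = nodeA.out    # captured stdout
    error      = nodeA.err    # captured stderr
    queue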
Kent Wenger
Condor Team