
Re: [Condor-users] DAGMAN memory



On Tue, 10 Jun 2008, Aengus McCullough wrote:

> I have been running large DAGMAN job collections comprised of 500-1500 individual jobs running concurrently. On initial runs of the job I noticed that several of these jobs were failing. I have managed to resolve the issue by restricting the maximum number of concurrent jobs to 80 and setting the maximum number of retries to 3. I understand that this issue is a result of DAGMAN memory limitations; can any one confirm this? Is this a limitation on the central manager or elsewhere? Is there any way to resolve this issue aside from restricting the maximum number of concurrent jobs?
Hmm, I'd be really surprised if this problem was a result of memory 
limitations in DAGMan itself -- other users are successfully running
DAGs with several hundred thousand nodes.  It could be the result of some 
other resource limitation, though.
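For reference, the throttling and retry settings described in the original question map onto standard DAGMan features. A minimal sketch (node names and file names here are hypothetical):

```
# mydag.dag -- hypothetical DAG file
JOB   NodeA  nodeA.sub
JOB   NodeB  nodeB.sub
# Resubmit a node up to 3 times if its job exits with a failure code
RETRY NodeA  3
RETRY NodeB  3
```

Submitting with `condor_submit_dag -maxjobs 80 mydag.dag` then caps the number of node jobs DAGMan will have in the queue at once, which is the throttle the original poster applied.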
When you say that jobs are failing, by "job" you mean an individual node 
job in the DAG, right?  (As opposed to DAGMan itself crashing.)  If that 
is the case, you need to look at the user log(s) from those jobs, and any 
other info you may have (stdout, stderr, etc.).  When a job is submitted 
by DAGMan, there is *very* little difference between that and just 
submitting the job by hand.  So the issue is exactly what is causing the 
jobs to fail -- once you narrow that down, you can attack the problem.
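To make the failing nodes easier to diagnose, each node's submit file can direct Condor to write the user log and the standard streams to known files. A hedged sketch, with hypothetical file names:

```
# nodeA.sub -- hypothetical submit file for one DAG node
universe   = vanilla
executable = my_program
log        = nodeA.log
output     = nodeA.out
error      = nodeA.err
queue
```

The user log (`nodeA.log`) records submission, execution, and termination events, including the job's exit code; `nodeA.out` and `nodeA.err` capture the job's own stdout/stderr, which is usually where the actual cause of a failure shows up.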
Kent Wenger
Condor Team