
Re: [Condor-users] Too many popen() calls in DAGMan ?



On 9/6/06, Masakatsu Ito <m-ito@xxxxxxxxxxxxxx> wrote:
Dear all,

I'm using DAGMan to perform a set of simulations
with different parameters. DAGMan has worked well
with a small set of simulations, but when I tried
to perform a larger set, it stopped with an error
message in its .dagman.out file, like:

>9/6 00:45:28 Submitting Condor Job f1s5v13t ...
>9/6 00:45:28 submitting: condor_submit  -a 'dag_node_name = f1s5v13t' -a '+DAGManJobID = 17168' -a 'submit_event_notes = DAG Node: f1s5v13t' -a 'currname = frame1' -a 'prevname = frame0' -a 'ndx = group.ndx' -a '+DAGParentNodeNames = "f0s5v13"' SAMPLE5/VDW13/tpbconv.submit 2>&1
>9/6 00:45:28 condor_submit  -a 'dag_node_name = f1s5v13t' -a '+DAGManJobID = 17168' -a 'submit_event_notes = DAG Node: f1s5v13t' -a 'currname = frame1' -a 'prevname = frame0' -a 'ndx = group.ndx' -a '+DAGParentNodeNames = "f0s5v13"' SAMPLE5/VDW13/tpbconv.submit 2>&1: popen() in submit_try failed!
>9/6 00:45:28 ERROR: submit attempt failed
>
>

So I guess my simulations make DAGMan create
too many processes by invoking popen().

I would think it more likely that the processes created by the shadows
for running jobs (guessing you get a lot of the pool sometimes - lucky
you!) are eating up some per-user or per-machine process limit.
What is the maximum process limit on your machine?

This is a guess though. I don't know enough about DAGMan to know if it
causes a lot of process creation internally.
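
A minimal sketch for checking those limits, assuming a Unix submit
machine with Python available. popen() needs both a pipe() and a
fork(), so the per-user process limit and the open-file limit are the
two most likely candidates. The helper names below are made up for
illustration, and the script should be run as the user that owns the
DAGMan process:

```python
import resource


def describe(limit):
    """Render a limit value, showing RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if limit == resource.RLIM_INFINITY else str(limit)


def current_limits():
    """Return (soft, hard) limits for the resources popen() depends on."""
    names = {
        "max user processes": resource.RLIMIT_NPROC,  # fork() fails past this
        "max open files": resource.RLIMIT_NOFILE,     # pipe() fails past this
    }
    return {label: resource.getrlimit(res) for label, res in names.items()}


if __name__ == "__main__":
    for label, (soft, hard) in current_limits().items():
        print(f"{label}: soft={describe(soft)} hard={describe(hard)}")
```

In a bash shell, `ulimit -u` and `ulimit -n` report the same two soft
limits. Comparing them with the number of processes the user has
running at the time of the failure should show whether this guess
holds.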

Could anybody please tell me whether simulations of this
size can exceed the limits of DAGMan? Or can the older version
of DAGMan in Condor 6.7.14 create more processes
than the latest version? (This older version is the one
installed on our system.)

There are plenty of people using DAGMan to submit thousands of jobs
(though they tend to make sure they only have a few hundred jobs in
the queue at any one time for performance).
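
One way to keep the queue that small is DAGMan's own throttle:
condor_submit_dag accepts a -maxjobs option that limits how many node
jobs are submitted at one time. The sketch below only builds the
command line and does not run it. The DAG file name and the limit of
200 are placeholders, the helper function is hypothetical, and it is
worth confirming in the manual for your version (6.7.14) that the
option behaves as described:

```python
import shlex


def build_submit_dag_cmd(dag_file, max_jobs):
    """Build a condor_submit_dag command that caps jobs submitted at once."""
    if max_jobs < 1:
        raise ValueError("max_jobs must be at least 1")
    return ["condor_submit_dag", "-maxjobs", str(max_jobs), dag_file]


if __name__ == "__main__":
    # Placeholder values: substitute your own DAG file and limit.
    cmd = build_submit_dag_cmd("simulations.dag", 200)
    print(shlex.join(cmd))
```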

As to the version, you might want to take a look at the bug fixes listed in
http://www.cs.wisc.edu/condor/manual/v6.7/8_3Development_Release.html
to see if there is anything about DAGMan you should know.

Matt