Re: [Condor-users] Dagman Newbie Questions
- Date: Fri, 10 Oct 2008 16:14:44 +0200
- From: Horvátth Szabolcs <szabolcs@xxxxxxxxxxxxx>
- Subject: Re: [Condor-users] Dagman Newbie Questions
Hi Jeremy,
> Advance warning: I'm a Condor newbie. I’ve been tasked with evaluating
> Condor as a queueing platform for CG & Animation production at our
> facility.
Please accept my most sincere condolences. :) While render management
with Condor is certainly possible (and fun), it requires a different
mind-set compared to using Alfred, Deadline, Rush, etc.
> If I understand what’s going on, the dagman job appears to be simply a
> job that submits other jobs, and the downstream jobs are not even
> submitted until their prerequisite jobs have run. Subsequent runs can
> only be run again with the .rescue file. Is this correct?
Yes. DAGMan is just a simple Condor job that submits child jobs when their
dependencies allow and executes pre/post scripts. To re-run a branch or
even a single job of the DAG you have to resubmit it, either by using the
rescue DAG functionality or by scripting your own command.
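For reference, a minimal DAG might look like this (the node and file names
are made up, but the JOB / PARENT...CHILD / SCRIPT / RETRY keywords are the
standard DAGMan syntax):

    # shot010.dag
    JOB    RibGen  ribgen.sub
    JOB    Render  render.sub
    JOB    Comp    comp.sub
    PARENT RibGen CHILD Render
    PARENT Render CHILD Comp
    SCRIPT POST Render check_frames.sh
    RETRY  Render 2

You submit it with "condor_submit_dag shot010.dag". If a node fails, DAGMan
writes a rescue file next to the original (named shot010.dag.rescue or
shot010.dag.rescue001 depending on the Condor version); submitting that
rescue file, or in newer versions simply resubmitting the original .dag,
continues from where the previous run stopped.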
> This has consequences for us because in our business, deadlines are
> critical and resource utilization must be maximized. So progressive
> estimates of completion and remaining work are necessary. We need all
> nodes of the entire DAG to be present in the system to estimate
> resource use, even though many of the nodes may be blocked waiting for
> prerequisite nodes. Jobs that submit other jobs are a nasty surprise
> for our resource managers.
While I understand why you find it a limitation, it's in fact a great
thing for queue load and schedule balancing. When shot TDs submit a
hundred render layers a night, each consisting of a few hundred jobs, it
could easily choke the scheduler.
If you need information about the whole DAG, you still have multiple
options (see the sketch after this list):
- Add custom attributes to the DAGMan job that store how many tasks are
in the whole DAG by task type (ribgen / mi file gen / render /
composite / whatever).
- Make your job progress window parse the DAG files and build the GUI
based on that data instead of the queried or Quill DB data. It's a simple
way to get the hierarchical information.
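A rough sketch of both, assuming a hypothetical shot010.dag and made-up
attribute names:

    # generate the DAGMan submit description without submitting it
    condor_submit_dag -no_submit shot010.dag

    # edit shot010.dag.condor.sub and add lines like these above its
    # "queue" statement, then submit the DAGMan job yourself:
    #   +TotalRibGenTasks = 40
    #   +TotalRenderTasks = 120
    condor_submit shot010.dag.condor.sub

    # the GUI can read the counts back from the queue, e.g.
    condor_q -format "%d render tasks\n" TotalRenderTasks

    # or skip the queue entirely and count node types in the DAG file,
    # assuming the node names follow a convention like Render_*
    grep -c '^JOB Render' shot010.dag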
> Also re-running of any node and its dependent nodes is fairly common
> and is often done many times during pipeline troubleshooting; we don’t
> want to have to re-submit the entire DAG several times in separate
> runs because there may be long-running nodes in the DAG we want to
> continue in parallel while we’re working on other “broken” nodes.
Again, users usually don't care what happens under the hood if the GUI
shows what they are interested in. You can easily script the submission
of a DAG fragment, and it's up to the GUI to display it as part of the
original job. By adding custom attributes to these partial jobs you can
keep track of what's happening. Or, if you want absolute control, you can
write your own DAGMan replacement that handles job submission and
resubmission however you want.
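For example, re-running one render node and its comp dependent might look
like this (node and file names are hypothetical; in practice you would
generate the fragment from the original DAG rather than type it by hand):

    # rerun_f101.dag -- a hand-made fragment of the original DAG
    JOB    Render_f101  render_f101.sub
    JOB    Comp_f101    comp_f101.sub
    PARENT Render_f101 CHILD Comp_f101

    condor_submit_dag rerun_f101.dag

The rest of the original DAG keeps running in parallel; it is up to your
GUI (or the custom attributes mentioned above) to show the fragment as
part of the same logical job.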
> It looks to me like we’d have a hard time getting Condor/Dagman to
> support these needs. I’d love any advice / comments you might have on
> this.
Solving your needs is just a matter of writing the job submission and job
monitoring scripts yourself. The puzzling complexity of job scheduling is
what makes Condor a tough bird to handle (compared to off-the-shelf render
management software).
It's all just a matter of personal opinion, of course.
Cheers,
Szabolcs