Re: [Condor-users] Dagman Newbie Questions
- Date: Fri, 10 Oct 2008 16:14:44 +0200
- From: Horvátth Szabolcs <szabolcs@xxxxxxxxxxxxx>
- Subject: Re: [Condor-users] Dagman Newbie Questions
Hi Jeremy,
> Advance warning: I'm a Condor newbie. I’ve been tasked with evaluating
> Condor as a queueing platform for CG & Animation production at our
> facility.
Please accept my most sincere condolences. :) While render management
with Condor is certainly possible (and fun), it requires a different
mind-set compared to using Alfred, Deadline, Rush, etc.
> If I understand what’s going on, the dagman job appears to be simply a
> job that submits other jobs, and the downstream jobs are not even
> submitted until their prerequisite jobs have run. Subsequent runs can
> only be run again with the .rescue file. Is this correct?
Yes. DAGMan is just a simple Condor job that submits child jobs when their
dependencies allow and executes pre/post scripts. To re-run a branch or
even a single job of the DAG you have to resubmit it, either by using the
rescue DAG functionality or by scripting your own command.
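For reference, a minimal DAG might look like this (the node and file names
are made up, but the JOB / PARENT...CHILD / SCRIPT / RETRY keywords are the
standard DAGMan syntax):

    # shot010.dag
    JOB    RibGen  ribgen.sub
    JOB    Render  render.sub
    JOB    Comp    comp.sub
    PARENT RibGen CHILD Render
    PARENT Render CHILD Comp
    SCRIPT POST Render check_frames.sh
    RETRY  Render 2

You submit it with "condor_submit_dag shot010.dag". If a node fails, DAGMan
writes a rescue file next to the original (named shot010.dag.rescue or
shot010.dag.rescue001 depending on the Condor version); submitting that
rescue file, or in newer versions simply resubmitting the original .dag,
continues from where the previous run stopped.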
> This has consequences for us because in our business, deadlines are
> critical and resource utilization must be maximized. So progressive
> estimates of completion and remaining work are necessary. We need all
> nodes of the entire DAG to be present in the system to estimate
> resource use, even though many of the nodes may be blocked waiting for
> prerequisite nodes. Jobs that submit other jobs are a nasty surprise
> for our resource managers.
While I understand why you find it a limitation, it's in fact a great
thing for queue load and schedule balancing. When shot TDs submit a
hundred render layers a night, each consisting of a few hundred jobs, it
could easily choke the scheduler.
If you need information about the whole DAG, you still have multiple
options (see the sketch after this list):
- Add custom attributes to the DAGMan job that store how many tasks are
in the whole DAG by task type (ribgen / mi file gen / render /
composite / whatever).
- Make your job progress window parse the DAG files and build the GUI
based on that data instead of the queried or Quill DB data. It's a simple
way to get the hierarchical information.
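A rough sketch of both, assuming a hypothetical shot010.dag and made-up
attribute names:

    # generate the DAGMan submit description without submitting it
    condor_submit_dag -no_submit shot010.dag

    # edit shot010.dag.condor.sub and add lines like these above its
    # "queue" statement, then submit the DAGMan job yourself:
    #   +TotalRibGenTasks = 40
    #   +TotalRenderTasks = 120
    condor_submit shot010.dag.condor.sub

    # the GUI can read the counts back from the queue, e.g.
    condor_q -format "%d render tasks\n" TotalRenderTasks

    # or skip the queue entirely and count node types in the DAG file,
    # assuming the node names follow a convention like Render_*
    grep -c '^JOB Render' shot010.dag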
> Also re-running of any node and its dependent nodes is fairly common
> and is often done many times during pipeline troubleshooting; we don’t
> want to have to re-submit the entire DAG several times in separate
> runs because there may be long-running nodes in the DAG we want to
> continue in parallel while we’re working on other “broken” nodes.
Again, users usually don't care what happens under the hood if the GUI
shows what they are interested in. You can easily script the submission
of a DAG fragment, and it's up to the GUI to display it as part of the
original job. By adding custom attributes to these partial jobs you can
keep track of what's happening. Or, if you want absolute control, you can
write your own DAGMan replacement that handles job submission and
resubmission however you want.
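For example, re-running one render node and its comp dependent might look
like this (node and file names are hypothetical; in practice you would
generate the fragment from the original DAG rather than type it by hand):

    # rerun_f101.dag -- a hand-made fragment of the original DAG
    JOB    Render_f101  render_f101.sub
    JOB    Comp_f101    comp_f101.sub
    PARENT Render_f101 CHILD Comp_f101

    condor_submit_dag rerun_f101.dag

The rest of the original DAG keeps running in parallel; it is up to your
GUI (or the custom attributes mentioned above) to show the fragment as
part of the same logical job.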
> It looks to me like we’d have a hard time getting Condor/Dagman to
> support these needs. I’d love any advice / comments you might have on
> this.
Solving your needs is just a matter of writing the job submission and job
monitoring scripts yourself. The puzzling complexity of job scheduling is
what makes Condor a tough bird to handle (compared to off-the-shelf render
management software).
It's all just a matter of personal opinion, of course.
Cheers,
Szabolcs