Scott,

The noop route seems the most appealing. I tried it, and it appeared to work for a while, but I think I ran into a Condor bug ~200 jobs into the noop jobs:
... ...
3/15 17:20:03 submitting: condor_submit -a dag_node_name' '=' '97d736e41f382cd8477d1ff0e8ae484f -a +DAGManJobId' '=' '1653119 -a DAGManJobId' '=' '1653119 -a submit_event_notes' '=' 'DAG' 'Node:' '97d736e41f382cd8477d1ff0e8ae484f -a macrooutput' '=' 'L1-SIRE_FIRST_GRB070714B_injections21-868429419-2048.xml -a macrosummary' '=' 'L1-SIRE_FIRST_GRB070714B_injections21-868429419-2048.txt -a macrousertag' '=' 'GRB070714B_injections21 -a macroglob' '=' 'L1-INSPIRAL_FIRST_GRB070714B_injections21_150-868429419-2048.xml.gz -a macroifocut' '=' 'L1 -a +DAGParentNodeNames' '=' '"eb25ace1446075cf9f2d9f0eb93e0ae6" injections21.sire.GRB070714B_injections21.sub
3/15 17:20:04 From submit: Submitting job(s).
3/15 17:20:04 From submit: Logging submit event(s).
3/15 17:20:04 From submit: 1 job(s) submitted to cluster 1653253.
3/15 17:20:04 assigned Condor ID (1653253.0)
... ...
3/15 17:20:04 Number of idle job procs: 0
3/15 17:20:04 Event: ULOG_JOB_TERMINATED for Condor Node 97d736e41f382cd8477d1ff0e8ae484f (1653253.0)
3/15 17:20:04 BAD EVENT: job (1653253.0.0) ended, submit count < 1 (0)
3/15 17:20:04 BAD EVENT is warning only
3/15 17:20:04 ERROR "Assertion ERROR on (node->_queuedNodeJobProcs >= 0)" at line 3024 in file dag.C
On Mar 14, 2008, at 7:22 PM, Scott Koranda wrote:
Hi Nick,

Rather than setting 'executable = /bin/true', you could add 'hold = True' to the submit file. The child jobs will then be submitted and held, and will not run unless you explicitly call condor_release on them. In a similar way, you could set 'noop_job = True' for the child jobs, and the jobs will simply be marked as completed with a return value of 0.

Scott

Dear all,

After a DAG has run partway through, I've decided that the bottom-most post-processing jobs (several thousand of them) should not, or cannot, be run. When my rescue DAG comes, as it inevitably does, I would like not to execute these. So far, no problem; a one-line bash/sed invocation takes care of that:

cat $f | sed 's/.*mysubfile.*/& DONE/' > ${f}.sires_done

The problem is that not all of the parents have completed successfully. I'd like to resubmit the parents, but not these children. When I naively mark them as DONE, as above, I get the following error while dagman parses the DAG:

3/13 20:25:13 ERROR: AddParent( ea0bca7d3503cccca43dff66a99c1516 ) failed for node a5bf08f49f3323fdd5f838f6d89918f7: STATUS_DONE child may not be given a new STATUS_READY parent

Removing the JOB lines produces an error that the parent-child relationships refer to a non-existent job. (I don't have the exact message handy.) I see a few solutions, none of which I like:

* resubmit without modification and let the children fail (wastes resources)
* change the submit files to point to /bin/true and run in the local universe (a lot of scheduling overhead, I'd think, but maybe this is negligible)
* identify all nodes of a class and remove all references to each of them (more code than I want to write at the moment)

Can I get some gut reactions to these options, or perhaps new, cleverer options?
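[For reference, a minimal sketch of the sed trick described above. The DAG file contents and node names here are illustrative only, not taken from the actual DAG in question.]

```shell
# Build a tiny example DAG file (hypothetical node names).
printf 'JOB nodeA mysubfile.sub\nJOB nodeB other.sub\nPARENT nodeB CHILD nodeA\n' > example.dag

# Append DONE to every JOB line that references mysubfile (the
# post-processing submit file), as in the one-liner from the thread.
# '&' in the replacement re-inserts the whole matched line.
sed 's/.*mysubfile.*/& DONE/' example.dag > example.dag.sires_done

cat example.dag.sires_done
```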
Thanks,
Nick

===================================
Nickolas Fotopoulos
nvf@xxxxxxxxxxxxxxxxxxxx
Office: (414) 229-6438
Fax: (414) 229-5589
University of Wisconsin - Milwaukee
Physics Bldg, Rm 471
===================================

_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/condor-users/