Scott,
The noop route seems the most appealing. I tried it and it appeared to work for a while, but I think I ran into a Condor bug ~200 jobs into the noop jobs:
...
...
3/15 17:20:03 submitting: condor_submit -a dag_node_name' '=' '97d736e41f382cd8477d1ff0e8ae484f -a +DAGManJobId' '=' '1653119 -a DAGManJobId' '=' '1653119 -a submit_event_notes' '=' 'DAG' 'Node:' '97d736e41f382cd8477d1ff0e8ae484f -a macrooutput' '=' 'L1-SIRE_FIRST_GRB070714B_injections21-868429419-2048.xml -a macrosummary' '=' 'L1-SIRE_FIRST_GRB070714B_injections21-868429419-2048.txt -a macrousertag' '=' 'GRB070714B_injections21 -a macroglob' '=' 'L1-INSPIRAL_FIRST_GRB070714B_injections21_150-868429419-2048.xml.gz -a macroifocut' '=' 'L1 -a +DAGParentNodeNames' '=' '"eb25ace1446075cf9f2d9f0eb93e0ae6" injections21.sire.GRB070714B_injections21.sub
3/15 17:20:04 From submit: Submitting job(s).
3/15 17:20:04 From submit: Logging submit event(s).
3/15 17:20:04 From submit: 1 job(s) submitted to cluster 1653253.
3/15 17:20:04 assigned Condor ID (1653253.0)
...
...
3/15 17:20:04 Number of idle job procs: 0
3/15 17:20:04 Event: ULOG_JOB_TERMINATED for Condor Node 97d736e41f382cd8477d1ff0e8ae484f (1653253.0)
3/15 17:20:04 BAD EVENT: job (1653253.0.0) ended, submit count < 1 (0)
3/15 17:20:04 BAD EVENT is warning only
3/15 17:20:04 ERROR "Assertion ERROR on (node->_queuedNodeJobProcs >= 0)" at line 3024 in file dag.C
On Mar 14, 2008, at 7:22 PM, Scott Koranda wrote:
Hi Nick,
Rather than setting 'executable = /bin/true' you could add to
the submit file 'hold = True'. The child jobs will then be submitted
and held and will not run unless you explicitly call
condor_release on them.
In a similar way you could set 'noop_job = True' for the child
jobs and the jobs will simply be marked as completed with a
return value of 0.
Scott
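[For concreteness, the two approaches Scott describes might look like this in a child job's submit description. This is a sketch only; the executable path is a placeholder, and you would use one approach or the other, not both:]

```
# Sketch of a child-job submit file (names are hypothetical).
executable = /path/to/postprocess

# Approach 1: submit on hold. The job sits idle and never runs
# unless you explicitly condor_release it.
hold = True

# Approach 2 (use instead of hold): mark the job as a no-op.
# It is logged as completed with exit status 0 without running.
# noop_job = True

queue
```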
Dear all,
After a DAG has run partway through, I've decided that the bottom-most post-processing jobs (several thousand of them) should not, and cannot, be run. When my rescue DAG comes, as it inevitably does, I would like not to execute these. So far, no problem; a one-line bash/sed invocation takes care of that:
sed 's/.*mysubfile.*/& DONE/' "$f" > "${f}.sires_done"
The problem is that not all of the parents have completed
successfully. I'd like to resubmit the parents, but not these
children. When I naively mark them as DONE, as above, I get the
following error while dagman parses the DAG.
3/13 20:25:13 ERROR: AddParent( ea0bca7d3503cccca43dff66a99c1516 ) failed for node a5bf08f49f3323fdd5f838f6d89918f7: STATUS_DONE child may not be given a new STATUS_READY parent
Removing the JOB lines produces an error that the parent-child
relationships refer to a non-existent job. (I don't have the exact
message handy.)
I see a few solutions, none of which I like:
* resubmit without modification and let the children fail (wastes resources)
* change the submit files to point to /bin/true and run in the local universe (a lot of scheduling overhead, I'd think, but maybe this is negligible)
* identify all nodes of a class and remove all references to each of them (more code than I want to write at the moment)
Can I get some gut reactions to these options, or perhaps new, cleverer options?
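[The third option could be sketched in a few lines of Python, assuming the plain JOB / PARENT ... CHILD syntax of a .dag file. The function name and the simplistic keyword handling below are illustrative assumptions, not DAGMan-blessed behavior:]

```python
def strip_nodes(dag_text, subfile):
    """Drop every node whose JOB line uses `subfile`, then scrub the
    doomed node names out of PARENT ... CHILD ... edges and out of
    keywords (RETRY, VARS, ...) whose second token is a node name.
    A sketch only: SCRIPT lines name the node in their third token
    and would need an extra case."""
    doomed = set()
    survivors = []
    for line in dag_text.splitlines():
        parts = line.split()
        if len(parts) >= 3 and parts[0] == "JOB" and parts[2] == subfile:
            doomed.add(parts[1])      # remember the node name...
            continue                  # ...and drop its JOB line
        survivors.append(line)

    out = []
    for line in survivors:
        parts = line.split()
        if parts and parts[0] == "PARENT" and "CHILD" in parts:
            i = parts.index("CHILD")
            parents = [p for p in parts[1:i] if p not in doomed]
            children = [c for c in parts[i + 1:] if c not in doomed]
            if parents and children:  # keep only edges with both ends alive
                out.append("PARENT %s CHILD %s"
                           % (" ".join(parents), " ".join(children)))
            continue
        if len(parts) >= 2 and parts[1] in doomed:
            continue                  # RETRY/VARS/etc. for a doomed node
        out.append(line)
    return "\n".join(out)
```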
Thanks,
Nick
===================================
Nickolas Fotopoulos
nvf@xxxxxxxxxxxxxxxxxxxx
Office: (414) 229-6438
Fax: (414) 229-5589
University of Wisconsin - Milwaukee
Physics Bldg, Rm 471
===================================