I ended up solving my problem by jerry-rigging my own crappy DAGMan. Basically, in the script that I give to Condor as an executable, I have the following two lines:
while [ $(condor_q -json 10 11 12 | wc -l) -gt 0 ]; do sleep 5; echo "Sleeping on 10 11 12"; done
condor_history 10 11 12 -af ExitCode | awk '$1 != 0 {exit 1}' || exit 1
Where "10 11 12" is a list of the clusterids I depend on (I don't see why jobids couldn't be added as well). The first line blocks until all the dependent jobs are done,
and the next line checks their exitcodes and exits with -1 if any of them are nonzero (and lets the script continue otherwise).
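For anyone who wants to reuse this, those two lines can be wrapped up roughly like so (a minimal sketch; the function name and the 5-second poll interval are just my choices):

# Block until the given clusterids leave the queue, then fail if any
# of their jobs recorded a nonzero exit code.
wait_for_clusters() {
    # condor_q prints nothing once none of the clusters are queued anymore
    while [ "$(condor_q -json "$@" | wc -l)" -gt 0 ]; do
        echo "Sleeping on $*"
        sleep 5
    done
    # Scan the recorded exit codes; awk exits 1 on the first nonzero one
    condor_history "$@" -af ExitCode | awk '$1 != 0 {exit 1}'
}

wait_for_clusters 10 11 12 || exit 1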
My existing code was already doing a loop like this: it starts from the base nodes, submits them, and records their clusterids so that the next iteration of the loop
can add those clusterids to the list above (the existing code uses SLURM, and I am transitioning us to Condor). As I write this I realize this approach has the side effect of not
allowing nodes at the same level to run in parallel. I guess I'll have to fix that later.
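For what it's worth, the submission side of that loop looks roughly like this (a sketch, not my actual code; the submit file names are made up):

# Submit one level of the graph and collect the clusterids that the
# next level will wait on.
deps=""
for sub in node_a.sub node_b.sub; do
    # condor_submit -terse prints something like "42.0 - 42.0";
    # everything before the first dot is the clusterid
    cluster=$(condor_submit -terse "$sub" | awk -F. '{print $1; exit}')
    deps="$deps $cluster"
done
echo "next level waits on:$deps"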
I think another option for this scenario would be to have the job A from my original question look for an existing instance of A and use the condor_q/condor_history lines above to latch onto the existing
job and its exit code instead of starting a new one. Then I could use DAGMan properly.
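Sketched out, that idea would look something like this (assuming A's jobs can be identified by a batch name; the batch name and submit file are hypothetical):

# Latch onto a running A if there is one, otherwise start a new one
existing=$(condor_q -constraint 'JobBatchName == "A"' -af ClusterId | head -n 1)
if [ -n "$existing" ]; then
    cluster="$existing"
else
    cluster=$(condor_submit -terse A.sub | awk -F. '{print $1; exit}')
fi
# ...then wait on $cluster with the condor_q/condor_history lines above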
I had tried using condor_wait, but unfortunately it needs the log file in addition to the cluster/job id, and then I'd still need to use condor_history to get the exit codes.
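i.e. something along these lines (log path hypothetical):

condor_wait /path/to/a.log 10        # needs the job's UserLog file
condor_history 10 -af ExitCode       # still needed for the exit code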
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Belakovski, Nickolai via HTCondor-users <htcondor-users@xxxxxxxxxxx>
Sent: Friday, September 27, 2024 2:42 PM
To: htcondor-users@xxxxxxxxxxx <htcondor-users@xxxxxxxxxxx>
Cc: Belakovski, Nickolai <nickolai.belakovski@xxxxxxxx>
Subject: [HTCondor-users] Can a DAGMan node be an already running job?
Let's say I have the following two DAGs
A -> B
A -> C
Where A is a long-running job. There's an A job running because someone has invoked the A -> B workflow. Now someone comes along and wants to invoke the A -> C workflow. I don't want to re-run A, but I can't mark it as DONE in the DAG file because then C will
start running immediately.
Do I need to go outside condor/dagman and have A write to some sort of "done" file and have C block on that file before running?
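For concreteness, the A -> C DAG file would be something like (submit file names made up):

JOB A a.sub
JOB C c.sub
PARENT A CHILD C

and marking A as done would mean changing its line to "JOB A a.sub DONE", at which point DAGMan considers A complete and releases C right away.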