I have a big DAG job I run on a Windows pool, and sometimes I want to
cancel it midway through. If the cluster_id of the DAG job itself is
2054, issuing 'condor_rm 2054' seems like it's supposed to clean things
up, but I'm having problems. It seems to screw up quite often in
different ways. I get errors like "Couldn't find/remove all jobs in
cluster 2054". I get jobs stuck in the "X" state even though this is all on
a LAN and I can see that nothing is left running. Sometimes jobs are
left stuck permanently in the "I" state, but then condor_release on the
job fails. Also, sometimes I get ghost condor_shadow processes on the
submit machine even though condor_q is empty and there is clearly
nothing left running in the pool. I have to manually kill the
condor_shadow processes.
Is there a better way to terminate a DAG job? For example, is there some
sort of constraint argument to condor_rm that takes the cluster id?