I have a big DAG job that I run on a Windows pool, and sometimes I want to cancel it midway through. If the cluster ID of the DAG job itself is 2054, issuing 'condor_rm 2054' seems like it's supposed to clean things up, but I'm having problems. It goes wrong quite often, and in different ways. I get errors like "Couldn't find/remove all jobs in cluster 2054". I get jobs stuck in the 'X' state, even though this is all on a LAN and I can see that nothing is left running. Sometimes jobs are left stuck permanently in the 'I' state, and condor_release on those jobs fails. Also, I sometimes get ghost condor_shadow processes on the submit machine even though condor_q is empty and there is clearly nothing left running in the pool, so I have to kill the condor_shadow processes manually.

Is there a better way to terminate a DAG job? Perhaps some sort of constraint argument to condor_rm, used together with the cluster ID? Thanks.