
[HTCondor-users] Restart submitted dag



Hi there, 

I have been using DAGMan to organize workflows, and it's been great. Recently I have run into an issue where a DAG is left with one or two unfinished tasks: condor_q shows these tasks as still running. Each task's stderr shows that it hit an OSError, but HTCondor does not stop the task. I have found that removing the whole DAG and resubmitting it via the rescue DAG fixes the issue (the error is unpredictable and transient). But to do that, I need to dig out the DAG file I submitted previously. I have two questions:

* Is there a smart way to remove a DAG and resubmit it, either through the CLI or the Python bindings, without knowing the location of the DAG file? Something like a restart functionality for a DAG that recognizes the rescue DAG.

* Are there known issues where a task is not recognized as terminated by HTCondor? I am on Debian 10, so I can only use HTCondor 9 on my system. Could this be a bug? Maybe I can set a maximum job run time as a workaround? Any idea which config I need to set for HTCondor? For context, I am running HTCondor on my personal computer, so I can configure the pool.

Thanks a lot!

Best,
Lunyang