Hi Lunyang,
I got ahead of myself and, before writing up an answer, quickly wrote a Python script that does what you want using the HTCondor Python bindings. Given the cluster ID of a DAGMan proper job, the script does the following:
The script makes some assumptions that keep it from being fully comprehensive, such as:
I have attached the script. Feel free to check it out, use it, and/or modify it.
Hope this helps,
Cole Bollig
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Pelletier, Michael V. RTX via HTCondor-users <htcondor-users@xxxxxxxxxxx>
Sent: Friday, October 27, 2023 8:41 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: Pelletier, Michael V. RTX <Michael.V.Pelletier@xxxxxxx>
Subject: Re: [HTCondor-users] [External] Restart submitted dag

The view of HTCondor into the jobs it is running only extends to the level of the process and its ID number. The only way that HTCondor recognizes that a task is terminated is when the process terminates and delivers an exit code.
If the OSError is being caught in some way, and not resulting in the exit of the process, there's nothing visible to HTCondor that would indicate that it is not still running. You can see this kind of behavior sometimes with certain versions of MATLAB: when you call it from the command line and the function call or routine you specified fails, it drops you to the MATLAB command prompt instead of exiting MATLAB, leaving the process hanging, waiting for user input that will never come. I think the "-batch" command line option for MATLAB does an implicit exit(); after the function call, but it's also common to put that in the command line as well.

So, take a closer look at the failed task and see what's going on around it. Maybe a subprocess failed and the parent process didn't pass along that failure into its own termination and exit code. Remember, the "startd" starts the "starter," and the starter starts the executable/arguments. I find "pstree" useful for dissecting this sort of situation.

Michael Pelletier
Principal Technologist
High Performance Computing
Infrastructure & Workplace Services
C: +1 339.293.9149
michael.v.pelletier@xxxxxxx

-----Original Message-----
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of ???
Sent: Thursday, October 26, 2023 8:41 PM
To: htcondor-users@xxxxxxxxxxx
Subject: [External] [HTCondor-users] Restart submitted dag

Hi there,

I have been using DAGMan to organize workflows. It’s been great. Recently I ran into issues where a DAG has one or two tasks left unfinished. condor_q just shows these tasks as still running. The task’s stderr shows that the task ran into an OSError, but HTCondor does not stop the task. I have found that removing the whole DAG and resubmitting it via the rescue DAG fixes the issue (the error is unpredictable and transient). But to do that, I need to dig out the DAG file I submitted previously.
I have two questions:
* Is there a smart way to remove a DAG and resubmit it, through either the CLI or the Python bindings, without knowing the location of the DAG file? Something like a restart function for a DAG that recognizes the rescue DAG.
* Are there known issues where a task is not recognized as terminated by HTCondor? I am using Debian 10, so I can only use HTCondor 9 on my system. Are there perhaps bugs? And maybe I can set a maximum job run time as a workaround? Any idea which config I need to set for condor?

For context, I am running condor on my personal computer, so I can configure the pool.

Thanks a lot!
Best,
Lunyang
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://urldefense.us/v2/url?u=https-3A__lists.cs.wisc.edu_mailman_listinfo_htcondor-2Dusers&d=DwIGaQ&c=MASr1KIcYm9UGIT-jfIzwQg1YBeAkaJoBtxV_4o83uQ&r=4PJgb1eyyvhzSV4fRwSECGK3jb50YP8vZUAedXybzgaNykar_o0SxKOUPkRHE0WG&m=mSAlYyj4nzWLkREmXxdJbW8GGSfsF4nfK4pRMxeAChdyCHeFiejvACuYtg7jG-QN&s=0zLoofQWlpAWvo2xdR0Mz9ZpnmvHLLQZ1sMbYykn6E8&e=
The archives can be found at:
https://urldefense.us/v2/url?u=https-3A__lists.cs.wisc.edu_archive_htcondor-2Dusers_&d=DwIGaQ&c=MASr1KIcYm9UGIT-jfIzwQg1YBeAkaJoBtxV_4o83uQ&r=4PJgb1eyyvhzSV4fRwSECGK3jb50YP8vZUAedXybzgaNykar_o0SxKOUPkRHE0WG&m=mSAlYyj4nzWLkREmXxdJbW8GGSfsF4nfK4pRMxeAChdyCHeFiejvACuYtg7jG-QN&s=agS-4wJQnrr0KLnqvD-GNREQ_zS_kl3CpfONXvucjtg&e=
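
Michael's point in the thread above, that HTCondor only learns a task has ended when the process exits and delivers an exit code, can be illustrated with a small job wrapper. This is a minimal sketch and not something posted in the thread: run_and_propagate is a hypothetical helper name, and the exit codes 124 and 127 are borrowed from common shell conventions purely for illustration. The optional timeout relates to the "max run time" workaround raised in the question, enforced here inside the job itself rather than by HTCondor.

```python
import subprocess
import sys


def run_and_propagate(cmd, timeout=None):
    """Run cmd and return an exit code that reflects any failure.

    Hypothetical helper: the point is that every failure path ends in a
    non-zero exit code instead of a caught exception and a lingering process.
    """
    if not cmd:
        print("usage: wrapper.py COMMAND [ARGS...]", file=sys.stderr)
        return 2
    try:
        # subprocess.run kills the child and raises if the timeout expires.
        completed = subprocess.run(cmd, timeout=timeout)
    except subprocess.TimeoutExpired:
        print(f"{cmd!r} exceeded {timeout} seconds", file=sys.stderr)
        return 124  # illustrative choice, mirrors the 'timeout' utility
    except OSError as exc:
        # For example: the executable is missing or not runnable.
        print(f"failed to start {cmd!r}: {exc}", file=sys.stderr)
        return 127  # illustrative choice, mirrors the shell convention
    rc = completed.returncode
    # A negative return code means the child was killed by a signal.
    return rc if rc >= 0 else 128 - rc
```

Used as the job's executable, the wrapper would end with sys.exit(run_and_propagate(sys.argv[1:])), so that the starter sees the process terminate with the child's status.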
Attachment:
dag_restart
Description: dag_restart
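
The attached script itself is not reproduced in this archive. As a rough illustration of the idea discussed in the thread (look up the DAGMan job by cluster ID, remove it, then resubmit the same DAG so the rescue DAG is picked up), here is a minimal sketch using the HTCondor Python bindings. It is not the attached script. The function names are hypothetical, and it assumes that the DAG was submitted with condor_submit_dag defaults, so that the DAGMan job's UserLog attribute is "<dag file>.dagman.log", and that resubmitting without the force option runs the newest rescue DAG.

```python
import os
import time

DAGMAN_LOG_SUFFIX = ".dagman.log"


def dag_file_from_ad(ad):
    """Guess the DAG file path from a DAGMan job ad (hypothetical helper).

    Assumption: UserLog is '<dag file>.dagman.log', as written by a default
    condor_submit_dag run; a relative path is resolved against Iwd.
    """
    user_log = ad["UserLog"]
    if not user_log.endswith(DAGMAN_LOG_SUFFIX):
        raise ValueError(f"UserLog {user_log!r} does not look like a DAGMan log")
    dag_path = user_log[: -len(DAGMAN_LOG_SUFFIX)]
    if not os.path.isabs(dag_path):
        dag_path = os.path.join(ad["Iwd"], dag_path)
    return dag_path


def restart_dag(cluster_id, wait_seconds=60):
    """Remove the DAGMan job with this cluster ID and resubmit its DAG."""
    import htcondor  # imported here so the helper above works without it

    schedd = htcondor.Schedd()
    constraint = f"ClusterId == {int(cluster_id)}"
    ads = schedd.query(constraint=constraint, projection=["Iwd", "UserLog"])
    if not ads:
        raise LookupError(f"no job with cluster id {cluster_id} in the queue")
    iwd = ads[0]["Iwd"]
    dag_path = dag_file_from_ad(ads[0])

    # Removing the DAGMan job should also remove its node jobs and leave a
    # rescue DAG behind; wait until the job has actually left the queue.
    schedd.act(htcondor.JobAction.Remove, constraint)
    deadline = time.time() + wait_seconds
    while schedd.query(constraint=constraint, projection=["ClusterId"]):
        if time.time() > deadline:
            raise TimeoutError("DAGMan job did not leave the queue in time")
        time.sleep(1)

    # Submit from the original working directory. No 'force' option is
    # passed: forcing is assumed to rerun the original DAG from scratch
    # instead of continuing from the rescue DAG.
    os.chdir(iwd)
    submit = htcondor.Submit.from_dag(dag_path)
    return schedd.submit(submit).cluster()
```

Calling restart_dag(1234) would then return the cluster ID of the new DAGMan job. Jobs that are not DAGMan jobs are rejected only through the UserLog check, so a more careful version would inspect the job ad further before removing anything.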