From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Ben Cotton <ben.cotton@xxxxxxxxxxxxxxxxxx>
Sent: Wednesday, June 15, 2016 1:31 PM
To: HTCondor-Users Mail List
Subject: Re: [HTCondor-users] proposed change in DAGMan

On Wed, Jun 15, 2016 at 2:08 PM, R. Kent Wenger <wenger@xxxxxxxxxxx> wrote:
> The proposed change is that, if DAGMan is "stuck" because all queued node
> jobs are on hold (and there are no ready jobs, running PRE/POST scripts,
> etc.), DAGMan will consider this a failure and abort the DAG (which results
> in all queued node jobs being removed, and a rescue DAG being generated).

> I'm curious as to the motivation for this. If I understand the proposal correctly, this leaves workflows with a single node at some level (e.g. diamond DAGs) vulnerable to instant-kaboom if there's a problem. Sure, the user can just submit the rescue DAG, but that doesn't help if the submission happens through some intermediary (which is a common use case for some of our customers).

What if it was a timeout? In other words, the config setting is "abort if the DAG has been stuck for at least N seconds"?
The motivation is that right now, if a DAG gets into the "stuck" state, it will stay in that state forever unless the user does something (or the node jobs get released somehow), and it's not very obvious to the user what's going on.
> I think this functionality would be a good addition, but why opt-out instead of opt-in?

Well, if it's opt-in probably very few users will take advantage of it...
Kent