Subject: Re: [HTCondor-users] proposed change in DAGMan
From: "R. Kent Wenger" <wenger@xxxxxxxxxxx>
Date: 06/15/2016 02:12 PM
> We are proposing a change in DAGMan behavior relative to node jobs that
> are on hold, and before implementing it, we wanted to get feedback from
> the HTCondor user community.
>
> Right now, DAGMan will wait indefinitely for jobs that are on hold, even
> if *all* of the node jobs for the DAG are on hold and, therefore, no
> progress is being made.
>
> The proposed change is that, if DAGMan is "stuck" because all queued node
> jobs are on hold (and there are no ready jobs, running PRE/POST scripts,
> etc.), DAGMan will consider this a failure and abort the DAG (which
> results in all queued node jobs being removed, and a rescue DAG being
> generated).
>
> Users would be able to opt out of the new behavior via a configuration
> setting.
>
> Please let us know what you think of this proposal...
My recently implemented update_job_info hook enables users to run a
periodic hold and periodic release to restart a hung-but-running job.
Perhaps DAGMan could wait for an update interval to elapse before taking
action, to ensure that a held job isn't going to be released on the next
pass?
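
To make the suggestion concrete, here is a rough sketch of the check with a grace period added, in Python purely for illustration. DagState, is_stuck, should_abort, and grace_seconds are names I made up for this example; they are not DAGMan internals or existing configuration settings, and the real implementation may look nothing like this.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class DagState:
    """Hypothetical snapshot of a DAG's progress (not DAGMan's data model)."""
    queued_jobs: int      # node jobs currently in the queue
    held_jobs: int        # queued node jobs that are on hold
    ready_nodes: int      # nodes ready to be submitted
    running_scripts: int  # PRE/POST scripts currently running


def is_stuck(state: DagState) -> bool:
    """True when every queued node job is held and nothing else can progress."""
    return (
        state.queued_jobs > 0
        and state.held_jobs == state.queued_jobs
        and state.ready_nodes == 0
        and state.running_scripts == 0
    )


def should_abort(
    state: DagState,
    stuck_since: Optional[float],
    now: float,
    grace_seconds: float,
) -> Tuple[bool, Optional[float]]:
    """Return (abort, stuck_since) for one polling pass.

    The DAG is aborted only after it has been stuck continuously for at
    least grace_seconds (an assumed setting), so a periodic release gets
    a chance to run first. Any progress resets the timer.
    """
    if not is_stuck(state):
        return False, None
    if stuck_since is None:
        stuck_since = now
    return (now - stuck_since) >= grace_seconds, stuck_since
```

The caller would keep stuck_since between polling passes and set grace_seconds to at least one update interval, so that a job held on one pass and released on the next never triggers the abort.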