Re: [Condor-users] My dag is frozen




Hey Kent,

Thanks so much for the tip. condor_hold and condor_release worked: they forced the stuck subdag to exit with status 1, which in turn produced a main_dag.rescue001. I have resubmitted that, and jobs are now being submitted to the cluster again.

Note that we're still running Condor v7.0.1 here on morgane, and the trick worked anyway.

Cheers,
Lucia


R. Kent Wenger wrote:
On Fri, 18 Jul 2008, Lucia Santamaria wrote:

for the last 4 days my dagman in morgane seems frozen and doesn't trigger
more jobs. The dag is not yet finished, as you can see if you execute
...
and also, _no_ rescue dag is created. If I look at one of the
dag.dagman.out corresponding to one of the subdags that are not yet
finished (for instance, cat2 dags in nsbhinj):
...
7/18 14:15:26 319886 seconds since last log event
7/18 14:15:26 Pending DAG nodes:
7/18 14:15:26   Node 20abc05cccfa0bf1b7e41fa441b90524, Condor ID 169496, status STATUS_SUBMITTED
7/18 14:25:26 320486 seconds since last log event
...
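
A stall like the one in the excerpt above can be spotted by scanning the dagman.out file for the "seconds since last log event" lines. The following is a minimal sketch, assuming the lines look exactly like the two quoted here; the function name longest_stall and the regular expression are illustrative and not part of Condor.

```python
import re

# Matches lines such as "7/18 14:15:26 319886 seconds since last log event".
# The pattern is inferred from the two lines quoted in this thread only.
STALL_RE = re.compile(
    r"^(\d+/\d+ \d+:\d+:\d+) (\d+) seconds since last log event"
)

def longest_stall(lines):
    """Return (timestamp, seconds) for the largest reported gap, or None."""
    worst = None
    for line in lines:
        match = STALL_RE.match(line.strip())
        if match is None:
            continue
        seconds = int(match.group(2))
        if worst is None or seconds > worst[1]:
            worst = (match.group(1), seconds)
    return worst
```

Passing an open file object works as well as a list of strings, for example longest_stall(open("dag.dagman.out")). A result in the hundreds of thousands of seconds corresponds to a DAG that has been idle for days.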

I've seen something like that once before. At that time, the cause was that the file descriptor for a node job's user log file somehow became disconnected from the actual file, without creating any errors when it was read -- it just never reported any more bytes available (but poking around in /proc/*/fd revealed some problems). That might be what's happening now. (BTW, are your user log files on a local filesystem? I vaguely remember that in the previous case the user log files may have been on a shared filesystem.)
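
The "poking around in /proc/*/fd" step can be sketched as below, for Linux only. The names open_files and suspicious are hypothetical, and the "(deleted)" check is just one symptom the kernel exposes for an open file that has been unlinked; the message above does not say which problems were actually found.

```python
import os

def open_files(pid, proc_root="/proc"):
    """Map each open descriptor number of `pid` to its symlink target."""
    fd_dir = os.path.join(proc_root, str(pid), "fd")
    targets = {}
    for name in os.listdir(fd_dir):
        try:
            targets[int(name)] = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The descriptor was closed between listdir() and readlink().
            continue
    return targets

def suspicious(targets):
    """Keep only descriptors whose target is marked as deleted."""
    return {fd: path for fd, path in targets.items()
            if path.endswith(" (deleted)")}
```

Calling suspicious(open_files(pid)) with the process ID of the stuck condor_dagman would list any descriptors that point at a file no longer present under its original name. Reading another user's /proc entries usually requires running as that user or as root.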

Anyhow, if you do a condor_hold and then a condor_release on the "stuck" condor_dagman(s), I think that will fix things. (Hopefully you are running the 7.1.1 pre-release DAGMan, which has the "fast recovery" fix.)
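
The hold-then-release step might be scripted roughly as follows. This is a sketch, not an official procedure: the helper names are hypothetical, and the job ID has to be that of the condor_dagman job itself (as shown by condor_q), not of one of its node jobs.

```python
import subprocess

def hold_release_commands(job_id):
    """Build the two command lines, in the order they should run."""
    return [["condor_hold", str(job_id)], ["condor_release", str(job_id)]]

def hold_then_release(job_id, runner=subprocess.run):
    """Hold and then release one job, stopping if either command fails."""
    for command in hold_release_commands(job_id):
        # check=True raises CalledProcessError on a non-zero exit status.
        runner(command, check=True)
```

The runner parameter exists only so the sequence can be exercised without a Condor installation. In practice it may be worth confirming with condor_q that the job has actually reached the held state before releasing it.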

Kent Wenger
Condor Team

--
--------------------------------------------
Lucia Santamaria
Max-Planck-Institut fuer Gravitationsphysik
Albert-Einstein-Institut
Am Muehlenberg 1, 17746 Golm, Germany
Office: +49(0)331-567-7181
---------------------------------------------