Re: [Condor-users] no rescue file created
- Date: Thu, 17 May 2012 09:50:49 -0500
- From: Nathan Panike <nwp@xxxxxxxxxxx>
- Subject: Re: [Condor-users] no rescue file created
On Thu, May 17, 2012 at 08:12:08AM -0400, Gautam Saxena wrote:
Hi Gautam:
> Are there circumstances when a rescue file fails to get created? And if so,
> is there a way to force its recreation?
>
> This is what happened: We were running a reasonably large DAG over 10 days
> or so.
> One of the main machines (the submitting machine actually) rebooted.
> (Not sure if this reboot is relevant.)
Yes it is relevant.
> Eventually, the dag seemed to finish
> (in that there was nothing actually running on any machine), but the "dag"
> job showed that there was 1 job on hold plus there was the actual dag job
> itself.
Was the DAGMan job itself on hold?
> So, I did a condor_rm on the job that was on hold. That operation
> both removed the "held" job as well as the "dag" job itself.
> However, no rescue file was created.
This definitely sounds wrong. Can you send me the .dagman.out file from
the run?
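
In case it helps with locating those files, here is a minimal sketch that looks next to the DAG file for the .dagman.out log and for any rescue files. It assumes the usual naming, where both are derived from the DAG file name (`<dagfile>.dagman.out`, and `<dagfile>.rescue001`, `.rescue002`, ... or a plain `<dagfile>.rescue` on older versions). The function name `find_dag_artifacts` and the example path are hypothetical, made up for illustration.

```python
import glob
import os


def find_dag_artifacts(dag_path):
    """Return (dagman_out, rescues) for the DAG file at dag_path.

    dagman_out is the path of <dag_path>.dagman.out, or None if absent.
    rescues is a sorted list of files matching <dag_path>.rescue*.
    Assumption: DAGMan wrote both next to the DAG file.
    """
    out_path = dag_path + ".dagman.out"
    dagman_out = out_path if os.path.isfile(out_path) else None

    # glob.escape guards against special characters in the DAG path itself.
    rescues = sorted(glob.glob(glob.escape(dag_path) + ".rescue*"))
    return dagman_out, rescues


if __name__ == "__main__":
    # Hypothetical path; substitute the DAG file that was submitted.
    out, rescues = find_dag_artifacts("my_workflow.dag")
    print("dagman.out:", out or "not found")
    print("rescue files:", rescues or "none")
```

An empty rescue list together with an existing .dagman.out would match the situation described above, and the tail of that log is the part worth sending.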
> Is this normal? (Also, I've noticed that if I do a condor_rm on the
> dag job itself, it will not produce a rescue file either -- is that
> normal too?)
If the condor_dagman job was on hold when you did a condor_rm, this is a
known bug.
Some relevant historical information is in the following tickets:
https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2765
https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=1490
https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2434
https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2213
>
> -Gautam