Re: [Condor-users] startd hangs when using job hooks
- Date: Tue, 9 Feb 2010 10:47:53 -0500
- From: Michael Moore <mtmoore@xxxxxxxxxxx>
- Subject: Re: [Condor-users] startd hangs when using job hooks
On Tue, Feb 09, 2010 at 10:04:16AM -0500, Matthew Farrellee wrote:
> Michael Moore wrote:
> > I am trying to implement a set of fetch and prepare hooks. However, when
> > testing the hooks I experience hangs of condor_startd. When startd hangs
> > it stops responding to requests, including Condor shutdown commands.
> > Only a process-level kill ends the process.
> >
> > The host running the hooks is a Windows Vista host running Condor 7.4.1.
> > The prepare hook does take some time to run (on the order of minutes).
> > However, startd does not always hang during the prepare hook. Sometimes
> > startd hangs after the job begins executing, sometimes it doesn't hang
> > at all.
> >
> > Has anyone else seen similar behavior? Was there a way to work around
> > the problem? Apparently, there was a similar problem in 7.3.2 and prior
> > where a very simple fetch hook would cause startd to hang. I haven't
> > figured out what portion of the hook triggers this behavior; it's very
> > intermittent.
> >
> > Thanks,
> > Michael Moore
>
> A few issues with hooks on Windows...
>
> http://condor-wiki.cs.wisc.edu/index.cgi/search?s=hook+windows
>
> Specifically...
>
> http://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=422
> http://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=864
>
> Do either of those sound like your problem?
>
> I believe one of those is related to using Windows on a machine with
> many CPUs -- or at least it is more reproducible there.
>
> Best,
>
>
> matt

Matt,

Ticket 422 is the previous issue I mentioned above. I tested to make
sure I wasn't seeing that issue, and it appears to be correctly resolved
in my testing. The second issue may exist, but I don't get that far:
startd hangs before the job completes. Ticket 864 is not the issue I'm
seeing. A good way to describe it is that the symptoms are the same as
in ticket 422, but the issue is not as reproducible and is not caused by
the simple case provided in that ticket.

I can confirm I see the issue when I force the number of slots to 1. I
don't know how reproducible it is in that configuration.

Thanks for the help!

Michael