RE: [Condor-users] Strange schedd crash (exit status 44)
- Date: Thu, 25 Nov 2004 12:02:42 -0500
- From: "Ian Chesal" <ICHESAL@xxxxxxxxxx>
- Subject: RE: [Condor-users] Strange schedd crash (exit status 44)
> -----Original Message-----
> From: condor-users-bounces@xxxxxxxxxxx
> [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of matthew hope
> Sent: November 25, 2004 11:56 AM
> To: Condor-Users Mail List
> Subject: Re: [Condor-users] Strange schedd crash (exit status 44)
>
> On Thu, 25 Nov 2004 11:41:22 -0500, Ian Chesal
> <ichesal@xxxxxxxxxx> wrote:
> > > From: On Behalf Of matthew hope
> > >
> > > If you keep the submitters at 6.6 and the startd's (and possibly
> > > negotiator?) at 6.7.2 then the retirement should work (modulo not
> > > being able to choose to use less)
> > >
> > > caveat: this is working from the docs not from having tried it
> >
> > Okay. I'll have to try this out. It seems like a complicated surgery
> > to perform though: more than just schedd would have to be replaced,
> > no? Wouldn't condor_q and condor_submit also need to be reverted to
> > the 6.6.x binaries?
>
> I meant keep the entire submit machine at 6.6, i.e. uninstall /
> reinstall
>
> > > shadows and schedd's both have serious stability issues
> >
> > Agreed. All of our troubles have been with schedd and shadows. To
> > the condor team then: when can we expect these to stabilize?
>
> I understand that the team are looking into it but that the
> Thanksgiving holiday in the US is inevitably going to add some delay.
> In fairness it is a dev release, I am thinking I was too
> hasty in assuming it would scale to high loads (but how can I
> test prod load without using the prod system :).
Ahh. Forgot about those darn US holidays.
> Has anyone else got a high load (>100 execute nodes) with a
> small number of submit points keeping the whole thing highly loaded?
>
> If they have on *nix land but not windows this would indicate
> strongly that it was a windows port issue.
I have successfully submitted 200+ jobs from my Linux machine that were
targeting our Windows startd machines and the schedd and shadows ran
without any problems. But there was limited concurrency because I too am
not willing to roll this out to our production machines to test a
production load.
I think we need to hear from the Condor team here: what's up with
Windows? Are you guys aware of these issues?
> > > not so hot - is your submit machine under heavy load as well?
> >
> > Talking with both the engineers I would say, yes. The machines were
> > probably running processes unrelated to Condor for the users and it
> > is likely the load was high. That being said, these are all dual
> > processor 1 GHz PIII machines with 2GB or more of RAM. Not state of
> > the art but certainly not under powered. The jobs are vanilla jobs
> > that transfer one small 1k file to the client when they start and
> > then transfer back ~30k worth of captured STDOUT when the jobs
> > complete.
>
> If the user constantly runs condor_q (or someone else runs condor_q
> -global) they can seriously affect the schedd.
>
> It is a vicious circle where the user goes "Why is it so slow?
> What's going on?"
> <runs condor_q>
> "That's bad! I will watch this kettle till it boils"
> <runs condor_q repeatedly>
Agreed, but this is not likely the cause in our system. In the case of
the first crash it was well before anyone was in the office.
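For sites where repeated condor_q polling does load the schedd, one possible mitigation is to put a small cache in front of the query so that impatient users get the last answer back instead of triggering a fresh query each time. The following is only a minimal sketch under stated assumptions: the class name CachedQuery, the 60 second interval, and the injectable runner and clock are all hypothetical and are not part of Condor.

```python
import subprocess
import time


class CachedQuery:
    """Run a command at most once per `min_interval` seconds.

    Calls made inside the interval return the previously captured output
    instead of running the command again. Hypothetical helper, not a
    Condor API.
    """

    def __init__(self, command, min_interval=60.0, runner=None,
                 clock=time.monotonic):
        self.command = list(command)
        self.min_interval = min_interval
        # runner and clock are injectable so the logic can be exercised
        # without a live schedd or real waiting.
        self._runner = runner or self._run
        self._clock = clock
        self._last_time = None
        self._last_output = None

    @staticmethod
    def _run(command):
        # Capture stdout as text; a non-zero exit status is not raised.
        result = subprocess.run(command, capture_output=True, text=True,
                                check=False)
        return result.stdout

    def get(self):
        """Return fresh output if the cache is stale, else the cached copy."""
        now = self._clock()
        stale = (self._last_time is None
                 or now - self._last_time >= self.min_interval)
        if stale:
            self._last_output = self._runner(self.command)
            self._last_time = now
        return self._last_output
```

A wrapper script could hold one CachedQuery(["condor_q"]) and print the result of get(); the interval is a trade-off between how fresh the listing is and how often the schedd is queried, and 60 seconds is only a placeholder.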
> > I'm with you here: our main batch system is a very simple, in-house
> > system that can handle queue loads in the tens of thousands of jobs
> > running on a single CPU 866 MHz PIII with 256 MB of RAM. Granted it
> > hasn't the complexity of Condor but we're not exploiting all the
> > interactive capabilities or file transfer capabilities of Condor.
> > The bulk of our job file transfer is handled by the job scripts we
> > run, not by Condor, and not to the machine that submitted the job.
>
> The issue is that the batch system does not need to talk to a
> central machine to be told to then talk to the execute
> machine, nor bother to repeatedly stroke the executor to keep
> it happy.
>
> I think that sufficient people run in a tightly coupled and
> dedicated environment to make it worthwhile making the
> negotiation process more pluggable to allow us to exploit
> this (making the negotiator more intelligent but the startd
> more stupid or vice versa)...
>
> that said I'll go for stability over features every time at
> the moment!
Hear, hear! Hopefully there'll be an early Christmas present from the
Condor team in the form of a 6.8.x stable branch. Fingers crossed...
Ian