Re: [Condor-devel] proposal: PREEMPT_AND_HOLD
- Date: Mon, 3 Mar 2008 13:33:54 -0600
- From: Erik Paulson <epaulson@xxxxxxxxxxx>
- Subject: Re: [Condor-devel] proposal: PREEMPT_AND_HOLD
On Mon, Mar 03, 2008 at 01:17:14PM -0600, Dan Bradley wrote:
>
>
> Erik Paulson wrote:
> >Not a fan.
> >
> >I'm all for relaying some sort of eviction information back to the schedd,
> >but a startd should never have any say about what's happening in the queue.
> >
>
> Whether it should _stay_ on hold is up to the user/schedd-controlled
> policy. The hold mechanism is just a well-defined state to put the job
> in after an exception, in my view. We already do this for other events,
> such as failure to exec the job, missing input file, missing output file.
>
Exceeding memory is not an exceptional event - trying it again somewhere
else could succeed, especially since our memory usage data is so out of
touch with reality.

Failing to exec a job _could_ be an exceptional event, or it could be a
failure specific to that machine, and would succeed somewhere else.

The missing input file will always be a failure, since Condor can't somehow
cons the file up to send it to the client. Why didn't the schedd detect this
right away before even trying to run the job?
>
> >To use your example, just because a job runs out of virtual memory on one
> >machine doesn't mean that it wouldn't succeed on another machine, so why
> >should it go on hold?
> >
>
> If the user doesn't want to trigger this exceptional case, they should
> set their memory requirements to prevent it from happening. I as a user
> would rather find out that my jobs are failing and restarting rather
> than have condor try to silently get the job done and perhaps waste a
> lot of time in the process. Suppose I set my memory requirements too
> low: there is nothing to stop the job from getting matched back to
> machines with too little memory over and over.
>
> And if I was a user who for some reason wanted my jobs to ignore this
> exception, then I could set my periodic_hold accordingly. I just don't
> think that should be the default.
>
> >Now, if the schedd wants to use information from the eviction to make a
> >scheduling decision, I'm all for that.
> >
>
> The hold state already provides a way to feed back information about an
> exception and it provides policy controls, both globally in the schedd,
> and on a per-job basis. I can imagine an additional mechanism that is
> similar to this but which has slightly different behavior. Do you think
> this is really necessary?
>
I don't think the startd should _ever_ be able to put the job on hold.
I think the startd should return information about why a job is being
thrown off a machine - but even in a case like "failed to exec", it should
be the schedd that decides the fate of the job, with sane defaults.
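
A rough sketch of the split I mean, in Python rather than anything
resembling actual Condor code. Every name here (EvictReason,
EvictionReport, decide_fate, the max_retries cap) is made up for
illustration and is not an existing Condor interface. The point is only
that the startd reports a reason, and the schedd maps that reason to an
action using defaults that could be overridden.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional


class EvictReason(Enum):
    """Hypothetical reasons a startd might report when evicting a job."""
    MEMORY_EXCEEDED = "memory_exceeded"
    EXEC_FAILED = "exec_failed"
    MISSING_INPUT = "missing_input"


class Action(Enum):
    """What the schedd does with the job afterwards."""
    REQUEUE = "requeue"  # try again, possibly on another machine
    HOLD = "hold"        # stop and wait for the user


@dataclass
class EvictionReport:
    """Information only: the startd says what happened, not what to do."""
    job_id: str
    machine: str
    reason: EvictReason


# Assumed defaults: failures that may be machine-specific are retried,
# failures that cannot succeed anywhere go on hold.
DEFAULT_POLICY: Dict[EvictReason, Action] = {
    EvictReason.MEMORY_EXCEEDED: Action.REQUEUE,
    EvictReason.EXEC_FAILED: Action.REQUEUE,
    EvictReason.MISSING_INPUT: Action.HOLD,
}


def decide_fate(
    report: EvictionReport,
    prior_evictions: int,
    policy: Optional[Dict[EvictReason, Action]] = None,
    max_retries: int = 3,
) -> Action:
    """Schedd-side decision. `policy` entries override the defaults."""
    effective = {**DEFAULT_POLICY, **(policy or {})}
    action = effective[report.reason]
    # Assumed safeguard: stop rematching a job that keeps getting evicted.
    if action is Action.REQUEUE and prior_evictions >= max_retries:
        return Action.HOLD
    return action
```

With something like this, a site or user that does want "hold on memory
exceeded" passes policy={EvictReason.MEMORY_EXCEEDED: Action.HOLD}, and
the startd never touches the queue either way.
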
-Erik