HTCondor Project List Archives




Re: [Condor-devel] proposal: PREEMPT_AND_HOLD



On Mon, Mar 03, 2008 at 01:17:14PM -0600, Dan Bradley wrote:
> 
> 
> Erik Paulson wrote:
> >Not a fan.
> >
> >I'm all for relaying some sort of eviction information back to the schedd,
> >but a startd should never have any say about what's happening in the queue.
> >  
> 
> Whether it should _stay_ on hold is up to the user/schedd-controlled 
> policy.  The hold mechanism is just a well-defined state to put the job 
> in after an exception, in my view.  We already do this for other events, 
> such as failure to exec the job, missing input file, missing output file.
> 

Exceeding memory is not an exceptional event - trying it again somewhere
else could succeed, especially since our memory usage data is so out of 
touch with reality.

Failing to exec a job _could_ be an exceptional event, or it could be a 
failure specific to that machine, and would succeed somewhere else.

The missing input file will always be a failure, since Condor can't somehow
cons the file up to send it to the client. Why didn't the schedd detect this
right away before even trying to run the job?

> 
> >To use your example, just because a job runs out of virtual memory on one
> >machine doesn't mean that it wouldn't succeed on another machine, so why
> >should it go on hold?
> >  
> 
> If the user doesn't want to trigger this exceptional case, they should 
> set their memory requirements to prevent it from happening.  As a user, 
> I would rather find out that my jobs are failing and restarting than 
> have Condor silently keep trying to get the job done and perhaps waste 
> a lot of time in the process.  Suppose I set my memory requirements too 
> low: there is nothing to stop the job from getting matched back to 
> machines with too little memory over and over.
> 
> And if I were a user who for some reason wanted my jobs to ignore this 
> exception, then I could set my periodic_hold accordingly.  I just don't 
> think that should be the default.
> 
> >Now, if the schedd wants to use information from the eviction to make a 
> >scheduling decision, I'm all for that.
> >  
> 
> The hold state already provides a way to feed back information about an 
> exception and it provides policy controls, both globally in the schedd, 
> and on a per-job basis.  I can imagine an additional mechanism that is 
> similar to this but which has slightly different behavior.  Do you think 
> this is really necessary?
> 

I don't think the startd should _ever_ be able to put the job on hold.
I think the startd should return information about why a job is being
thrown off a machine - but even in a case like "failed to exec", it should
be the schedd that decides the fate of the job, with sane defaults.

-Erik
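
The division of responsibility argued for in this message could be sketched roughly as follows. This is only an illustration, not Condor code: EvictionReason, EvictionReport, Action, DEFAULT_MAX_RETRIES and decide are all hypothetical names, and the retry limits are made-up defaults. The startd side only reports why the job was thrown off a machine; the schedd side decides whether to requeue the job or hold it. A retry budget per reason is one way to also cover the worry raised in the quoted text about a job being matched to too-small machines over and over.

```python
from dataclasses import dataclass
from enum import Enum, auto


class EvictionReason(Enum):
    """Hypothetical reasons a startd might report when evicting a job."""
    EXCEEDED_MEMORY = auto()
    EXEC_FAILED = auto()
    MISSING_INPUT = auto()


class Action(Enum):
    """What the schedd does with the job after an eviction."""
    REQUEUE = auto()
    HOLD = auto()


@dataclass(frozen=True)
class EvictionReport:
    """Information only: the startd makes no decision about the queue."""
    reason: EvictionReason
    machine: str


# Assumed defaults, chosen for illustration: failures that may be specific
# to one machine get a few retries; a missing input file gets none because
# retrying elsewhere cannot make the file appear.
DEFAULT_MAX_RETRIES = {
    EvictionReason.EXCEEDED_MEMORY: 3,
    EvictionReason.EXEC_FAILED: 3,
    EvictionReason.MISSING_INPUT: 0,
}


def decide(report, prior_reports, max_retries=DEFAULT_MAX_RETRIES):
    """Schedd-side policy: requeue until the budget for this reason is spent."""
    same_reason = sum(1 for r in prior_reports if r.reason is report.reason)
    if same_reason < max_retries[report.reason]:
        return Action.REQUEUE
    return Action.HOLD
```

A user or administrator who wants different behavior would pass a different max_retries mapping, which keeps the policy entirely on the schedd side.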