HTCondor Project List Archives




Re: [Condor-devel] proposal: PREEMPT_AND_HOLD





Erik Paulson wrote:
Not a fan.

I'm all for relaying some sort of eviction information back to the schedd,
but a startd should never have any say about what's happening in the queue.

Whether it should _stay_ on hold is up to policy controlled by the user and the schedd. In my view, hold is just a well-defined state to put the job in after an exception. We already do this for other events, such as failure to exec the job, a missing input file, or a missing output file.


To use your example, just because a job runs out of virtual memory on one
machine doesn't mean that it wouldn't succeed on another machine, so why
should it go on hold?

If the user doesn't want to trigger this exceptional case, they should set their memory requirements to prevent it from happening. As a user, I would rather find out that my jobs are failing and restarting than have Condor silently try to get the job done and perhaps waste a lot of time in the process. Suppose I set my memory requirements too low: nothing stops the job from being matched to machines with too little memory over and over.

And if I were a user who for some reason wanted my jobs to ignore this exception, I could set my periodic_hold accordingly. I just don't think that should be the default.
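
The rematching concern above can be sketched as a toy model in Python. This is only an illustration under stated assumptions, not Condor code: Machine, run_job, hold_on_eviction, and max_matches are hypothetical names, the matchmaker is assumed to see only the requested memory, and candidate machines are tried round-robin.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    # Hypothetical stand-in for an execute machine; not a Condor type.
    name: str
    memory_mb: int


def run_job(requested_mb, actual_mb, machines, hold_on_eviction, max_matches=10):
    """Toy model: return (final_state, wasted_runs) for one job."""
    wasted = 0
    for attempt in range(max_matches):
        # Assumption: matching only looks at what the job *requested*.
        candidates = [m for m in machines if m.memory_mb >= requested_mb]
        if not candidates:
            return ("idle", wasted)
        machine = candidates[attempt % len(candidates)]
        if machine.memory_mb >= actual_mb:
            return ("completed", wasted)
        # The job needed more memory than the machine had: evicted.
        wasted += 1
        if hold_on_eviction:
            # The exception is surfaced; user or schedd policy decides next.
            return ("held", wasted)
    # Gave up after max_matches attempts; the job is still in the queue.
    return ("idle", wasted)


small_pool = [Machine("a", 1024), Machine("b", 2048)]
mixed_pool = small_pool + [Machine("c", 8192)]
```

With small_pool, a job that requests 512 MB but actually needs 4096 MB burns every match attempt when it is not held, and stops after one wasted run when it is. With mixed_pool the same job eventually completes without a hold, after two wasted runs in this model, which is the case Erik describes.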

Now, if the schedd wants to use information from the eviction to make a scheduling decision, I'm all for that.

The hold state already provides a way to feed back information about an exception, and it provides policy controls both globally in the schedd and on a per-job basis. I can imagine an additional mechanism that is similar to this but has slightly different behavior. Do you think such a mechanism is really necessary?

--Dan