Erik Paulson wrote:
> On Mon, Mar 03, 2008 at 01:17:14PM -0600, Dan Bradley wrote:
>> Whether it should _stay_ on hold is up to the user/schedd-controlled policy. The hold mechanism is just a well-defined state to put the job in after an exception, in my view. We already do this for other events, such as failure to exec the job, missing input file, missing output file.
>
> Exceeding memory is not an exceptional event - trying it again somewhere else could succeed, especially since our memory usage data is so out of touch with reality.

Exceeding the amount of memory that the user specified in their memory requirements is an exceptional event, in my opinion. We have no reason to believe that matchmaking will do a better job of putting the job on a bigger memory machine. It could just loop forever. Therefore, it is very likely a malformed job.

>> The hold state already provides a way to feed back information about an exception and it provides policy controls, both globally in the schedd, and on a per-job basis. I can imagine an additional mechanism that is similar to this but which has slightly different behavior. Do you think this is really necessary?
>
> I don't think the startd should _ever_ be able to put the job on hold. I think the startd should return information about why a job is being thrown off a machine - but even in the case like "failed to exec", it should be the schedd that decides the fate of the job, with sane defaults.

I agree with this in principle, but in practice, if the sane defaults are in fact to put the job on hold and then use the hold policy to decide what to do, then it comes down to an implementation detail.
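
To make the "implementation detail" point concrete, here is a rough sketch in Python. It is only an illustration under my own assumptions: ExitReason, Job, schedd_handle_eviction, and the two policy functions are made-up names, not Condor code or configuration. The startd only reports a reason; the schedd applies the default (hold) and then consults a hold policy to decide whether the job goes back to idle.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Callable, Optional


class ExitReason(Enum):
    # Hypothetical reasons a startd might report; not Condor's real codes.
    FAILED_TO_EXEC = auto()
    MISSING_INPUT = auto()
    MEMORY_EXCEEDED = auto()


class JobState(Enum):
    IDLE = auto()
    HELD = auto()


@dataclass
class Job:
    state: JobState = JobState.IDLE
    hold_reason: Optional[ExitReason] = None
    num_holds: int = 0


# A hold policy inspects a held job and returns True to release it.
HoldPolicy = Callable[[Job], bool]


def never_release(job: Job) -> bool:
    """Default policy: a held job stays held until someone intervenes."""
    return False


def retry_memory_twice(job: Job) -> bool:
    """Example per-job policy: retry memory evictions, but only twice."""
    return job.hold_reason is ExitReason.MEMORY_EXCEEDED and job.num_holds <= 2


def schedd_handle_eviction(
    job: Job, reason: ExitReason, policy: HoldPolicy = never_release
) -> JobState:
    """The startd reports `reason`; the schedd alone decides the job's fate."""
    # Sane default: record the reason and put the job on hold.
    job.state = JobState.HELD
    job.hold_reason = reason
    job.num_holds += 1
    # The hold policy then decides whether the job should stay there.
    if policy(job):
        job.state = JobState.IDLE
    return job.state
```

With retry_memory_twice, a job evicted for memory goes back to idle after the first and second eviction and stays held after the third, so it cannot loop forever. With the default policy it simply stays on hold.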
--Dan