I agree with this; it's a problem I've wrestled with in Torque for years (where the problem is even worse, as Torque itself gets into trouble when users do this). Christoph's suggestion is, in my opinion, in the right direction, although 3h is too extreme for me. I'd suggest something more like a few minutes (somewhere between 2 and 10) as a baseline.

What I would not like to do is penalise all users for the clumsy behaviour of a few. The way to avoid that is to link the user's allowed start rate to the completion rate for that same user. If a user is submitting hundreds of jobs that all take 3 hours, it is fine if they ramp up quickly. On the other hand, a hundred few-second jobs generate a high completion rate (if the start rate is high). Another issue to take into account is that a high start rate can put pressure on other systems, like shared file systems.

What comes to mind:

1. A baseline start rate, something like 1 Hz = 60 jobs per minute?
2. A desired minimum duration, say two minutes
3. This defines a maximum desired completion rate = N_run / 120 sec, where N_run is the number of that user's jobs currently running

When that completion rate is exceeded, one disables new job starts from that user until the completion rate drops below the threshold.

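The rule above can be sketched in a few lines of Python. This is only an illustration, not an existing Torque or scheduler feature: the function names `allowed_completion_rate` and `may_start_job` are made up for this example, and how the per-user completion rate is measured (for instance over a sliding window) is left open.

```python
def allowed_completion_rate(n_running: int, min_duration_s: float = 120.0) -> float:
    """Maximum desired completion rate in Hz: N_run / desired minimum duration."""
    return n_running / min_duration_s


def may_start_job(n_running: int, completion_rate_hz: float,
                  min_duration_s: float = 120.0) -> bool:
    """Return True while the user's measured completion rate does not
    exceed the allowed rate; False means new starts are disabled."""
    return completion_rate_hz <= allowed_completion_rate(n_running, min_duration_s)
```

With the defaults, a user with 120 running jobs completing at 1 Hz is still allowed to start jobs, while a user with a single running job completing at 1 Hz is blocked until the measured rate drops.
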
You can calculate what the steady-state population is based on the job run times:

  1 second jobs  : only one job running at a time, completion rate 1 Hz, allowed completion rate 1 / 120 s = 0.008 Hz
  2 second jobs  : two jobs running simultaneously, completion rate 1 Hz, allowed completion rate 2 / 120 s = 0.017 Hz
  [ ... ]
  60 second jobs : 60 jobs running simultaneously, completion rate 1 Hz, allowed completion rate 60 / 120 s = 0.5 Hz
  2 minute jobs  : 120 jobs running simultaneously, completion rate 1 Hz, allowed completion rate 120 / 120 s = 1 Hz
  4 minute jobs  : 240 jobs running simultaneously, completion rate 1 Hz, allowed completion rate 240 / 120 s = 2 Hz

Given an infinite number of jobs in the queue and a 1 Hz start rate (with enough free cores to sustain it), the completion rate will be zero at the start; one job duration later it will rise to 1 Hz and stay there. If the jobs take less than two minutes, the computed maximum desired completion rate (item 3 above) will be lower than 1 Hz and new job starts will be disabled for part of the time. The shorter the jobs, the more frequent/aggressive this disabling is. For jobs taking more than two minutes, the allowed rate is higher than the 1 Hz start rate, so effectively there will be no limit.

I like this sort of approach because of the proportionality: a job with a duration just under 2 minutes is barely affected, while jobs with runtimes of seconds are severely limited.

Another thing that comes to mind is to allow users to make the baseline start rate smaller than 1 Hz. Sometimes they know that the job startup process is heavy and that you don't want too many starting at the same time, so this puts the power into their hands to achieve that.

JT
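
For reference, the steady-state numbers listed above can be reproduced with a short Python sketch. It assumes an unthrottled, constant start rate and identical job durations; the function name `steady_state` is hypothetical and only serves this example.

```python
def steady_state(job_duration_s: float, start_rate_hz: float = 1.0,
                 min_duration_s: float = 120.0):
    """Return (jobs running, completion rate in Hz, allowed completion rate in Hz)
    once the system has settled, for jobs that all run for job_duration_s."""
    n_running = start_rate_hz * job_duration_s   # jobs started during one job lifetime
    completion_rate = start_rate_hz              # in steady state, jobs finish as fast as they start
    allowed = n_running / min_duration_s         # N_run / desired minimum duration
    return n_running, completion_rate, allowed


for duration in (1, 2, 60, 120, 240):
    n, done, allowed = steady_state(duration)
    state = "throttled" if done > allowed else "not throttled"
    print(f"{duration:>4} s jobs: {n:.0f} running, completes at {done:.0f} Hz, "
          f"allowed {allowed:.3f} Hz ({state})")
```

Jobs shorter than the two-minute minimum come out as throttled, and the gap between completion rate and allowed rate grows as the jobs get shorter.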