Re: [HTCondor-users] manipulate ranking/priority of very-short-jobs-users

Mailing List Archives Authenticated access	UW Madison Computer Sciences Department Computer Systems Lab

Date: Wed, 16 Aug 2023 14:20:42 +0200

From: Jeff Templon <templon@xxxxxxxxx>

Subject: Re: [HTCondor-users] manipulate ranking/priority of very-short-jobs-users

I agree with this, it's a problem I've wrestled with in Torque for years (where the problem is even worse, as Torque itself gets in trouble when users do this).

Christoph's suggestion is in my opinion in the right direction, although 3h is for me too extreme; I'd suggest something more like a few (somewhere between 2 and 10) minutes as a baseline. What I would not like to do is to penalise all users for the clumsy behaviour of a few. The way to do this is to link the user's allowed start rate to the completion rate for that same user.

If a user is submitting hundreds of jobs that all take 3 hours, fine if they ramp up quickly.

On the other hand, a hundred few-second jobs generate a high completion rate (if the start rate is high).

Another issue to take into account is that a high start rate can put pressure on other systems, like shared file systems.

What comes to mind:

1. A baseline start rate, something like 1 Hz = 60 jobs per minute?

2. A desired minimum duration, say two minutes

3. This defines a maximum desired completion rate = N_run / 120 sec

When that completion rate is exceeded, one disables new job starts from that user until the completion rate drops below the threshold.
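The rule in points 1-3 could be sketched roughly as follows. This is only an illustration in Python, not an existing HTCondor feature: the function names are hypothetical, and measuring the observed completion rate over a sliding window is an assumption, since no measurement interval is specified above.

```python
def max_completion_rate(n_running: int, min_duration_s: float = 120.0) -> float:
    """Maximum desired completion rate in Hz (point 3): N_run / minimum duration."""
    return n_running / min_duration_s


def starts_allowed(completions_in_window: int, window_s: float,
                   n_running: int, min_duration_s: float = 120.0) -> bool:
    """Return True if this user may start new jobs right now.

    The observed rate is the number of completions seen in the last
    window_s seconds (the window itself is an assumption of this sketch).
    Starts are disabled only while the observed rate exceeds the threshold.
    """
    observed_rate_hz = completions_in_window / window_s
    return observed_rate_hz <= max_completion_rate(n_running, min_duration_s)
```

For example, a user with 60 running jobs who completed 60 jobs in the last 60 seconds is at 1 Hz against an allowed 0.5 Hz, so `starts_allowed(60, 60.0, 60)` returns `False` until the rate drops.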

You can calculate what a steady-state population is based on the job run times:

1 second job : only one job running at a time, completion rate 1 Hz, allowed completion rate 1 / 120 s = 0.008 Hz

2 second job : two jobs running simultaneously, completion rate 1 Hz, allowed completion rate 2 / 120 s = 0.02 Hz

[ ... ]

60 second jobs : 60 jobs running simultaneously, completion rate 1 Hz, allowed completion rate 60 / 120 s = 0.5 Hz

2 minute jobs : 120 jobs running simultaneously, completion rate 1 Hz, allowed completion rate 120 / 120 s = 1 Hz

4 minute jobs : 240 jobs running simultaneously, completion rate 1 Hz, allowed rate 2 Hz
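The numbers in this list can be reproduced with a few lines of Python. This is a sketch under the same assumptions as the list (1 Hz start rate, 120 s minimum duration, unthrottled steady state); the constant and function names are made up for illustration.

```python
START_RATE_HZ = 1.0      # baseline start rate (point 1)
MIN_DURATION_S = 120.0   # desired minimum duration (point 2)


def steady_state(job_duration_s: float):
    """Return (jobs running, completion rate in Hz, allowed completion rate in Hz)."""
    n_running = START_RATE_HZ * job_duration_s   # population = start rate * duration
    completion_rate = START_RATE_HZ              # in steady state, completions match starts
    allowed_rate = n_running / MIN_DURATION_S    # point 3: N_run / 120 s
    return n_running, completion_rate, allowed_rate


for duration in (1, 2, 60, 120, 240):
    n, rate, allowed = steady_state(duration)
    print(f"{duration:>4} s jobs: {n:>4.0f} running, "
          f"completion rate {rate:.1f} Hz, allowed {allowed:.3f} Hz")
```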

Given an infinite number of jobs in the queue and a 1 Hz start rate (enough free cores to do this) then the completion rate will be zero at the start, one job duration later it will rise to 1 Hz and stay there. If the jobs take less than two minutes, the computed maximum desired completion rate (3) will be lower than 1 Hz and new job starts will be disabled for part of the time. The shorter the jobs, the more frequent/aggressive this disabling is. For jobs taking more than two minutes, the allowed rate is higher than the 1 Hz start rate so effectively there will be no limit.
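To get a feeling for this behaviour, here is a toy simulation in Python. It is a sketch with several assumptions that are not part of the proposal itself: time advances in 1-second ticks, the completion rate is measured over a 60-second sliding window, all jobs of a user have the same duration, and the name `simulate` is hypothetical.

```python
from collections import deque


def simulate(job_duration_s: int, total_s: int = 3600,
             min_duration_s: float = 120.0, window_s: int = 60) -> float:
    """Return the fraction of 1-second ticks on which a job start was allowed."""
    end_times = deque()     # finish times of running jobs, oldest first
    completions = deque()   # finish times that are still inside the window
    starts = 0
    for t in range(total_s):
        # Decide using the state left behind by the previous tick.
        observed_hz = len(completions) / window_s
        allowed_hz = len(end_times) / min_duration_s
        if observed_hz <= allowed_hz:
            # Baseline start rate of 1 Hz: at most one start per tick.
            end_times.append(t + job_duration_s)
            starts += 1
        # Retire jobs that have finished by the end of this tick.
        while end_times and end_times[0] <= t:
            completions.append(end_times.popleft())
        # Forget completions that fell out of the measurement window.
        while completions and completions[0] <= t - window_s:
            completions.popleft()
    return starts / total_s


for duration in (1, 60, 120, 240):
    print(duration, "s jobs: starts allowed on",
          f"{simulate(duration):.0%} of ticks")
```

In this toy model, jobs of two minutes or longer are never blocked, while 1-second jobs are allowed to start on only a few percent of the ticks. The exact fractions depend on the window length and tick ordering chosen here.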

I like this sort of approach because of the proportionality - a job with duration just under 2 minutes barely has an effect, while jobs with runtimes of a few seconds are severely limited.

Another thing that comes to mind is to allow users to make the baseline start rate smaller than 1 Hz: sometimes they know that the job startup process is heavy and that you don't want too many at the same time - you put the power into their hands to achieve this.
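A user-settable start rate could be as simple as the following Python sketch. The names are hypothetical, and the assumption is that a user may only lower the rate, never raise it above the site baseline.

```python
from typing import Optional


def effective_start_rate(site_baseline_hz: float = 1.0,
                         user_requested_hz: Optional[float] = None) -> float:
    """Return the start rate to apply for one user, in Hz."""
    if user_requested_hz is None:
        return site_baseline_hz          # user expressed no preference
    if user_requested_hz <= 0:
        raise ValueError("requested start rate must be positive")
    # Users can slow themselves down but cannot exceed the site baseline.
    return min(site_baseline_hz, user_requested_hz)
```

A user with a heavy job startup could then ask for, say, 0.1 Hz (one start every ten seconds), while a request for 5 Hz would still be capped at the 1 Hz baseline.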

On 16 Aug 2023, at 12:42, MIRON LIVNY via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:

Yes! We need more per user/owner controls.

What else?

Miron.

Sent from my iPhone

On Aug 16, 2023, at 11:27, Beyer, Christoph <christoph.beyer@xxxxxxx> wrote:

Hi,

For a long time we have been dealing with people who send ultra-short jobs with a runtime of a couple of seconds (I assume they are looking at one ntuple or something similar).

These jobs probably cause more overhead computing time than goodput and lead to shared FS trouble every now and then. I would like to educate the users and make very short jobs unattractive.

I was looking for something like 'max_jobstarts_per_user' (which does not exist) or a way to alter the slotweight and the priority by counting every jobstart like a full 3h running job in order to give these people a really bad priority, but so far I did not find a proper way to do so :(
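One way to read "counting every jobstart like a full 3h running job" is as a floor on the usage charged per job. The Python sketch below only illustrates that arithmetic; it is not an HTCondor setting, the function and parameter names are hypothetical, and using `max()` so that longer jobs are charged their real runtime is an assumption.

```python
def charged_seconds(runtimes_s, floor_s: float = 3 * 3600) -> float:
    """Sum the usage charged to a user when every job start costs at least floor_s."""
    return sum(max(runtime, floor_s) for runtime in runtimes_s)


# Three 5-second jobs are charged like three 3-hour jobs: 3 * 10800 s.
print(charged_seconds([5, 5, 5]))
# A 4-hour job is charged its real runtime.
print(charged_seconds([4 * 3600]))
```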

Any ideas, anybody?

Best
christoph

--
Christoph Beyer
DESY Hamburg
IT-Department

Notkestr. 85
Building 02b, Room 009
22607 Hamburg

phone:+49-(0)40-8998-2317
mail: christoph.beyer@xxxxxxx
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/
