
Re: [Condor-users] Adjusting machine RANK classad expr based on totalqueue time for a job





Ian Chesal wrote:

If the negotiator is simply connecting a startd with a schedd then is
there something amiss with the schedd when condor_rm is invoked? I would
have expected 44.2 to run after 44.0 whether the negotiator or the
schedd was deciding which job to hand to the startd next.



The schedd doesn't apply any knowledge about the machine RANK when deciding which job to run next on a claim, so there can be a difference in how jobs are scheduled, depending on whether the schedd can run multiple jobs per claim or only one job per claim. (Condor schedd experts please correct me if I am wrong.)


However, your test seems to be showing that a retiring claim is allowed to run another job when the existing job is removed via condor_rm. This is clearly a bug. I'll look into it.

Thanks for prying into the situation!

--Dan

Ian



-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Dan Bradley
Sent: October 27, 2004 6:16 PM
To: Condor-Users Mail List
Subject: Re: [Condor-users] Adjusting machine RANK classad expr based on totalqueue time for a job



Ian,


Could you specify which of the jobs in your various tests are being run by different users, if any? One potential point of confusion is that, by design, the Condor negotiator does not micromanage what the schedd does with a claim. Once the schedd gets a claim on behalf of a user, it will continue to run jobs on that claim until the claim is taken away or the user runs out of jobs. The negotiator doesn't tell the schedd which job to run next on the claim.

You can force renegotiation of claims after every job if you want. Something like the following policy will do this:

MaxJobRetirementTime = 1000000
WANT_SUSPEND = FALSE
PREEMPT = TRUE
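
To illustrate the distinction described above, here is a toy Python model (not Condor code). It assumes, purely for illustration, that a schedd reusing a claim takes the next idle job in queue order, and that a fresh negotiation gives the machine the idle job it ranks highest, which is what Ian expected in his test. The `Job` record and both function names are hypothetical; the timestamps are the EnteredCurrentStatus values from Ian's test quoted further down, and `now` is an arbitrary value chosen for the example.

```python
# Toy model only: real Condor scheduling involves far more than this.
from dataclasses import dataclass


@dataclass
class Job:
    job_id: str
    entered_current_status: int  # epoch seconds when the job became idle


def rank(job, now):
    # One point per full minute idle, mirroring the RANK expression in this thread.
    return (now - job.entered_current_status) // 60


def next_job_on_existing_claim(idle_jobs):
    # Assumption: the schedd ignores machine RANK and takes jobs in queue order.
    return idle_jobs[0] if idle_jobs else None


def next_job_after_renegotiation(idle_jobs, now):
    # Assumption: a fresh match favors the job the machine ranks highest.
    return max(idle_jobs, key=lambda j: rank(j, now), default=None)


idle = [Job("44.1", 1098910859), Job("44.2", 1098910279)]
now = 1098911000  # hypothetical "current time" for the example

print(next_job_on_existing_claim(idle).job_id)         # 44.1
print(next_job_after_renegotiation(idle, now).job_id)  # 44.2
```

Under these assumptions the two paths disagree, which matches the behavior reported in the test: the claim was reused and 44.1 ran, even though the machine would have ranked 44.2 higher.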

--Dan

Ian Chesal wrote:



It looks like it was my use of condor_rm that messed up my predictability. I continued the experiment, but this time I made sure the running 44.1 process finished normally instead of being prematurely terminated by condor_rm.

I had two queued jobs with their EnteredCurrentStatus times:

44.2 1098912677
44.3 1098910808

I expected 44.2 to rank lower than 44.3 by ~31. So 44.3 should be the next job picked up.

And this was the case. My rank expression worked this time.
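
The "~31" figure can be checked directly from the two timestamps. The short Python sketch below is only arithmetic on the values quoted above; the variable names are made up for the example, and the divisor of 60 comes from the RANK expression used in this test.

```python
# EnteredCurrentStatus values reported for the two idle jobs (epoch seconds).
entered = {"44.2": 1098912677, "44.3": 1098910808}

SECONDS_PER_RANK_POINT = 60  # divisor from the RANK expression

# 44.2 entered the idle state later, so it has accumulated less queue time.
gap_seconds = entered["44.2"] - entered["44.3"]
gap_points = gap_seconds / SECONDS_PER_RANK_POINT

print(gap_seconds)           # 1869
print(round(gap_points, 2))  # 31.15
```

Since CurrentTime is the same for both jobs when the machine evaluates them, the difference in rank depends only on this gap.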

Excellent.


So here's a question for the condor team: If I were a "sneaky user" I could write a job that, after processing was complete, sent me an email and then went to sleep for a long, long time. Upon receiving that email, if I used condor_rm to terminate the job, I'd be able to hang on to the resource it was using and run another job on it, even if another job, from another user, had a higher rank, because condor_rm seems to prevent the machine from re-negotiating. This would give me infinite access to a resource. Can this happen?


Ian









-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Ian Chesal
Sent: October 27, 2004 5:18 PM
To: Condor-Users Mail List
Subject: RE: [Condor-users] Adjusting machine RANK classad expr based on totalqueue time for a job

Hmm. So I went with the RANK expression:

RANK = ((TARGET.JobStatus =?= 1) * ((CurrentTime - TARGET.EnteredCurrentStatus)/60))

My plan was to make sure jobs that are queued rank higher the longer they've been in the queued state. In this case, +1 for every minute they've been sitting idle.
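
As a rough illustration of how the expression above scores a job, here is a small Python sketch. It is not how Condor evaluates ClassAds; it assumes the timestamps are integers and that the division truncates, and it treats `=?=` as a comparison that yields false (rather than undefined) when JobStatus is missing. The function name `machine_rank` and the sample times are hypothetical.

```python
IDLE = 1  # JobStatus value the RANK expression compares against


def machine_rank(job_status, entered_current_status, current_time):
    """Sketch of (JobStatus =?= 1) * ((CurrentTime - EnteredCurrentStatus)/60)."""
    # A missing (None) or non-idle status contributes 0, so the rank is 0.
    is_idle = 1 if job_status == IDLE else 0
    # Assumption: integer division, i.e. one point per full minute idle.
    return is_idle * ((current_time - entered_current_status) // 60)


now = 1_000_000  # hypothetical current time in epoch seconds

print(machine_rank(IDLE, now - 8 * 60, now))  # 8: idle for eight minutes
print(machine_rank(2, now - 8 * 60, now))     # 0: status is not idle
print(machine_rank(IDLE, now - 59, now))      # 0: less than a full minute
```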

To test this I submitted some jobs in the held state. The jobs are simple: go to the machine and sleep for an hour.

I released three of the held jobs. My machine immediately picked up 44.0 from the cluster and started running.

I let the other two released jobs build up some queue time while 44.0 slept on a machine. At one point I did see condor_status show my 44.0 as being in the "Retiring" state instead of the "Busy" state -- that is good news. We have a long MaxJobRetirementTime so this is expected.

I let about 8 minutes elapse and then issued the commands:

condor_hold 44.1
condor_release 44.1

So this reset the EnteredCurrentStatus time on 44.1. I now have 44.0 running, but retiring, and the remaining two jobs each have EnteredCurrentStatus as follows:

44.1 1098910859
44.2 1098910279

By this output I expect 44.2 to have the higher rank. 44.0 is still running so I removed it with:

condor_rm 44.0

I expected the machine to pick up 44.2 as the next job because its rank is higher, having been queued for a longer time than 44.1.

Not so. The machine picked up 44.1. I'm the only user in the system so it's not a matter of EUP. What's up? Why is it 44.2 didn't rank higher? Can anyone see how I messed up my prediction for next job to run? I'm stumped. I thought I had it all figured out.

Thanks!

Ian





-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Ian Chesal
Sent: October 27, 2004 11:34 AM
To: Condor-Users Mail List
Subject: [Condor-users] Adjusting machine RANK classad expr based on totalqueue time for a job

I'm toying with adjusting the RANK expression to achieve a more FIFO-like consideration when condor runs jobs. The idea is to rank jobs on machines based on their time in the queue.
I wanted to bounce the rank expression and idea off the list. The rank expression for machines I'm thinking of using is:


RANK = ((TARGET.JobStatus =?= 1) * ((CurrentTime - TARGET.EnteredCurrentStatus)/600))

This would give a job queued 10 minutes longer than another job a higher rank on the machine.

The other option is:

RANK = ((CurrentTime - TARGET.QDate)/600)

But this would track cumulative queue time (so if the job queued, ran for a bit, then got sent back to the queue), right? Or is QDate reset every time a job returns to the queue, not just the first time it's queued up by condor_submit?
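
To make the difference between the two candidate expressions concrete, here is a small Python sketch. It assumes that QDate is set once at submission and never reset, which is exactly the open question above and is not settled in this message. The timeline values and function names are hypothetical, and integer division is assumed.

```python
DIVISOR = 600  # seconds per rank point, as in both candidate expressions
IDLE = 1       # JobStatus value for an idle (queued) job


def rank_by_current_status(now, job_status, entered_current_status):
    # Counts only the time since the job last entered the idle state.
    is_idle = 1 if job_status == IDLE else 0
    return is_idle * ((now - entered_current_status) // DIVISOR)


def rank_by_qdate(now, qdate):
    # Assumption: qdate is fixed at submit time, so this counts total time
    # since submission, including any time the job spent running.
    return (now - qdate) // DIVISOR


# Hypothetical timeline: submitted at t=0, ran for a while, was sent back
# to the queue at t=3000, and is evaluated by the machine at t=3600.
qdate = 0
requeued_at = 3000
now = 3600

print(rank_by_current_status(now, IDLE, requeued_at))  # 1
print(rank_by_qdate(now, qdate))                       # 6
```

If the assumption holds, the QDate form favors jobs that were submitted long ago even if they have already had time on a machine, while the EnteredCurrentStatus form only rewards the current stretch of waiting.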

Comments? Opinions? Much appreciated.

Ian

_______________________________________________
Condor-users mailing list
Condor-users@xxxxxxxxxxx
http://lists.cs.wisc.edu/mailman/listinfo/condor-users




