Mailing List Archives Authenticated access	UW Madison Computer Sciences Department Computer Systems Lab


Re: [Condor-users] Adjusting machine RANK classad expr based on totalqueue time for a job

Date: Thu, 28 Oct 2004 10:13:05 -0500
From: Dan Bradley <dan@xxxxxxxxxxxx>
Subject: Re: [Condor-users] Adjusting machine RANK classad expr based on totalqueue time for a job

Ian Chesal wrote:

If the negotiator is simply connecting a startd with a schedd, then is there something amiss with the schedd when condor_rm is invoked? I would have expected 44.2 to run after 44.0 whether the negotiator or the schedd was deciding which job to hand to the startd next.

The schedd doesn't apply any knowledge about the machine RANK when deciding which job to run next on a claim, so there can be a difference in how jobs are scheduled, depending on whether the schedd can run multiple jobs per claim or only one job per claim. (Condor schedd experts please correct me if I am wrong.)

However, your test seems to be showing that a retiring claim is allowed to run another job when the existing job is removed via condor_rm. This is clearly a bug. I'll look into it.

Thanks for prying into the situation!

--Dan

Ian
-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Dan Bradley
Sent: October 27, 2004 6:16 PM
To: Condor-Users Mail List
Subject: Re: [Condor-users] Adjusting machine RANK classad expr based on totalqueue time for a job

Ian,

Could you specify which of the jobs in your various tests are being run by different users, if any? One potential point of confusion is that, by design, the Condor negotiator does not micromanage what the schedd does with a claim. Once the schedd gets a claim on behalf of a user, it will continue to run jobs on that claim until the claim is taken away or the user runs out of jobs. The negotiator doesn't tell the schedd which job to run next on the claim.
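To make the distinction concrete, here is a toy Python sketch contrasting the order Ian expected from the machine RANK expression with a schedd that reuses a claim without looking at RANK. Everything in it is an assumption for illustration: the `Job` class, both function names, the `now` value, and above all the idea that the schedd simply takes the owner's next idle job in queue order. The two EnteredCurrentStatus values are the ones Ian reported for 44.1 and 44.2 in his earlier test.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Job:
    job_id: str
    owner: str
    entered_current_status: int  # seconds since the epoch


def pick_by_machine_rank(idle_jobs: List[Job], now: int) -> Job:
    """Order Ian expected: the job idle the longest wins (+1 per minute)."""
    return max(idle_jobs, key=lambda j: (now - j.entered_current_status) // 60)


def pick_next_on_claim(idle_jobs: List[Job], claim_owner: str) -> Optional[Job]:
    """Toy schedd: reuse the claim for the owner's next idle job.

    ASSUMPTION: plain queue order; machine RANK is never consulted.
    """
    for job in idle_jobs:
        if job.owner == claim_owner:
            return job
    return None


# Hypothetical owner name; timestamps are from Ian's hold/release test.
queue = [
    Job("44.1", "ian", 1098910859),
    Job("44.2", "ian", 1098910279),
]
now = 1098911400  # arbitrary moment after both jobs became idle

expected = pick_by_machine_rank(queue, now)
observed = pick_next_on_claim(queue, "ian")
print(expected.job_id, observed.job_id)
```

Under these assumptions the rank-based choice is 44.2 while the claim-reuse choice is 44.1, which is the mismatch Ian observed. The sketch does not model the negotiator itself.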

You can force renegotiation of claims after every job if you want. Something like the following policy will do this:
MaxJobRetirementTime = 1000000
WANT_SUSPEND = FALSE
PREEMPT = TRUE
--Dan

Ian Chesal wrote:

It looks like it was my use of condor_rm that messed up my predictability. I continued the experiment, but this time I made sure the running 44.1 process finished normally instead of being prematurely terminated by condor_rm.

I had two queued jobs with their EnteredCurrentStatus times:
44.2 1098912677
44.3 1098910808
I expected 44.2 to rank lower than 44.3 by ~31, so 44.3 should be the next job picked up.

And this was the case. My rank expression worked this time.
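The ~31 figure follows from the two timestamps and the per-minute divisor in the RANK expression. A minimal Python sketch of that arithmetic (the variable names are illustrative, not Condor attributes):

```python
# EnteredCurrentStatus values quoted above (seconds since the epoch).
entered_44_2 = 1098912677
entered_44_3 = 1098910808

# 44.3 entered the idle state earlier, so it has been waiting longer.
gap_seconds = entered_44_2 - entered_44_3

# The RANK expression under test awards one point per 60 seconds idle.
gap_rank_points = gap_seconds / 60

print(gap_seconds, gap_rank_points)
```

The gap is 1869 seconds, or roughly 31 rank points in favour of 44.3.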

Excellent.

So here's a question for the Condor team: if I were a "sneaky user", I could write a job that, after processing was complete, sent me an email and then went to sleep for a long, long time. Upon receiving that email, if I used condor_rm to terminate the job, I'd be able to hang on to the resource it was using and run another job on it, even if another job, from another user, had a higher rank, because condor_rm seems to prevent the machine from renegotiating. This would give me infinite access to a resource. Can this happen?

Ian

-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Ian Chesal
Sent: October 27, 2004 5:18 PM
To: Condor-Users Mail List
Subject: RE: [Condor-users] Adjusting machine RANK classad expr based on totalqueue time for a job

Hmm. So I went with the RANK expression:
RANK = ((TARGET.JobStatus =?= 1) * ((CurrentTime - TARGET.EnteredCurrentStatus)/60))
My plan was to make sure jobs that are queued rank higher the longer they've been in the queued state. In this case, +1 for every minute they've been sitting idle.

To test this I submitted some jobs in the held state. The jobs are simple: go to the machine and sleep for an hour.

I released three of the held jobs. My machine immediately picked up 44.0 from the cluster and started running.

I let the other two released jobs build up some queue time while 44.0 slept on a machine. At one point I did see condor_status show my 44.0 as being in the "Retiring" state instead of the "Busy" state -- that is good news. We have a long MaxJobRetirementTime so this is expected.

I let about 8 minutes elapse and then issued the commands:
condor_hold 44.1
condor_release 44.1
This reset the EnteredCurrentStatus time on 44.1. I now have 44.0 running, but retiring, and the remaining two jobs each have EnteredCurrentStatus as follows:
44.1 1098910859
44.2 1098910279
By this output I expect 44.2 to have the higher rank. 44.0 is still running so I removed it with:

condor_rm 44.0

I expected the machine to pick up 44.2 as the next job because its rank is higher, having been queued for a longer time than 44.1.

Not so. The machine picked up 44.1. I'm the only user in the system so it's not a matter of EUP. What's up? Why is it 44.2 didn't rank higher? Can anyone see how I messed up my prediction for the next job to run? I'm stumped. I thought I had it all figured out.

Thanks!

Ian

-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Ian Chesal
Sent: October 27, 2004 11:34 AM
To: Condor-Users Mail List
Subject: [Condor-users] Adjusting machine RANK classad expr based on totalqueue time for a job

I'm toying with adjusting the RANK expression to achieve a more FIFO-like consideration when Condor runs jobs. The idea is to rank jobs on machines based on their time in the queue. I wanted to bounce the rank expression and idea off the list. The rank expression for machines I'm thinking of using is:
RANK = ((TARGET.JobStatus =?= 1) * ((CurrentTime - TARGET.EnteredCurrentStatus)/600))
This would give a job queued 10 minutes longer than another job a higher rank on the machine.
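As a rough illustration of what that expression computes, here is a small Python model. It is only a sketch under stated assumptions: `machine_rank` and its parameter names are invented for this example, JobStatus 1 is taken to mean Idle, the `=?=` comparison is modelled as an exact match that is false for any other or missing value, and the division is assumed to be integer division because both operands are integers.

```python
from typing import Optional

IDLE = 1  # JobStatus value for an idle (queued) job


def machine_rank(job_status: Optional[int],
                 entered_current_status: int,
                 current_time: int,
                 seconds_per_point: int = 600) -> int:
    """Toy model of (JobStatus =?= 1) * ((CurrentTime - EnteredCurrentStatus)/600)."""
    # The =?= test acts as a 0/1 gate: anything other than Idle ranks 0.
    if job_status != IDLE:
        return 0
    # ASSUMPTION: integer operands divide as integers.
    return (current_time - entered_current_status) // seconds_per_point


now = 1_000_000  # arbitrary reference time for the example
older = machine_rank(IDLE, now - 1200, now)   # idle for 20 minutes
newer = machine_rank(IDLE, now - 600, now)    # idle for 10 minutes
running = machine_rank(2, now - 1200, now)    # not idle, so gated to 0
print(older, newer, running)
```

With these inputs the job that has been idle ten minutes longer ranks one point higher, and a job that is not idle contributes nothing.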

The other option is:

RANK = ((CurrentTime - TARGET.QDate)/600)

But this would track cumulative queue time (so if the job queued, ran for a bit, then got sent back to the queue), right? Or is QDate reset every time a job returns to the queue, not just the first time it's queued up by condor_submit?

Comments? Opinions? Much appreciated.

Ian
_______________________________________________
Condor-users mailing list
Condor-users@xxxxxxxxxxx
http://lists.cs.wisc.edu/mailman/listinfo/condor-users
