Re: [Condor-users] Understanding Condor's "Claimed" state
- Date: Wed, 29 Jun 2011 15:45:08 -0500 (CDT)
- From: Steven Timm <timm@xxxxxxxx>
- Subject: Re: [Condor-users] Understanding Condor's "Claimed" state
On Wed, 29 Jun 2011, Jeff Ramnani wrote:
Hello,
I have a Condor pool with 10 dedicated compute nodes, and I'm having an issue
getting people's jobs scheduled the way I want. Here's what's happening.
If user1 has submitted a large batch of jobs, then jobs submitted by other users
afterwards aren't getting scheduled until the first user's jobs are completed,
even if the users who came later have better priorities.
One difference I've seen when this happens is that user1, who submits the
large batch of jobs, does so as one job cluster containing many jobs (in
this example, let's say 100 jobs, e.g. 1.0 .. 1.99), while user2, who has a
better priority but submits later, does so as many clusters with one job
each (e.g. 2.0, 3.0, 4.0).
Here's an example output of condor_q:
ID       OWNER  SUBMITTED    RUN_TIME   ST PRI SIZE   CMD
54713.0  user1  6/29 14:00   0+00:02:27 R  0     0.0  user1_job.py 2011.0
54713.1  user1  6/29 14:00   0+00:02:19 R  0   122.1  user1_job.py 2011.0
54713.2  user1  6/29 14:00   0+00:02:14 R  0     0.0  user1_job.py 2011.0
54713.3  user1  6/29 14:00   0+00:02:06 R  0     0.0  user1_job.py 2011.0
...
54713.99 user1  6/29 14:00   0+00:00:00 R  0     0.0  user1_job.py 2011.0
54488.0  user2  6/29 12:03   0+00:00:00 I  0   732.4  user2_job.sh
54489.0  user2  6/29 12:03   0+00:00:00 I  0   732.4  user2_job.sh
54490.0  user2  6/29 12:03   0+00:00:00 I  0   732.4  user2_job.sh
54491.0  user2  6/29 12:03   0+00:00:00 I  0   732.4  user2_job.sh
54492.0  user2  6/29 12:03   0+00:00:00 I  0   732.4  user2_job.sh
User2 has a better priority, so I would expect user2's job 54488.0 to be
scheduled on the first available machine when one of user1's jobs is
completed, but that's not what's happening. It seems like user1 has a
"claim" on the machines that lasts longer than an individual job. I've read
the following manual pages:
http://www.cs.wisc.edu/condor/manual/v7.4.4/2_7Priorities_Preemption.html
http://www.cs.wisc.edu/condor/manual/v7.4.4/3_4User_Priorities.html
http://www.cs.wisc.edu/condor/manual/v7.4.4/3_5Policy_Configuration.html
and I'm still not 100% sure I understand how jobs are scheduled in this
situation. I've found the CLAIM_WORKLIFE configuration setting in the
manual, which states, "If provided, this expression specifies the number of
seconds during which a claim will continue accepting new jobs." This leads
me to the following questions.
* How long does a user's "claim" last on a machine?
Others know more of the technical detail here, but if CLAIM_WORKLIFE
is not set, the claim lasts indefinitely unless preemption kicks the user off.
If you set CLAIM_WORKLIFE, the claim lasts only for the duration
that you specify.
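A quick way to check what a given startd is actually using, assuming a
reasonably standard setup, is condor_config_val; the node name below is just
a placeholder:

  # value from the local configuration, if any
  condor_config_val CLAIM_WORKLIFE

  # value the startd on a particular execute node is using
  condor_config_val -name node01.example.com -startd CLAIM_WORKLIFE

If the parameter is not defined anywhere, condor_config_val reports that it
is not defined, which corresponds to the unlimited-claim behavior described
above.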
* Does a job cluster keep a "claim" open on a machine until all its jobs are
completed?
The size of the cluster doesn't make a difference, but if a node is claimed
by a single user, then as long as that user has jobs in the queue it will
keep executing that user's jobs, whether they are in one cluster or many.
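To see this in action, something like the following should show which user
currently holds the claim on each slot (the exact output columns vary by
Condor version, so treat this as a sketch):

  # summary of claimed slots and who is using them
  condor_status -claimed

  # or pull the relevant attributes directly
  condor_status -constraint 'State == "Claimed"' \
      -format "%-30s " Name -format "%s\n" RemoteUser

As long as user1 shows up as RemoteUser on a slot, the schedd keeps feeding
that claim user1's remaining idle jobs rather than handing the slot back to
the negotiator.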
CLAIM_WORKLIFE is a very valuable setting. I set it to 3600 seconds.
I've never quite understood why the default is infinity.
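For anyone who wants to do the same, a minimal sketch of the change, using
the 3600-second value above; the file location is just the usual local
config on the execute nodes, so adjust to however you manage configuration:

  # condor_config.local on each execute node:
  # stop accepting new jobs on a claim after an hour, then release it
  CLAIM_WORKLIFE = 3600

  # pick up the change without disturbing running jobs
  condor_reconfig -startd

Once a claim's worklife expires, the startd lets the current job finish,
releases the claim, and the slot goes back through negotiation, where
user2's better priority can win it.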
Steve
Any help is appreciated.
Sincerely,
Jeff Ramnani
--
------------------------------------------------------------------
Steven C. Timm, Ph.D (630) 840-8525
timm@xxxxxxxx http://home.fnal.gov/~timm/
Fermilab Computing Division, Scientific Computing Facilities,
Grid Facilities Department, FermiGrid Services Group, Group Leader.
Lead of FermiCloud project.