Mailing List Archives Authenticated access	UW Madison Computer Sciences Department Computer Systems Lab


Re: [Condor-users] counting licenses

Date: Fri, 06 May 2005 07:59:34 -0700
From: Joshua Kolden <joshua@xxxxxxxxxxxxxxxxx>
Subject: Re: [Condor-users] counting licenses

> In a future release the plan appears to be that local disk space is to
> be handled by the starter enforcing constraints and killing the job if
> it violates them. Remote disk space is the purview of the quota system
> of the filesystem...

> There seems to be code relating to sophisticated network management
> built into Condor but not enabled yet - I don't know if this is
> something not ready for prime time...

> Condor is very much about the individual users having a reasonable awareness of the impact of their jobs on the wider world and throttling as they see fit.

Thanks for the info. I might check out some of the options you propose. We've found that in large-facility queue systems, such as those used for visual effects, globally monitored resources are very important. In simple settings a user can control their own impact on the queue, but in large facilities that impact becomes much more difficult to predict. In fact I'd say it's not even the size of the facility that matters but rather the throughput of the facility relative to the number of users. The higher the ratio, the more difficult it is to manually manage global resources. Many places simply assign each artist a limited number of CPUs to run on, but this seems to me to circumvent automation of the queue. Alfred from Pixar, although not the best queue software, has a 'ping' system that allows one to run any command before a job is started. It is very easy to implement and very effective for global management.
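To make the 'ping' idea concrete, here is a rough Python sketch of a pre-start check. It is not Alfred's actual interface and not anything Condor provides: `dispatch`, `check_cmd`, and `start` are hypothetical names, and the convention that exit status 0 means "resource available" is an assumption made for illustration.

```python
import subprocess


def dispatch(jobs, check_cmd, start, run=subprocess.call):
    """Start each job only if the pre-start check command succeeds.

    check_cmd is any command (as an argument list) that exits 0 when the
    global resource, e.g. a license, is free. This exit-code convention
    is an assumption of the sketch. Jobs whose check fails are returned
    so the caller can try them again later.
    """
    held = []
    for job in jobs:
        if run(check_cmd) == 0:   # the "ping": ask before starting
            start(job)
        else:
            held.append(job)      # not failed, just not started yet
    return held
```

The check runs once per job rather than once per batch, so a resource that runs out partway through stops the remaining jobs from starting.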

Some systems that don't offer global resource monitoring do allow a job to return a failure that is understood to mean "try again in a little bit", such as a license-failure return code. Such a failure causes the job not to submit a new task for a set amount of time, or until an existing task finishes. Unlike a normal failure, a resource-failure return code never causes the task to be marked failed; it just keeps trying until it gets the resource. If there is no such system in Condor I would strongly encourage its addition. It's the state of the art for visual effects queues circa 1995. It doesn't solve the submission logic quite like we need, but it's better than nothing.
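A minimal Python sketch of that retry-on-resource-failure behaviour follows. It is not tied to any particular queue system: the exit code 75 (borrowed from EX_TEMPFAIL in sysexits.h), the function name, and the 60-second default delay are all assumptions chosen for illustration. The "or until an existing task finishes" trigger is not modelled here.

```python
import subprocess
import time

# Hypothetical convention: this exit code means "resource (e.g. a license)
# was unavailable, try again later" rather than a real failure.
RESOURCE_UNAVAILABLE = 75


def run_with_resource_retry(cmd, retry_delay=60.0,
                            run=subprocess.call, sleep=time.sleep):
    """Run cmd, retrying for as long as it reports a resource failure.

    Returns (exit_code, attempts). A resource failure is never treated
    as a task failure; only some other exit code ends the loop.
    """
    attempts = 0
    while True:
        attempts += 1
        code = run(cmd)
        if code != RESOURCE_UNAVAILABLE:
            return code, attempts   # success or a genuine failure
        sleep(retry_delay)          # back off before trying again
```

The `run` and `sleep` parameters exist only so the loop can be exercised without launching real processes or waiting.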

Thanks,
j

