Re: [Condor-users] Whole System Scheduling
- Date: Fri, 23 Oct 2009 11:01:20 -0500
- From: Ioan Raicu <iraicu@xxxxxxxxxxxxxxx>
- Subject: Re: [Condor-users] Whole System Scheduling
Have you tried glide-ins? Essentially, users submit pilot jobs to do
resource provisioning (e.g. a job that requires 1000 CPUs for 24 hours),
and once the pilot job (of 1000 CPUs) is in the running state, jobs can
be submitted directly to the worker nodes, bypassing the main queue. In
essence, you allocate resources initially at coarse granularity, say the
entire site, and the user then consumes those allocated resources in
finer-grained quantities. This is also called multi-level scheduling in some
communities. I believe you can use native Condor tools to do this, but I
have never tried. I did use several other tools (MyCluster and Falkon)
to achieve this kind of functionality. MyCluster can set up a Condor
cluster within another cluster, so it is likely your most transparent
solution: users don't have to rewrite any code if their
apps already work with Condor. Falkon can offer some performance and
scalability improvements over Condor (and other LRMs) for certain
workloads, but it requires that the apps use the Falkon API to submit
jobs and listen for notifications.
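
To make the two levels concrete, here is a minimal Python sketch of the idea, not of any real Condor, MyCluster, or Falkon API. The `Pilot` class and `dispatch` function are hypothetical names invented for illustration: the pilot stands for the coarse allocation obtained from the main queue, and `dispatch` hands out CPUs inside it without going back to that queue.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Pilot:
    """Coarse-grained allocation obtained from the main queue (hypothetical model)."""
    cpus: int
    hours: float
    running: bool = False   # becomes True once the main queue starts the pilot
    used_cpus: int = 0

    @property
    def free_cpus(self) -> int:
        return self.cpus - self.used_cpus


def dispatch(pilot: Pilot, tasks: deque) -> list:
    """Second-level scheduling: place small tasks on CPUs the pilot already holds.

    `tasks` is a deque of (name, cpus_needed) pairs. Returns the names started;
    tasks that do not fit stay queued, in their original order.
    """
    if not pilot.running:
        return []  # nothing can be dispatched until the pilot job itself runs
    started = []
    while tasks and tasks[0][1] <= pilot.free_cpus:
        name, need = tasks.popleft()
        pilot.used_cpus += need
        started.append(name)
    return started
```

The main queue only ever sees the single 1000-CPU request; everything finer-grained is decided inside the allocation.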
Cheers,
Ioan
Jonathan D. Proulx wrote:
Hi All,
I've been trying to get whole system scheduling working on my pool
for some months now and it is becoming a rather critical issue.
I've been basing my config off of http://nmi.cs.wisc.edu/node/1482
Ideally I'd like
1) Whole system jobs _must_ not run until they have the whole system
2) Non "PriorityGroup" (predefined in config) jobs _should_ be
preempted when a "PriorityGroup" whole system job is scheduled
3) Whole system jobs _should_ be suspended until all single slot
"PriorityGroup" jobs complete
Point one is critical as much of the code users are looking to
schedule in this way is benchmark code that is only meaningful if the
rest of the system is quiescent.
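
The three requirements above can be restated as a small decision function. This is only a sketch of the desired policy in Python, not Condor configuration; `whole_system_action` and its arguments are hypothetical names used for illustration.

```python
def whole_system_action(other_slots, priority_group):
    """Decide what a whole-system job matched to a machine should do.

    `other_slots` lists the owner of the job on each other slot (None = idle).
    `priority_group` is the set of users whose single-slot jobs must finish first.
    Returns (action, owners_to_preempt, owners_to_wait_for).
    """
    busy = [owner for owner in other_slots if owner is not None]
    # Requirement 2: jobs from users outside the priority group get preempted.
    to_preempt = [owner for owner in busy if owner not in priority_group]
    # Requirement 3: priority-group single-slot jobs are allowed to complete.
    wait_for = [owner for owner in busy if owner in priority_group]
    # Requirement 1: never run unless every other slot is idle.
    if busy:
        return "suspend", to_preempt, wait_for
    return "run", [], []
```

For example, with `other_slots = ["bob", "alice", None]` and `priority_group = {"alice"}`, the job stays suspended, bob's job is marked for preemption, and alice's job is waited on.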
Ignoring non-PriorityGroup users for now and simply trying to
suspend the whole system job until all other slots are clear fails at
MaxSuspendTime, which is understandable, except that the job does execute
for some period of time before being killed and requeued (usually in
the exact same slot).
Though if #2 and/or #3 are difficult
4) All single slot jobs _may_ be preempted by a whole system job
Trying #4, I see that the whole system job gets scheduled on slot one
and starts running. Slots 2 through N continue executing for 3 min (a
negotiation cycle?) before they exit.
My fondest wish would be for Condor to be able to allocate multiple CPUs, so
that jobs could simply require some number (which they could if I
configured a matrix of mutually exclusive slots, I guess, but as we get
up into the world of 16 and more cores this gets crazy).
Help?
-Jon
--
=================================================================
Ioan Raicu, Ph.D.
NSF/CRA Computing Innovation Fellow
=================================================================
Center for Ultra-scale Computing and Information Security (CUCIS)
Department of Electrical Engineering and Computer Science
Northwestern University
2145 Sheridan Rd, Tech M384
Evanston, IL 60208-3118
=================================================================
Cel: 1-847-722-0876
Tel: 1-847-491-8163
Email: iraicu@xxxxxxxxxxxxxxxxxxxxx
Web: http://www.eecs.northwestern.edu/~iraicu/
https://wiki.cucis.eecs.northwestern.edu/
=================================================================