Mailing List Archives
Re: [HTCondor-users] Backfill on an OpenStack system
- Date: Mon, 04 Oct 2021 07:20:42 -0700 (PDT)
- From: Marcus Ebert <mebert@xxxxxxx>
- Subject: Re: [HTCondor-users] Backfill on an OpenStack system
Hi Matt,
Sorry, I only now came across this thread.
Let me add one more option to the approaches others have already
described:
At the University of Victoria we developed Cloudscheduler for exactly the
reason you mentioned. It watches an HTCondor instance and, if there are
jobs in the queue, determines the job resource requests and then starts a
VM with enough resources on a cloud; once there are no more jobs, the VMs
get terminated.
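To make the loop described above concrete, here is a minimal sketch in Python. It is not Cloudscheduler's actual code: the `Job` and `Flavor` types, the "smallest fitting flavor" rule, and the `plan` function are all hypothetical names and assumptions chosen for illustration, and the real queries to HTCondor and the cloud API are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    """Resource request of one idle job (hypothetical, for illustration)."""
    cpus: int
    memory_mb: int


@dataclass(frozen=True)
class Flavor:
    """A VM size offered by the cloud (hypothetical, for illustration)."""
    name: str
    cpus: int
    memory_mb: int


def pick_flavor(job, flavors):
    """Return the smallest flavor that satisfies the job's request, or None."""
    fitting = [f for f in flavors
               if f.cpus >= job.cpus and f.memory_mb >= job.memory_mb]
    return min(fitting, key=lambda f: (f.cpus, f.memory_mb), default=None)


def plan(idle_jobs, running_vms, flavors):
    """Decide which flavors to boot and which VMs to terminate.

    running_vms maps a VM name to the number of jobs currently running on it.
    Assumption: one VM is booted per idle job that fits some flavor.
    """
    chosen = (pick_flavor(job, flavors) for job in idle_jobs)
    to_boot = [f.name for f in chosen if f is not None]
    to_terminate = []
    if not idle_jobs:
        # No more queued work: VMs that run nothing can go away.
        to_terminate = [vm for vm, njobs in running_vms.items() if njobs == 0]
    return to_boot, to_terminate
```

In a real setup the list of idle jobs would come from the HTCondor queue and the boot and terminate decisions would be passed to the cloud API; both steps are deliberately omitted here.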
In this system, jobs see a normal batch system, and the worker nodes
are started on demand by Cloudscheduler. Opportunistic usage is possible:
when VMs are started outside of Cloudscheduler and the total core usage in
the cloud project rises above a configured limit, Cloudscheduler
automatically retires the VMs (it lets running jobs finish but doesn't
allow HTCondor to start new jobs) and terminates each VM once no more jobs
are running on it. Cloudscheduler is fully accessible via a web interface
as well as a CLI.
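The retire-then-terminate behaviour described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not Cloudscheduler's implementation: the `VM` type, the `enforce_core_limit` function, the reading that the retired VMs are the ones started by the scheduler, and the choice to retire the least busy VMs first are all hypothetical. How retirement is signalled to HTCondor is not shown.

```python
from dataclasses import dataclass


@dataclass
class VM:
    """A VM started by the scheduler (hypothetical, for illustration)."""
    name: str
    cores: int
    running_jobs: int
    retiring: bool = False


def enforce_core_limit(managed_vms, foreign_cores, core_limit):
    """Retire managed VMs while the project exceeds core_limit.

    foreign_cores is the number of cores used by VMs started outside the
    scheduler. Returns the names of retiring VMs that run no jobs and can
    therefore be terminated now.
    """
    total = foreign_cores + sum(vm.cores for vm in managed_vms)
    # Cores still to be freed, beyond those already being retired.
    to_free = (total - core_limit) - sum(
        vm.cores for vm in managed_vms if vm.retiring)
    # Assumption: retire the least busy VMs first.
    for vm in sorted(managed_vms, key=lambda v: v.running_jobs):
        if to_free <= 0:
            break
        if not vm.retiring:
            vm.retiring = True  # running jobs finish, no new jobs start
            to_free -= vm.cores
    return [vm.name for vm in managed_vms
            if vm.retiring and vm.running_jobs == 0]
```

Calling the function repeatedly, for example once per polling cycle, terminates each retiring VM as soon as its last job has finished.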
For reference:
https://link.springer.com/epdf/10.1007/s41781-020-0036-1
(It is fully supported and new features are still being developed, so some
information in the publication may have changed since then.)
It's open source, and we would be happy to assist anyone in setting up
their own instance:
https://github.com/hep-gc/cloudscheduler
We have been running this successfully for many years on OpenStack systems
around the world and also on commercial clouds like Amazon. The jobs we run
are mostly for physics experiments like ATLAS, Belle II, and DUNE, and we
also run the Cloudscheduler instance as a service for others, who then only
provide their own HTCondor instance if wanted.
Cheers,
Marcus
On Sat, 4 Sep 2021, West, Matthew wrote:
Hi All,
Here at Exeter, IT is setting up an OpenStack system to support researchers who want DRAM-heavy bespoke workstation-like environments. Because I don't expect the system to be full up with active users 24/7, I am wondering what the optimal way is to set up an HTCondor pool on it to run jobs as backfill. Would this be similar to how you would do it for any other spare resources: have a VM start up on a node and announce itself to the collector daemon as an available worker if idle conditions of the machine are met?
It reminds me of the method to expand one's resources into corporate cloud servers but I am not sure what tools are useful in this case.
Cheers,
Matt