[HTCondor-users] BOINC jobs on compute nodes (with GPU)
- Date: Mon, 25 Apr 2016 13:44:58 +0200
- From: Carsten Aulbert <Carsten.Aulbert@xxxxxxxxxx>
- Subject: [HTCondor-users] BOINC jobs on compute nodes (with GPU)
Hi all
(if the introductory part is too long, please just skip to the part
marked by ####)
We are just about to redo our organically grown HTCondor config and are
revisiting how to run BOINC alongside HTCondor.
At the moment I think the only supported way of running BOINC under
HTCondor's umbrella is as backfill, which will only ever kick in if the
node has an idle slot/core (static model) or is completely idle (i.e. if
the node is configured to be fully partitionable).
Given that we have a number of multi-core jobs, I think going back to a
static layout is a no-go as is waiting until all cores of a given system
are idle.
This currently leaves only two alternatives I can think of.
(1) A special user submits condor jobs into the pool with a very bad
priority and the pool is configured to evict these jobs as soon as
needed. Within this framework, I think it should also be possible to
submit GPU and CPU jobs in parallel.
However, managing this centrally with proper copying of files is
potentially a nightmare even if HTCondor will do that for us as one
would need to ensure proper locking and so on.
(2) The easier approach - which is what we are currently using - is
starting the BOINC client on the system independently of HTCondor but
limiting it via cgroups to 1/1024 of a core, so it will only ever get
cycles when there are idle cycles.
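As a rough illustration of why such a limit behaves like "idle cycles
only", here is a minimal sketch. It assumes the limit is expressed as a
relative CPU weight (in the style of cgroup v1 cpu.shares, where the
default weight is 1024); the function name and the example weights are
hypothetical and not taken from our actual setup. Under full contention
each group gets CPU time in proportion to its weight, and when the other
groups are idle the low-weight group can use whatever is left.

```python
from fractions import Fraction


def contended_cpu_fraction(own_weight, sibling_weights):
    """Share of CPU time a group gets when it and all siblings are runnable.

    Weights are relative: only their ratio matters, not their absolute size.
    """
    total = own_weight + sum(sibling_weights)
    return Fraction(own_weight, total)


if __name__ == "__main__":
    # Hypothetical example: BOINC group with weight 1 competing against
    # one sibling group running at the default weight of 1024.
    print(contended_cpu_fraction(1, [1024]))  # prints 1/1025
```

With no runnable siblings the same function returns 1, i.e. the BOINC
group may use the whole CPU while nothing else wants it.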
This approach works surprisingly well. However, as this is outside of
HTCondor, I don't dare to occupy the GPU, as I would have no idea when
condor_startd would start a job on it.
Thus my question:
####
Is there a hook within HTCondor's startd on nodes with partitionable slots
which we could use to launch a script/interact with BOINC via
API/boinccmd to start a BOINC GPU job whenever the GPU is unused by
HTCondor and stop it again when HTCondor needs it?
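
To make the intent concrete, here is a minimal sketch of the kind of
script such a hook could call. It assumes that boinccmd's
--set_gpu_mode option (with the modes "never" and "auto") is the right
knob, and that something on the HTCondor side can tell the script
whether the GPU is currently claimed; that detection is not shown, and
the function names and the dry_run parameter are hypothetical.

```python
import subprocess


def gpu_mode_command(gpu_claimed_by_condor, boinccmd="boinccmd"):
    """Build the boinccmd call for the desired BOINC GPU mode.

    "never" keeps BOINC off the GPU, "auto" lets BOINC use it again
    according to its own preferences.
    """
    mode = "never" if gpu_claimed_by_condor else "auto"
    return [boinccmd, "--set_gpu_mode", mode]


def apply_gpu_mode(gpu_claimed_by_condor, dry_run=True):
    """Run the command, or only return it when dry_run is set."""
    cmd = gpu_mode_command(gpu_claimed_by_condor)
    if not dry_run:
        # Requires a running BOINC client reachable by boinccmd.
        subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    # Hypothetical call: HTCondor has just claimed the GPU.
    print(apply_gpu_mode(True))
```

The open point remains what would trigger this script from within the
startd.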
Cheers
Carsten
--
Dr. Carsten Aulbert, Atlas cluster administration
Max Planck Institute for Gravitational Physics (Albert Einstein Institute)
Callinstraße 38, 30167 Hannover, Germany
Tel: +49 511 762 17185, Fax: +49 511 762 17193