
Re: [HTCondor-users] Schedule jobs after each other on same node with shared scratch (GPU preprocessing)



Hi Miron,

It's definitely a non-trivial problem if we want something flexible enough to avoid yet another "quick" hack.

I did some brainstorming with people and looked at what other batch systems and job schedulers offer.

Some suggestions:

- Implement a way to copy outputs between nodes in a DAG without it having to go via the scheduler, if possible (although maybe this is already possible).
- Add better support for shared filesystems; the current file system domain is not flexible enough for this. Ideally you want some hierarchical way to specify which paths / filesystems are shared between which subsets of nodes.
- Add a way to declare DAG-wide scratch space requirements; there are different solutions out there for this right now, e.g. BeeOND, Lustre on Demand, Garage, or other S3-based systems.

Of course, a big part of this can already be put together to some degree with PROVISIONER / SERVICE nodes, but that puts quite a burden on the users to make it work.
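To illustrate the kind of thing users have to assemble by hand today, here is a minimal DAGMan sketch where a provisioner node sets up shared scratch before the actual work runs. All node and submit-file names are hypothetical:

```
# Hypothetical DAG: a PROVISIONER node brings up shared scratch
# (e.g. BeeOND) before the work nodes start, and tears it down after.
PROVISIONER Scratch setup_scratch.sub

JOB Preprocess preprocess.sub
JOB Train train.sub
PARENT Preprocess CHILD Train
```

A SERVICE node could play a similar role for something long-running alongside the DAG; either way, the setup/teardown logic inside setup_scratch.sub is entirely the user's problem, which is the burden referred to above.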

Emily

On 3/18/26 15:49, Miron Livny via HTCondor-users wrote:
Thank you, Emily ... as you can tell, the challenge in this type of case is what to do in case of a failure ... now, the second job may fail because the node crashed ... in that case the second job will wait until the node comes back (if ever?) ... should we copy the processed training data back to "safe" storage just in case?

We are extremely interested in these AI-driven workloads and want to do it "right".

Miron


------------------------------------------------------------------------
*From:* HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Emily Kooistra <a66@xxxxxxxxx>
*Sent:* Wednesday, March 18, 2026 9:00
*To:* htcondor-users@xxxxxxxxxxx <htcondor-users@xxxxxxxxxxx>
*Subject:* Re: [HTCondor-users] Schedule jobs after each other on same node with shared scratch (GPU preprocessing)
Hi Miron,

Good question; purging the data when a job fails would be fine. It does
require a full rerun, but that should be okay.

Alternatively, have the job copy the data back out to a specified
destination so it can be restored from there.

Emily

On 3/18/26 13:59, Miron Livny via HTCondor-users wrote:
Emily,

What do you expect to happen if the second job fails? Do you expect the data to stay there waiting for a rerun?

In other words, who will remove the training data?

Miron


Sent from my iPhone

On Mar 18, 2026, at 07:54, Beyer, Christoph <christoph.beyer@xxxxxxx> wrote:

Hi,

there are a lot of options to do this more or less elegantly with Condor's on-board resources.

E.g. you could allow only one of these jobs at a time per type, and then set the claim lifetime on the GPU node very high; this would automatically run these jobs one after another on this one node ...
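As a rough sketch of that approach, using a concurrency limit to serialize the jobs and CLAIM_WORKLIFE to keep the claim alive between them (the limit name and lifetime value are illustrative, not a tested recipe):

```
# --- startd config on the GPU node ---
# Allow a claim to be reused for successive jobs for up to a day
# (default behaviour would release it much sooner).
CLAIM_WORKLIFE = 86400

# --- negotiator/pool config ---
# At most one job holding this limit runs at a time.
GPU_PREP_LIMIT = 1

# --- in each job's submit file ---
# concurrency_limits = gpu_prep
```

Because only one such job can run at a time and the claim stays alive, the schedd keeps handing the same slot to the next job in line, so they effectively run back to back on the same node.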

Best
christoph



--
Christoph Beyer
DESY Hamburg
IT-Department

Notkestr. 85
Building 02b, Room 009
22607 Hamburg

phone:+49-(0)40-8998-2317
mail: christoph.beyer@xxxxxxx

----- Original Message -----
From: "Emily Kooistra" <a66@xxxxxxxxx>
To: "htcondor-users" <htcondor-users@xxxxxxxxxxx>
Sent: Wednesday, 18 March 2026 12:59:34
Subject: [HTCondor-users] Schedule jobs after each other on same node with shared scratch (GPU preprocessing)

Hi All,

At NIKHEF we are seeing more and more GPU usage, and often those jobs
have quite a long preprocessing stage; reformatting the training inputs
or other data files takes quite some time compared to the total runtime
of the slot.

As a result we end up with slots that do request a GPU but don't
actually use it for a big part of the claimed period, which results in
suboptimal GPU usage.

Copying this prepared data back to network storage, and then copying it
back to the scratch disk of the slot with a GPU, is a bit wasteful of
network bandwidth.

So I was wondering whether it would be possible, in a DAG or some other
way, to have a CPU-intensive preprocessing job running on a node with a
GPU, and later in the process attach the GPU to this slot, or to have
some way of doing an internal copy between the two jobs.

Any other suggestions that would work within the current limitations of
Condor are also more than welcome (for example, having a node-local
scratch and some constraints so the jobs run after each other, although
then you miss the cleanup that Condor does of the scratch directory).
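One way the node-local-scratch workaround could look, sketched as a DAG with a FINAL node standing in for the cleanup that Condor would otherwise have done (hostname, file names, and scratch path are all illustrative):

```
# Hypothetical DAG: preprocess and train pinned to one GPU node via an
# identical requirements expression in both submit files, e.g.
#   requirements = (Machine == "gpunode01.example.org")
# Preprocess writes to a directory outside Condor's EXECUTE dir
# (e.g. /localscratch/mydag) so the data survives between jobs.
JOB Preprocess preprocess.sub
JOB Train train.sub
PARENT Preprocess CHILD Train

# FINAL node always runs last, even on failure, and can remove
# /localscratch/mydag -- replacing the automatic scratch cleanup.
FINAL Cleanup cleanup.sub
```

This still wastes the GPU during preprocessing unless the slot can be reconfigured mid-DAG, which is exactly the gap described above.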

Emily Kooistra
NIKHEF
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe

The archives can be found at: https://www-auth.cs.wisc.edu/lists/htcondor-users/
