Re: [HTCondor-users] Schedule jobs after each other on same node with shared scratch (GPU preprocessing)
- Date: Wed, 18 Mar 2026 12:59:55 +0000
- From: Miron Livny <miron@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] Schedule jobs after each other on same node with shared scratch (GPU preprocessing)
Emily,
What do you expect to happen if the second job fails? Do you expect the data to stay there waiting for a rerun?
In other words who will remove the training data?
Miron
Sent from my iPhone
> On Mar 18, 2026, at 07:54, Beyer, Christoph <christoph.beyer@xxxxxxx> wrote:
>
> Hi,
>
> there are a lot of options to do this more or less elegantly with condor on-board resources.
>
> E.g. you could allow only one of these jobs at a time per type, and then set the claim lifetime on the GPU node very high; this would automatically run these jobs one after another on this one node ...
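A minimal sketch of the approach described above, using HTCondor concurrency limits plus a long claim worklife (the limit name GPUPREP and the values are hypothetical):

```
# In the submit file of each job of this type:
#   jobs sharing a concurrency-limit name are counted against it pool-wide
concurrency_limits = GPUPREP

# In the central manager's condor_config:
#   allow only one GPUPREP job to run at a time
GPUPREP_LIMIT = 1

# In the condor_config of the GPU execute node:
#   keep a claim reusable for a long time (value in seconds),
#   so the schedd keeps running jobs on the same claimed slot
CLAIM_WORKLIFE = 86400
```

With only one such job running at a time and a long claim worklife, successive jobs tend to land back to back on the same claimed slot. Note that each job still gets a fresh scratch directory, so data handoff between jobs is not covered by this alone.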
>
> Best
> christoph
>
>
>
> --
> Christoph Beyer
> DESY Hamburg
> IT-Department
>
> Notkestr. 85
> Building 02b, Room 009
> 22607 Hamburg
>
> phone:+49-(0)40-8998-2317
> mail: christoph.beyer@xxxxxxx
>
> ----- Original Message -----
> From: "Erik Kooistra" <a66@xxxxxxxxx>
> To: "htcondor-users" <htcondor-users@xxxxxxxxxxx>
> Sent: Wednesday, 18 March 2026 12:59:34
> Subject: [HTCondor-users] Schedule jobs after each other on same node with shared scratch (GPU preprocessing)
>
> Hi All,
>
> At NIKHEF we are seeing more and more GPU usage, and those jobs often
> have quite a long preprocessing stage: reformatting the training inputs
> or other data files takes quite some time compared to the total runtime
> of the slot.
>
> As a result we end up with slots that request a GPU but don't actually
> use it for a large part of the claimed period, which results in
> suboptimal GPU usage.
>
> Copying this prepared data back to network storage, and then copying it
> back to the scratch disk of the slot with a GPU, is rather wasteful of
> network bandwidth.
>
> So I was wondering if it would be possible, in a DAG or some other way,
> to have a CPU-intensive preprocessing job run on a node with a GPU, and
> later in the process attach the GPU to that slot, or to have some way
> of doing an internal copy between the two jobs.
>
> Any other suggestions that would work with the current limitations of
> condor are also more than welcome (for example, having node-local
> scratch and some constraints so the jobs run after each other, although
> then you miss the cleanup that condor does of the scratch).
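The DAG variant mentioned above could be sketched as follows (the submit file names and the pinned hostname are made up; note that even on the same machine each job gets its own fresh scratch directory, so the handoff would have to go through a node-local path outside scratch, losing condor's automatic cleanup):

```
# pipeline.dag
JOB  prep  prep.sub     # CPU-only preprocessing; its submit file requests no GPU
JOB  train train.sub    # training job; its submit file has request_gpus = 1
PARENT prep CHILD train

# To land both jobs on the same machine, both submit files could pin
# to a specific node, e.g.:
#   requirements = (Machine == "wn-gpu-01.nikhef.nl")
```

Pinning to one machine this way works but sacrifices scheduling flexibility, which is presumably why the question asks for a more dynamic mechanism.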
>
> Emily Kooistra
> NIKHEF
> _______________________________________________
> HTCondor-users mailing list
> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
>
> The archives can be found at: https://www-auth.cs.wisc.edu/lists/htcondor-users/