Re: [HTCondor-users] Schedule jobs after each other on same node with shared scratch (GPU preproccessing)

Hi Miron,

Good question. Purging the data when a job fails would be fine; it does
require a full rerun, but that should be okay.

Alternatively, the job could copy the data back out to a specified
destination so that it can be restored from there.
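
To make the two options concrete, here is a rough sketch of a wrapper
around the training step. It is only an illustration, not something we
run today: the function name run_with_cleanup, the scratch_dir and
rescue_dir arguments, and the choice to also purge after a successful
run are all assumptions of mine.

```python
import shutil
import subprocess
import sys
from pathlib import Path


def run_with_cleanup(cmd, scratch_dir, rescue_dir=None):
    """Run cmd, then clean up the prepared data in scratch_dir.

    Hypothetical helper: if cmd fails and rescue_dir is given, the data is
    copied there first so that a rerun can restore it. Returns cmd's exit code.
    """
    scratch = Path(scratch_dir)
    result = subprocess.run(cmd)
    if result.returncode != 0 and rescue_dir is not None:
        # Second option: copy the prepared data out before removing it.
        shutil.copytree(scratch, Path(rescue_dir), dirs_exist_ok=True)
    # First option: purge the node-local data. Purging after a successful
    # run as well is an assumption, so nothing is left behind on the node.
    shutil.rmtree(scratch, ignore_errors=True)
    return result.returncode
```

Calling it without rescue_dir gives the plain "purge and rerun
everything" behaviour; with rescue_dir set, a failed run leaves a copy
at the destination.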

Emily

On 3/18/26 13:59, Miron Livny via HTCondor-users wrote:
> Emily,
>
> What do you expect to happen if the second job fails? Do you expect the data to stay there waiting for a rerun?
>
> In other words who will remove the training data?
>
> Miron
>
>
>> On Mar 18, 2026, at 07:54, Beyer, Christoph <christoph.beyer@xxxxxxx> wrote:
>>
>> Hi,
>>
>> there are a lot of more or less elegant ways to do this with condor's on-board resources.
>>
>> E.g. you could allow only one of these jobs at a time per type and then set the claim lifetime on the GPU node very high; this would automatically run these jobs one after another on this one node ...
>>
>> Best
>> christoph
>>
>>
>>
>> --
>> Christoph Beyer
>> DESY Hamburg
>> IT-Department
>>
>> Notkestr. 85
>> Building 02b, Room 009
>> 22607 Hamburg
>>
>> phone:+49-(0)40-8998-2317
>> mail: christoph.beyer@xxxxxxx
>>
>> ----- Original Message -----
>> From: "Erik Kooistra" <a66@xxxxxxxxx>
>> To: "htcondor-users" <htcondor-users@xxxxxxxxxxx>
>> Sent: Wednesday, 18 March 2026 12:59:34
>> Subject: [HTCondor-users] Schedule jobs after each other on same node with shared scratch (GPU preproccessing)
>>
>> Hi All,
>>
>> At NIKHEF we are seeing more and more GPU usage, and often those jobs
>> have quite a long preprocessing stage: reformatting the training inputs
>> or other data files takes quite some time compared to the total runtime
>> of the slot.
>>
>> As a result, we end up with slots that request a GPU but don't actually
>> use it for a large part of the claimed period, which results in
>> suboptimal GPU usage.
>>
>> Copying this prepared data back to network storage and then copying it
>> again to the scratch disk of a slot with a GPU is rather wasteful of
>> network bandwidth.
>>
>> So I was wondering whether it would be possible, in a DAG or some other
>> way, to run a CPU-intensive preprocessing job on a node with a GPU and
>> later in the process attach the GPU to this slot, or to have a way to do
>> an internal copy between the two jobs.
>>
>> Any other suggestions that would work within the current limitations of
>> condor are also more than welcome (for example, having a node-local
>> scratch and some constraints so that the jobs run after each other,
>> although then you miss the cleanup that condor does of the scratch).
>>
>> Emily Kooistra
>> NIKHEF
>> _______________________________________________
>> HTCondor-users mailing list
>> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
>> subject: Unsubscribe
>>
>> The archives can be found at: https://www-auth.cs.wisc.edu/lists/htcondor-users/