Hi Dudu,
always good to hear from you!
I scratched my head for a while and thought about concurrency limits to solve your issue, but it is not obvious to me how it would work with those alone.
Maybe worth a try: write a DAG with a 'pilot' job that requests 64 GPUs (using concurrency limits); once it runs, it starts all the 'real' jobs and dies. Through the claimed lifetime of the GPU slots you should get them all and be a happy camper :)
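Untested sketch of what I mean - the limit name GPU_TRAINING and the file names training.dag, pilot.sub, train.sub and run_training.sh are made up, adjust to taste. First define the concurrency limit in the pool config:

    # condor_config on the central manager
    GPU_TRAINING_LIMIT = 64

Then a small DAG with a pilot node in front of the real training node:

    # training.dag
    JOB PILOT pilot.sub
    JOB TRAIN train.sub
    PARENT PILOT CHILD TRAIN

    # pilot.sub - no-op job that grabs all 64 tokens of the limit,
    # so it only starts once nothing else is holding any of them
    universe           = vanilla
    executable         = /bin/true
    concurrency_limits = GPU_TRAINING:64
    log                = pilot.log
    queue

    # train.sub - the 64 real GPU jobs, one GPU and one token each
    universe           = vanilla
    executable         = run_training.sh
    request_gpus       = 1
    concurrency_limits = GPU_TRAINING
    log                = train.log
    output             = train.$(Process).out
    error              = train.$(Process).err
    queue 64

The pilot exits right away, the DAG then queues the 64 real jobs, and the freed tokens should go to them - whether somebody else's limited job can sneak in between the two steps is exactly the part I am not sure about.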
But that is only a wild guess on how it could work - there might be a more elegant solution ...
Best
christoph
--
Christoph Beyer
DESY Hamburg
IT-Department
Notkestr. 85
Building 02b, Room 009
22607 Hamburg
phone: +49-(0)40-8998-2317
mail: christoph.beyer@xxxxxxx
Von: "Dudu Handelman" <duduhandelman@xxxxxxxxxxx>
An: "htcondor-users" <htcondor-users@xxxxxxxxxxx>
Gesendet: Dienstag, 18. Februar 2025 15:54:33
Betreff: [HTCondor-users] Training large model
Hi everyone,
I have a user who submits a single job that requires 64 GPUs across multiple servers, with communication between jobs handled by default PyTorch. Our goal is to optimize resource usage.
Right now, the user submits 64 separate jobs, and the main job only begins once all 64 have connected to the PyTorch master. The problem arises when fewer than 64 GPUs are available, say only 63, forcing the system to wait (sometimes for days) for the missing GPU.
I considered using preemption to address this, but with 127 GPUs, the user might end up holding all the resources while preventing any secondary jobs from starting.
Does anyone have suggestions on how to achieve better resource management?
Thanks,
David