Hi everyone,
I have a user who runs a single workload that requires 64 GPUs spread across multiple servers, with communication between the jobs handled by PyTorch's default distributed setup. Our goal is to make better use of the cluster's resources.
Right now, the user submits 64 separate jobs, and the actual work only begins once all 64 have connected to the PyTorch master. The problem arises when fewer than 64 GPUs are available (say, only 63): the allocated GPUs sit idle, sometimes for days, waiting for the missing GPU.
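For context, a minimal sketch of what I believe each of the 64 jobs is doing; the exact script, environment variables, and backend are assumptions on my part, but it illustrates why the allocated GPUs block until the last rank shows up:

```python
import os
import torch.distributed as dist

def main():
    # Assumed launch convention: each job is started with RANK set to 0..63,
    # WORLD_SIZE=64, and MASTER_ADDR/MASTER_PORT pointing at the rank-0 host.
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])

    # This call blocks until all 64 ranks have joined the rendezvous (or the
    # timeout expires), which is why 63 allocated GPUs can sit idle waiting
    # for the one GPU that is still pending in the queue.
    dist.init_process_group(backend="nccl", init_method="env://",
                            rank=rank, world_size=world_size)

    # ... training loop (e.g. DistributedDataParallel) ...

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```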
I considered using preemption to address this, but with only 127 GPUs in total, the user could end up holding all of them while preventing any other jobs from starting.
Does anyone have suggestions on how to achieve better resource management?
Thanks,
David