Hi everyone,
I have a user who runs a single workload that requires 64 GPUs spread across multiple servers, with communication between the jobs handled by PyTorch's default distributed setup. Our goal is to make better use of the cluster's resources.
Right now, the user submits 64 separate jobs, and the actual work only begins once all 64 have connected to the PyTorch master. The problem arises when fewer than 64 GPUs are available (say, only 63): the allocated GPUs sit idle, sometimes for days, waiting for the missing GPU.
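For context, a minimal sketch of what I believe each of the 64 jobs is doing; the exact script, environment variables, and backend are assumptions on my part, but it illustrates why the allocated GPUs block until the last rank shows up:

```python
import os
import torch.distributed as dist

def main():
    # Assumed launch convention: each job is started with RANK set to 0..63,
    # WORLD_SIZE=64, and MASTER_ADDR/MASTER_PORT pointing at the rank-0 host.
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])

    # This call blocks until all 64 ranks have joined the rendezvous (or the
    # timeout expires), which is why 63 allocated GPUs can sit idle waiting
    # for the one GPU that is still pending in the queue.
    dist.init_process_group(backend="nccl", init_method="env://",
                            rank=rank, world_size=world_size)

    # ... training loop (e.g. DistributedDataParallel) ...

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```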
I considered using preemption to address this, but with only 127 GPUs in total, the user could end up holding all of them while preventing any other jobs from starting.
Does anyone have suggestions on how to achieve better resource management?
Thanks,
David