Hi Dudu,
always good to hear from you!
I scratched my head for a while and thought about concurrency limits to solve your issue, but it is not obvious to me how it would work with those alone.
Maybe worth a try: write a DAG with a 'pilot' job that requests 64 GPUs (using concurrency limits); once it runs, it starts all the 'real' jobs and dies. Through the claimed lifetime of the GPU slots you should get them all and be a happy camper :)
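Untested sketch of what I mean - the limit name GPU_TRAINING and the file names training.dag, pilot.sub, train.sub and run_training.sh are made up, adjust to taste. First define the concurrency limit in the pool config:

    # condor_config on the central manager
    GPU_TRAINING_LIMIT = 64

Then a small DAG with a pilot node in front of the real training node:

    # training.dag
    JOB PILOT pilot.sub
    JOB TRAIN train.sub
    PARENT PILOT CHILD TRAIN

    # pilot.sub - no-op job that grabs all 64 tokens of the limit,
    # so it only starts once nothing else is holding any of them
    universe           = vanilla
    executable         = /bin/true
    concurrency_limits = GPU_TRAINING:64
    log                = pilot.log
    queue

    # train.sub - the 64 real GPU jobs, one GPU and one token each
    universe           = vanilla
    executable         = run_training.sh
    request_gpus       = 1
    concurrency_limits = GPU_TRAINING
    log                = train.log
    output             = train.$(Process).out
    error              = train.$(Process).err
    queue 64

The pilot exits right away, the DAG then queues the 64 real jobs, and the freed tokens should go to them - whether somebody else's limited job can sneak in between the two steps is exactly the part I am not sure about.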
But that is only a wild guess on how it could work - there might be a more elegant solution ...
Best
christoph
--
Christoph Beyer
DESY Hamburg
IT-Department
Notkestr. 85
Building 02b, Room 009
22607 Hamburg
phone: +49-(0)40-8998-2317
mail: christoph.beyer@xxxxxxx
Von: "Dudu Handelman" <duduhandelman@xxxxxxxxxxx>
An: "htcondor-users" <htcondor-users@xxxxxxxxxxx>
Gesendet: Dienstag, 18. Februar 2025 15:54:33
Betreff: [HTCondor-users] Training large model
Hi everyone,
I have a user who submits a single job that requires 64 GPUs across multiple servers, with communication between jobs handled by default PyTorch. Our goal is to optimize resource usage.
Right now, the user submits 64 separate jobs, and the main job only begins once all 64 have connected to the PyTorch master. The problem arises when fewer than 64 GPUs are available, say only 63, forcing the system to wait (sometimes for days) for the missing GPU.
I considered using preemption to address this, but with 127 GPUs, the user might end up holding all the resources while preventing any secondary jobs from starting.
Does anyone have suggestions on how to achieve better resource management?
Thanks,
David