Re: [HTCondor-users] Htcondor with python ray



Ray is deeply tied to Kubernetes as its resource manager; see how it tries to use Kubernetes-style horizontal scaling. You would need to build a system that lets Ray communicate with HTCondor rather than k8s. There is also a question of job efficiency: for example, running Dask on top of HTCondor was very inefficient initially.

Benedikt

On Tue, Nov 25, 2025 at 12:50 PM Michael DiDomenico <mdidomenico4@xxxxxxxxx> wrote:
On Tue, Nov 25, 2025 at 6:40 PM Ram Ban <ramban046@xxxxxxxxx> wrote:
>
> Yeah, that's an option in Ray, but it's not recommended to run ray start multiple times on the same machine.
> In dynamic slots, if two jobs of a Ray cluster start on the same machine, they might crash (not tested).
> Also, ray stop has no option to distinguish between workers; on downscale it will stop both workers, which might crash jobs in the other slot.

You have to use /tmp and /dev/shm isolation. But agreed, it's not recommended.
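
A minimal sketch of what that isolation could look like from inside a job, assuming each worker is pointed at directories under its own slot scratch directory. The helper name ray_worker_launch and the ray_tmp / ray_shm directory names are hypothetical. It assumes HTCondor exports _CONDOR_SCRATCH_DIR to the job, that Ray honors the TMPDIR / RAY_TMPDIR environment variables, and that ray start accepts --address, --plasma-directory and --block; check these against ray start --help for the installed version. The function only builds the command and environment, it does not start anything.

```python
import os
from pathlib import Path


def ray_worker_launch(head_address, scratch_dir=None):
    """Build (argv, env) for one Ray worker confined to a per-slot directory.

    Hypothetical helper: nothing here is an official Ray or HTCondor API.
    """
    # HTCondor normally exports _CONDOR_SCRATCH_DIR for each running job;
    # fall back to the current directory when testing outside a job.
    scratch = Path(scratch_dir or os.environ.get("_CONDOR_SCRATCH_DIR", "."))
    tmp_dir = scratch / "ray_tmp"   # stands in for a private /tmp
    shm_dir = scratch / "ray_shm"   # stands in for a private /dev/shm

    env = dict(os.environ)
    # Assumption: Ray derives its temp root from these variables.
    env["TMPDIR"] = str(tmp_dir)
    env["RAY_TMPDIR"] = str(tmp_dir)

    argv = [
        "ray", "start",
        f"--address={head_address}",
        # Assumption: flag name as in current Ray releases.
        f"--plasma-directory={shm_dir}",
        "--block",  # keep the process in the foreground so the job stays alive
    ]
    return argv, env


if __name__ == "__main__":
    argv, env = ray_worker_launch("10.0.0.1:6379", "/scratch/slot1")
    print(" ".join(argv))
    print(env["TMPDIR"])
```

The caller would still need to create both directories before launching. Note that a scratch directory on disk is much slower than a real /dev/shm for the object store, and that this does nothing about ray stop affecting workers in other slots.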

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe

The archives can be found at: https://www-auth.cs.wisc.edu/lists/htcondor-users/


--
Benedikt Riedel
IceCube Neutrino Observatory
Accelerated AI Algorithms for Data-Driven Discovery
University of Wisconsin-Madison