Re: [HTCondor-users] Parallel Universe on Kubernetes



Hi Greg,

 

Thank you for your response! And apologies for the delay; I’ve been immersed in a lot of things.

 

Did you look at my gist? I already followed the section of the documentation that you linked: I set up the dedicated scheduler and told the execute nodes about it, but the most basic parallel example still does not work. I’ve created an issue here:

 

https://github.com/bbockelm/htcondor-autoscale-manager/issues/1

 

The issue is there in case we can chat off of this mailing list. Here again is the original gist:

 

https://gist.github.com/vsoch/2073136f0833983efc92b4eeb52d49dd

 

TLDR: I am able to generate a token, OR use the password auth (that works for basic jobs), and really I just want to understand what a working example of configs should look like for parallel. I’ve read and tried most of what I found in the docs and haven’t gotten anything functioning yet – I would never ask for help without giving it a fair effort (and documenting that effort) myself. I can send you any information or configs about my cluster that aren’t in those logs to further help. Thanks!

 

Best,

 

Vanessa

 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Greg Thain via HTCondor-users <htcondor-users@xxxxxxxxxxx>
Date: Friday, June 23, 2023 at 8:14 AM
To: htcondor-users@xxxxxxxxxxx <htcondor-users@xxxxxxxxxxx>
Cc: Greg Thain <gthain@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Parallel Universe on Kubernetes

 

On 6/19/23 5:45 PM, Sochat, Vanessa via HTCondor-users wrote:

Hi Folks!

 

I’m the HPC monkey of which he speaks! I’ve been creating a Kubernetes Operator with HTCondor as the scheduler, and doing fairly well up until I needed to use the parallel universe. To not clutter your inboxes, here is a summary of where I am currently at!

 

https://gist.github.com/vsoch/2073136f0833983efc92b4eeb52d49dd

 

TLDR: if we could easily adapt the current setup with the docker images here https://github.com/htcondor/htcondor/tree/main/build/docker/services to allow for this parallel universe, that would likely be the example that I need to get it working in Kubernetes. The current working (for basic jobs) setup is here: https://github.com/converged-computing/htcondor-operator and my (so far) failed attempts are under the single opened PR to add LAMMPS. I’m happy to show you / debug anything you might be interested in. Thanks again for your help, and apologies for my noob-level expertise – I’m only about a day into using this beastie!

 

That's very impressive for a day's work.  If you can stand up HTCondor on k8s, configuring it to run parallel jobs with a dedicated scheduler is pretty straightforward -- just point the startds at the one schedd that can run parallel jobs, as described here: https://htcondor.readthedocs.io/en/latest/admin-manual/setting-up-special-environments.html?highlight=dedicated%20scheduler#selecting-and-setting-up-a-dedicated-scheduler
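
For reference, "pointing the startds at the schedd" roughly means adding something like the following to the configuration of each execute node. This is only a sketch based on the manual section linked above: the host name submit.example.org is a placeholder, not a real machine, and the exact policy expressions you need may differ for your pool.

```
# Execute-node configuration (sketch; submit.example.org is a placeholder
# for the full host name of the schedd chosen as the dedicated scheduler).
DedicatedScheduler = "DedicatedScheduler@submit.example.org"

# Advertise the attribute in the machine ClassAd so the dedicated
# scheduler can find and claim this startd.
STARTD_ATTRS = $(STARTD_ATTRS), DedicatedScheduler

# Prefer jobs from the dedicated scheduler over any other jobs.
RANK = Scheduler =?= $(DedicatedScheduler)
```

After a condor_reconfig on the execute nodes, something like "condor_status -af Name DedicatedScheduler" should show whether the attribute is actually being advertised.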

 

You may need to be careful to pick an implementation of MPI that works well with your application; we can't help you there.

 

-greg