Re: [HTCondor-users] How to ask HTCondor to wait until all jobs finish in Vanilla Universe

Mailing List Archives Authenticated access	UW Madison Computer Sciences Department Computer Systems Lab

On Thu, Dec 29, 2022 at 8:25 PM Jason Patton via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:

Hi Gagan,

HTCondor's parallel scheduling is only possible by setting up and using the parallel universe, which requires the admin to set up execution points with a "dedicated scheduler" (https://htcondor.readthedocs.io/en/latest/users-manual/parallel-applications.html#parallel-jobs-and-the-dedicated-scheduler). However, if you know that you are only ever going to run your MPI jobs on one execution point at a time, the better way to handle that case is to make sure your execution points are set up to use partitionable slots, and then run each MPI job as a single job that requests multiple cores (e.g. request_cpus = 8). mpirun should be able to tell the number of cores your job has been given automatically, but if not, you can pass mpirun any of the environment variables that HTCondor sets to the number of cores (OMP_NUM_THREADS, etc.), e.g. "mpirun -np $OMP_NUM_THREADS my_mpi_job".
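As an illustration of the single multi-core job approach, a minimal submit description might look like the sketch below. It assumes a hypothetical wrapper script run_mpi.sh that runs "mpirun -np $OMP_NUM_THREADS my_mpi_job", and that my_mpi_job is available on the execution point; file names are placeholders, not taken from the thread.

```
# Sketch only: run_mpi.sh and my_mpi_job are hypothetical names.
# run_mpi.sh is assumed to call: mpirun -np $OMP_NUM_THREADS my_mpi_job
universe     = vanilla
executable   = run_mpi.sh

# One job that asks for all 8 cores; the partitionable slot
# carves out a single dynamic slot of this size for it.
request_cpus = 8

output       = mpi_job.out
error        = mpi_job.err
log          = mpi_job.log

queue
```

Because the whole MPI run is one HTCondor job, it should only finish when mpirun exits, i.e. after all ranks are done, rather than being torn down when the first of several separate jobs completes.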

Jason Patton

On Thu, Dec 29, 2022 at 5:41 AM gagan tiwari <gagan.tiwari@xxxxxxxxxxxxxxxxxx> wrote:
Hi Guys,
Please advise. Is this achievable using the Vanilla universe, or will I need to switch to the parallel universe?

Thanks,
Gagan

On Thu, Dec 29, 2022 at 12:12 PM gagan tiwari <gagan.tiwari@xxxxxxxxxxxxxxxxxx> wrote:
Hi Guys,
I have an execute server with 8 cores, and I am trying to run MPI jobs in the Vanilla Universe on that execute server with one job on each core.
I have been able to make them start successfully on that execute server by using the following attributes in the execute server's condor config:

NUM_SLOTS = 1
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = cpus=100%
SLOT_TYPE_1_PARTITIONABLE = True

But the issue is that condor doesn't wait for all jobs to finish: it kills all the jobs running on the different cores of that single execute server as soon as one of the jobs finishes.

I have tried using +ParallelShutdownPolicy = "WAIT_FOR_ALL" in the job submit file, but that didn't help either.

Could someone please help me fix this issue? It's a bit urgent.

Thanks,
Gagan

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/