
Re: [HTCondor-users] How to ask htcondor to wait till all jobs finished in Vanilla Universe



Hi Jason,
         Thanks for your email. If you look at my mail, this is exactly what I have set up. But the issue is that as soon as the main process exits, condor terminates the other processes running on the other cores. So, I need condor to wait until all jobs running on all cores of an execute host complete.

Thanks,
Gagan

On Thu, Dec 29, 2022 at 8:25 PM Jason Patton via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:
Hi Gagan,

HTCondor's parallel scheduling is only possible by setting up and using the parallel universe (which requires the admin to set up execution points with a "dedicated scheduler": https://htcondor.readthedocs.io/en/latest/users-manual/parallel-applications.html#parallel-jobs-and-the-dedicated-scheduler ). However, if you know that you are only ever going to run your MPI jobs on one execution point at a time, the better way to handle that case is to make sure your execution points are set up to use partitionable slots and then run your MPI jobs as single jobs that request multiple cores (e.g. request_cpus = 8). mpirun should be able to tell the number of cores your job has been given automatically, but if not, you can take any of the environment variables (OMP_NUM_THREADS, etc.) that HTCondor sets to the number of cores and pass it to mpirun, e.g. "mpirun -np $OMP_NUM_THREADS my_mpi_job".
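
As a rough illustration of the single-job approach, a submit description file might look something like the sketch below. The names run_mpi.sh, my_mpi_job, and mpi_job.* are placeholders, and the memory request is an assumed value. run_mpi.sh is a hypothetical wrapper script whose only job is to run "mpirun -np $OMP_NUM_THREADS ./my_mpi_job". The file transfer lines assume there is no shared filesystem between the submit and execute machines.

```
# Sketch only: run_mpi.sh, my_mpi_job and the mpi_job.* names are placeholders.
universe                = vanilla

# Hypothetical wrapper that runs: mpirun -np $OMP_NUM_THREADS ./my_mpi_job
executable              = run_mpi.sh

# One job that claims all 8 cores from the partitionable slot.
request_cpus            = 8

# Assumed value; adjust to what the MPI job actually needs.
request_memory          = 4GB

# Assumes no shared filesystem, so the MPI binary is transferred with the job.
should_transfer_files   = YES
transfer_input_files    = my_mpi_job
when_to_transfer_output = ON_EXIT

output                  = mpi_job.out
error                   = mpi_job.err
log                     = mpi_job.log

queue
```

Because this is one job rather than eight, HTCondor tracks only the wrapper process, and the job ends when mpirun itself returns after all of its ranks have exited.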

Jason Patton

On Thu, Dec 29, 2022 at 5:41 AM gagan tiwari <gagan.tiwari@xxxxxxxxxxxxxxxxxx> wrote:
Hi Guys,
          Please advise. Is this achievable using the Vanilla universe, or will I need to switch to the parallel universe?

Thanks,
Gagan

On Thu, Dec 29, 2022 at 12:12 PM gagan tiwari <gagan.tiwari@xxxxxxxxxxxxxxxxxx> wrote:
Hi Guys,
        I have an execute server with 8 cores and I am trying to run MPI jobs in the Vanilla universe on that execute server, with one job on each core.
I have been able to make them start successfully on that execute server by using the following settings in the execute server's condor config:

NUM_SLOTS = 1
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = cpus=100%
SLOT_TYPE_1_PARTITIONABLE = True

But the issue is that condor doesn't wait for all jobs to finish: it kills all the jobs running on the other cores of that single execute server as soon as one of the jobs finishes.

I have tried using +ParallelShutdownPolicy = "WAIT_FOR_ALL" in the job submit file, but that didn't help either.

Could someone please help me fix this issue? It's a bit urgent.

Thanks,
Gagan




_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/
_______________________________________________