#!/bin/sh
##**************************************************************
##
## Copyright (C) 1990-2012, Condor Team, Computer Sciences Department,
## University of Wisconsin-Madison, WI.
##
## Licensed under the Apache License, Version 2.0 (the "License"); you
## may not use this file except in compliance with the License. You may
## obtain a copy of the License at
##
##    http://www.apache.org/licenses/LICENSE-2.0
##
## Unless required by applicable law or agreed to in writing, software
## distributed under the License is distributed on an "AS IS" BASIS,
## WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
## See the License for the specific language governing permissions and
## limitations under the License.
##
##**************************************************************
_CONDOR_PROCNO=$_CONDOR_PROCNO
_CONDOR_NPROCS=$_CONDOR_NPROCS
CONDOR_SSH=`condor_config_val libexec`
CONDOR_SSH=$CONDOR_SSH/condor_ssh
SSHD_SH=`condor_config_val libexec`
SSHD_SH=$SSHD_SH/sshd.sh
. $SSHD_SH $_CONDOR_PROCNO $_CONDOR_NPROCS
# If this is not the head node, just wait for the background sshd
# started by sshd.sh, then clean up and exit
if [ $_CONDOR_PROCNO -ne 0 ]
then
wait
sshd_cleanup
exit 0
fi
EXECUTABLE=$1
shift
# The binary is transferred but the executable flag is cleared,
# so the script has to restore it
chmod +x $EXECUTABLE
# Set this to the bin directory of the MPICH installation
#MPDIR=/home/user/local/mpich2-install/bin
MPDIR=/nfs/dir1/
PATH=$MPDIR:.:$PATH
export PATH
export P4_RSHCOMMAND=$CONDOR_SSH
CONDOR_CONTACT_FILE=$_CONDOR_SCRATCH_DIR/contact
export CONDOR_CONTACT_FILE
# The second field in the contact file is the machine name
# that condor_ssh knows how to use
sort -n -k 1 < $CONDOR_CONTACT_FILE | awk '{print $2}' > machines
cat machines
## run the actual mpijob
mpirun -np $_CONDOR_NPROCS -machinefile machines $EXECUTABLE "$@"
# capture mpirun's exit status so the cleanup commands below do not overwrite it
MPIRUN_EXIT=$?
sshd_cleanup
rm -f machines
exit $MPIRUN_EXIT
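For context, mp1script above is normally used as the executable of a parallel universe job, with the real MPI program passed as its first argument. A minimal submit description might look like the sketch below; the program name my_mpi_program and the counts are placeholder assumptions, not values taken from this thread.

    # Sketch of a parallel universe submit file for mp1script
    # (my_mpi_program and the counts are placeholder assumptions)
    universe                = parallel
    executable              = mp1script
    arguments               = my_mpi_program
    machine_count           = 4
    request_cpus            = 1
    should_transfer_files   = yes
    when_to_transfer_output = on_exit
    transfer_input_files    = my_mpi_program
    queue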
Vikrant,

One way you can go about having parallel universe jobs fill slots on machines in a depth-first order is to have your machines advertise some sequence of numbers (one unique number per machine) in an attribute in the startd ClassAds and to use a rank expression in the submit file to target that attribute. For example, your execute machines could have...

condor_config.local on machineA:
    PARALLEL_RANK = 100
    STARTD_ATTRS = $(STARTD_ATTRS) PARALLEL_RANK

condor_config.local on machineB:
    PARALLEL_RANK = 99
    STARTD_ATTRS = $(STARTD_ATTRS) PARALLEL_RANK

condor_config.local on machineC:
    PARALLEL_RANK = 98
    STARTD_ATTRS = $(STARTD_ATTRS) PARALLEL_RANK

and then your parallel universe job's submit file could have...

    rank = TARGET.PARALLEL_RANK

The dedicated scheduler will try to match your job to slots where the rank expression is highest first, so machineA would have its slots filled first, then machineB, then machineC, and so on.

Jason

On Thu, Aug 6, 2020 at 12:06 AM <ervikrant06@xxxxxxxxx> wrote:

Hi Jason,

Thanks for your response. The problem is that with machine_count, HTCondor seems to fill the pool breadth first instead of depth first. To fill depth first we reduce machine_count and increase request_cpus, but that impacts the RANK count. I am looking for a way to fill the pool depth first without impacting RANK.

Thanks & Regards,
Vikrant Aggarwal

On Tue, Aug 4, 2020 at 8:45 PM Jason Patton <jpatton@xxxxxxxxxxx> wrote:

Hi Vikrant,

Despite its name, "machine_count" does not necessarily have to do with the number of physical/virtual machines that condor will schedule your job on. "machine_count" tells condor the total number of *slots* that the job should occupy. Suppose you have a job with machine_count = 4... if you have 4 open slots on a single machine in your condor pool, your entire "machine_count = 4" job may be scheduled on that single machine. In that case, mp1script will run mpirun with 4 CPU ranks, but all the ranks will be on a single machine.

(The name "machine_count" is a bit outdated, going back to the days when there was usually only one CPU core per machine in a typical condor pool.)

Hopefully this helps, though I may have misunderstood your question.

Jason Patton

On Tue, Aug 4, 2020 at 3:17 AM <ervikrant06@xxxxxxxxx> wrote:

Hello Experts,

I was not able to find information in the docs that could help me with my queries. Any input is highly appreciated.

Thanks & Regards,
Vikrant Aggarwal

On Wed, Jul 29, 2020 at 6:58 PM Vikrant Aggarwal <ervikrant06@xxxxxxxxx> wrote:

Hello Experts,

Any thoughts?

Thanks & Regards,
Vikrant Aggarwal

On Mon, Jul 27, 2020 at 4:24 PM Vikrant Aggarwal <ervikrant06@xxxxxxxxx> wrote:

Hello Condor Experts,

We are running parallel jobs in a cloud environment using the MPICH implementation and mp1script. We want to pack each parallel job onto the minimum number of hosts to reduce cloud cost. We have used machine_count and request_cpus to achieve this, but changing machine_count directly impacts the RANK of the jobs. We want to keep the RANK of jobs at a higher value; to be honest, I am not sure what the advantage of that is, so please enlighten me if anyone has information about the usage of RANK.

While going through the documentation I found:

    The macro $(Node) is similar to the MPI rank construct

How could we achieve both keeping the MPI jobs on a minimal number of hosts and keeping a higher RANK value?
Regards,
Vikrant Aggarwal
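Combining Jason's PARALLEL_RANK suggestion with a submit file for mp1script, a sketch that keeps the desired total number of ranks while preferring to fill the highest-ranked machines first could look like the following; my_mpi_program and the counts are placeholders, and PARALLEL_RANK must be advertised by the startds as described in Jason's reply.

    # Sketch: prefer slots on machines advertising the highest PARALLEL_RANK
    # (my_mpi_program and the counts are placeholder assumptions)
    universe      = parallel
    executable    = mp1script
    arguments     = my_mpi_program
    machine_count = 8
    request_cpus  = 1
    rank          = TARGET.PARALLEL_RANK
    queue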
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/