[Condor-users] MPI jobs not executing
- Date: Sun, 5 Feb 2006 17:02:54 -0800 (PST)
- From: "Junaid N. Sahibzada" <sjunaidn@xxxxxxxxx>
- Subject: [Condor-users] MPI jobs not executing
Hi all,
I will give a step-by-step account of what I have done so that you can tell me where I am making a mistake.
1. I changed the local config files of all the compute nodes, so all the dedicated nodes now have the following local config file:
CONDOR_HOST = caudate-nh.nsw.cmis.csiro.au
RELEASE_DIR = /usr/local/condor
LOCAL_DIR = /home/condor/
CONDOR_ADMIN = root@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
UID_DOMAIN = nsw.cmis.csiro.au
FILESYSTEM_DOMAIN = nsw.cmis.csiro.au
CONDOR_IDS = 000.0
DedicatedScheduler = "DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxxxxxx"
LOCK = /tmp/condor-lock.$(HOSTNAME)0.874095049061911
DAEMON_LIST = MASTER, SCHEDD, STARTD
JAVA = /usr/bin/java
##--------------------------------------------------------------------
## 2) Always run jobs, but prefer dedicated ones
##--------------------------------------------------------------------
START = True
SUSPEND = False
CONTINUE = True
PREEMPT = False
KILL = False
WANT_SUSPEND = False
WANT_VACATE = False
RANK = Scheduler =?= $(DedicatedScheduler)
MPI_CONDOR_RSH_PATH = $(LIBEXEC)
CONDOR_SSHD = /usr/sbin/sshd
CONDOR_SSH_KEYGEN = /usr/bin/ssh-keygen
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler
The dedicated submit machine has the following local config file:
## What machine is your central manager?
CONDOR_HOST = caudate-nh.nsw.cmis.csiro.au
## Pathnames:
## Where have you installed the bin, sbin and lib condor directories?
RELEASE_DIR = /usr/local/condor
## Where is the local condor directory for each host?
## This is where the local config file(s), logs and
## spool/execute directories are located
LOCAL_DIR = /home/condor/
## Mail parameters:
## When something goes wrong with condor at your site, who should get
## the email?
CONDOR_ADMIN = root@xxxxxxxxxxxxxxxxxxxxxxxxxxxx
## Network domain parameters:
## Internet domain of machines sharing a common UID space. If your
## machines don't share a common UID space, set it to
## UID_DOMAIN = $(FULL_HOSTNAME)
## to specify that each machine has its own UID space.
UID_DOMAIN = nsw.cmis.csiro.au
## Internet domain of machines sharing a common file system.
## If your machines don't use a network file system, set it to
## FILESYSTEM_DOMAIN = $(FULL_HOSTNAME)
## to specify that each machine has its own file system.
FILESYSTEM_DOMAIN = nsw.cmis.csiro.au
## The user/group ID <uid>.<gid> of the "Condor" user.
## (this can also be specified in the environment)
## Note: the CONDOR_IDS setting is ignored on Win32 platforms
CONDOR_IDS = 000.0
LOCK = /tmp/condor-lock.$(HOSTNAME)0.597654629106732
## condor_master
## Daemons you want the master to keep running for you:
DAEMON_LIST = COLLECTOR, MASTER, NEGOTIATOR, SCHEDD, STARTD
DedicatedScheduler = "DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxxo.au"
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler
## Java parameters:
## If you would like this machine to be able to run Java jobs,
## then set JAVA to the path of your JVM binary. If you are not
## interested in Java, there is no harm in leaving this entry
## empty or incorrect.
JAVA = /usr/bin/java
UNUSED_CLAIM_TIMEOUT = 600
START = Owner == "sah006" || Owner == "condor"
SUSPEND = False
CONTINUE = True
PREEMPT = False
KILL = False
These are the changes I have made to the compute nodes and to the dedicated submit node.
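To see at a glance where the two local config files differ, a small script along these lines could help. It is only a sketch: the helper names parse_config and diff_keys are made up for illustration, and it handles plain NAME = value lines only, with no macro expansion and none of the rest of Condor's config syntax. The sample text is a shortened excerpt of the settings listed above.

```python
def parse_config(text):
    """Parse simple NAME = value lines; skip blank lines and '#' comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first '=' only, so expressions such as
        # 'Scheduler =?= $(DedicatedScheduler)' stay intact in the value.
        name, _, value = line.partition("=")
        # Names are uppercased so lookups do not depend on spelling case.
        settings[name.strip().upper()] = value.strip()
    return settings


def diff_keys(a, b, keys):
    """Return {key: (value_in_a, value_in_b)} for keys whose values differ."""
    out = {}
    for key in keys:
        key = key.upper()
        if a.get(key) != b.get(key):
            out[key] = (a.get(key), b.get(key))
    return out


# Shortened excerpts of the two files (assumption: simple lines only).
compute_node = """
UID_DOMAIN = nsw.cmis.csiro.au
START = True
RANK = Scheduler =?= $(DedicatedScheduler)
"""

submit_node = """
## Network domain parameters:
UID_DOMAIN = nsw.cmis.csiro.au
START = Owner == "sah006" || Owner == "condor"
"""

differences = diff_keys(
    parse_config(compute_node),
    parse_config(submit_node),
    ["UID_DOMAIN", "START", "RANK"],
)
for key, (node_value, submit_value) in differences.items():
    print(f"{key}: compute={node_value!r} submit={submit_value!r}")
```

A value of None means the setting is absent from that file. The script only reports differences; whether a given difference matters is a separate question.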
I have submitted MPI jobs, but they are not being executed.
Here is my submit file:
universe = parallel
executable = a.out
log = logfile
error = log.error
output = log.output
machine_count = 4
queue
The program is a simple one that I copied from a website. It compiles and runs perfectly from the command line.
Can anyone tell me what the problem is?
Also, do I have to start an mpd ring before I send jobs to Condor?
I have tried both ways, and neither works.
Regards,
Junaid N. Sahibzada
Cell # (+61) 404 998 494
284/9 Crystal St. Waterloo, 2017, NSW, Australia
International Student MSc Internetworking, UTS, Australia
Bachelor of Information Technology, NUST, Pakistan