Mailing List Archives
Re: [Condor-users] Not managing to get the parallel universe example from manual section "2.11.2 Parallel Job Submission" to run
- Date: Fri, 10 Feb 2006 21:39:44 +0000
- From: Jean-Alain Grunchec <jgrunche@xxxxxxxxxxxxxxxxxx>
- Subject: Re: [Condor-users] Not managing to get the parallel universe example from manual section "2.11.2 Parallel Job Submission" to run
Hello,
I was asked by someone who ran into a similar issue how this problem
was solved.
I assumed that I only needed one computer to run a single node (i.e.
that the machine running the dedicated scheduler would run a node
itself). This was wrong: in order to run even a single node, you need
two machines.
The machine running the dedicated scheduler uses a 'normal'
condor_config.local file. It will submit the jobs but will not run a
node locally.
Something like this (you don't need all the commented lines):
## DedicatedScheduler = "DedicatedScheduler@xxxxxxxxxxxxxxx"
START = True
SUSPEND = False
CONTINUE = True
PREEMPT = False
KILL = False
WANT_SUSPEND = False
WANT_VACATE = False
HIGHPORT = 9700 #Required by my firewall
LOWPORT = 9600 #Required by my firewall
UNUSED_CLAIM_TIMEOUT = 600 # this comes from the example file
# /usr/local/condor/etc/examples/condor_config.local.dedicated.submit
## RANK = Scheduler =?= $(DedicatedScheduler)
## STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler
This isn't the entire file, but the rest was configured when Condor
was installed and hasn't been altered since.
The second machine (which will run a node) needs to be configured so
that it uses the dedicated scheduler on the first machine. Its
condor_config.local file looks like this (see the example
/usr/local/condor/etc/examples/condor_config.local.dedicated.resource):
DedicatedScheduler = "DedicatedScheduler@xxxxxxxxxxxxxxx"
START = True
SUSPEND = False
CONTINUE = True
PREEMPT = False
KILL = False
WANT_SUSPEND = False
WANT_VACATE = False
RANK = Scheduler =?= $(DedicatedScheduler)
HIGHPORT = 9700 #Required by my firewall
LOWPORT = 9600 #Required by my firewall
MPI_CONDOR_RSH_PATH = $(LIBEXEC)
CONDOR_SSHD = /usr/sbin/sshd
CONDOR_SSH_KEYGEN = /usr/bin/ssh-keygen
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler
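A quick way to check that the second machine is actually advertising the DedicatedScheduler attribute is to query its startd ClassAd (the hostname below is a placeholder, not mine):

```
# Dump the startd's full ClassAd and look for the DedicatedScheduler
# attribute (node2.example.com is a hypothetical hostname):
condor_status -l node2.example.com | grep DedicatedScheduler
```

If the attribute doesn't show up, the dedicated scheduler will never match the machine.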
That was everything on the Condor side, if I remember correctly
(Condor 6.7.14 on Fedora Core 4). I was then able to run the initial
script:
universe = parallel
executable = /bin/cat
log = logfile
input = infile.$(NODE)
output = outfile.$(NODE)
error = errfile.$(NODE)
machine_count = 1
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
queue
So that worked with Condor, but the job sat idle for something like
5 minutes before starting. To speed this up I changed this:
NEGOTIATOR_INTERVAL = 61 # was 300
and the job now starts much more quickly.
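A note on applying such a change (a sketch; assumes a standard Condor install): the daemons can be told to re-read their configuration without a restart:

```
# Re-read the Condor configuration files on the local machine
# after editing condor_config.local (no daemon restart required):
condor_reconfig
```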
I also managed to run an MPI job on a single node with MPICH 1.2.4
(I think simplempi is one of the provided MPI examples):
######################################
## MPI example submit description file
######################################
universe = MPI
executable = simplempi
log = logfile
input = infile.$(NODE)
output = outfile.$(NODE)
error = errfile.$(NODE)
machine_count = 1
should_transfer_files = yes
when_to_transfer_output = on_exit
queue
I still have not run these scripts on several nodes.
Currently I am trying to get the parallel universe to run this MPI
example (it seems this would allow the use of LAM and newer versions
of MPI):
######################################
## Example submit description file
## for MPICH 1 MPI
## works with MPICH 1.2.4, 1.2.5 and 1.2.6
######################################
universe = parallel
executable = mp1script
arguments = simplempi
log = logfile
input = infile.$(NODE)
output = outfile.$(NODE)
error = errfile.$(NODE)
machine_count = 1
should_transfer_files = yes
when_to_transfer_output = on_exit
queue
Unfortunately the job starts 'running' but then blocks. For some
reason it opens some connections but does not seem to recognize them
(and then tries the next new port, again and again). I looked through
the files to find out what might be the reason for this. In
/usr/local/condor/libexec/sshd.sh there is a line like this:
if grep "^Server listening on 0.0.0.0 port" sshd.out > /dev/null 2>&1
I replaced it with:
if grep "Server listening on :: port" sshd.out > /dev/null 2>&1
I am not sure at all whether this was a typo, but I had the version
with the '^' and the IPv4 address on both computers; on my machines
sshd apparently reports an IPv6 listening address, so the original
pattern never matched.
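A minimal sketch of why the pattern matters, with hypothetical sshd.out contents (sshd logs an IPv4 or an IPv6 listening address depending on how the host is configured):

```shell
# Two hypothetical sshd.out files: one with the IPv4 listening line,
# one with the IPv6 form that sshd logs on an IPv6-enabled host.
printf 'Server listening on 0.0.0.0 port 22.\n' > sshd_v4.out
printf 'Server listening on :: port 22.\n'      > sshd_v6.out

# The original anchored pattern matches only the IPv4 form ...
grep -q "^Server listening on 0.0.0.0 port" sshd_v4.out && echo v4-matched
grep -q "^Server listening on 0.0.0.0 port" sshd_v6.out || echo v6-missed

# ... while the relaxed pattern matches the IPv6 form.
grep -q "Server listening on :: port" sshd_v6.out && echo v6-matched
```

So on an IPv6 host the original check in sshd.sh never sees the "Server listening" line, and the script keeps retrying ports.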
The next problem is that simplempi does not seem to be transferred
to the temporary folder of the remote node, so there is an error
(can't find the executable). I am not sure whether there is a nice
way, or a few lines to add, which would transfer the executable.
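One thing that might help (untested on my side, just a sketch): Condor's file-transfer mechanism can be told to carry extra input files explicitly, so listing the MPI binary in the submit description file alongside the wrapper script might get it onto the remote node:

```
######################################
## Sketch: explicitly transfer the
## MPI binary with the mp1script job
######################################
universe = parallel
executable = mp1script
arguments = simplempi
transfer_input_files = simplempi
log = logfile
input = infile.$(NODE)
output = outfile.$(NODE)
error = errfile.$(NODE)
machine_count = 1
should_transfer_files = yes
when_to_transfer_output = on_exit
queue
```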