[condor-users] file transferring (or not) in vanilla and mpi universes
- Date: Thu, 4 Mar 2004 17:45:09 -0000
- From: "Kewley, J (John)" <J.Kewley@xxxxxxxx>
- Subject: [condor-users] file transferring (or not) in vanilla and mpi universes
I am running Condor 6.6.1 (6.6.0 had same "problems") in my pool:
Linux 7.3 cluster comprising:
* dual headnode
* 8x single workernodes
The headnode is configured as the dedicated scheduler for the cluster; it is
set up (as in the manual) to run opportunistic jobs as well. The headnode
is also the master for the Condor pool (the negotiator and collector run here).
There are no other schedulers in the pool.
Start daemons are currently running on all nodes (incl. the headnode).
Users on the headnode do not necessarily have a user account on the
workers.
Some filestore is shared between the headnode and the workers.
* /home/* (including headnode users and condor)
* /opt/<some> (including condor and mpi)
I have the following settings on my headnode condor_config.local:
FILESYSTEM_DOMAIN = ibmcluster
UID_DOMAIN = $(HOSTNAME).dl.ac.uk
I have the following settings on my workernodes' condor_config.local:
FILESYSTEM_DOMAIN = ibmcluster
UID_DOMAIN = $(HOSTNAME).ibmcluster
I have then done some experiments to get to grips with
the file transferring options for vanilla and mpi universes.
I tried varying the following:
* whether the lines:
    should_transfer_files = YES
    when_to_transfer_output = ON_EXIT
  were commented out or not,
* whether the log, output and error files existed before
  submitting, and
* if the log, output and error files already existed, whether
  they had rw (666) permissions for group+world as well as owner.
As I had a shared filesystem, I thought that I would
be able to use file transfer, or just leave them where they were.
Here are the results:
=============================================
VANILLA
  File transfer requested
    log, etc files: exist 666 perms - SUCCESS
    log, etc files: exist 644 perms - SUCCESS
    log, etc files: don't exist     - SUCCESS
  File transfer not requested
    log, etc files: exist 666 perms - SUCCESS
    log, etc files: exist 644 perms - FAILURE
      (log file:)
        Error from starter on node3.ibmcluster:
        Failed to open standard output file
        '/home/jktest/test/vanilla/2.output':
        Permission denied (errno 13)
        [ad inf]
    log, etc files: don't exist     - FAILURE
      First it creates all the files correctly, but leaves them as
      644, then it fails as follows:
      (log file:)
        Error from starter on node3.ibmcluster:
        Failed to open standard output file
        '/home/jktest/test/vanilla/2.output':
        Permission denied (errno 13)
        [ad inf]
OK, so far no major surprises, except that it would be nice
if Condor didn't create files with permissions that leave it
no chance of writing to them later. BTW, it made no
difference whether the user had an account on the other machine (presumably
because I had set the 2 UID_DOMAINs to be separate).
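The two "not requested" failures above are what ordinary Unix permission checks would predict if the job on the worker runs as a user other than the owner of the files. That the job runs under a different account when the UID_DOMAINs differ is an assumption here, and the uid/gid numbers below are made up for illustration. A minimal Python sketch of the check (ignoring root and ACLs):

```python
import stat

def may_write(mode, owner_uid, owner_gid, uid, gids):
    """Return True if a non-root process (uid, gids) may open the file for writing.

    Exactly one permission class applies: owner, else group, else other.
    """
    if uid == owner_uid:
        return bool(mode & stat.S_IWUSR)
    if owner_gid in gids:
        return bool(mode & stat.S_IWGRP)
    return bool(mode & stat.S_IWOTH)

# Hypothetical ids: the submitter owns the files, the job runs as someone else.
SUBMITTER_UID, SUBMITTER_GID = 1000, 1000
JOB_UID, JOB_GIDS = 65534, {65534}

for mode in (0o666, 0o644):
    ok = may_write(mode, SUBMITTER_UID, SUBMITTER_GID, JOB_UID, JOB_GIDS)
    print(oct(mode), "writable by job user:", ok)
```

With these ids, 666 is writable by the job user and 644 is not, matching the SUCCESS/FAILURE pattern above. Files that Condor creates itself as 644 fall into the second case.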
I then expected my MPI programs to behave in a similar way,
...
=============================================
MPI
  File transfer requested
    log, etc files: exist 666 perms - FAILURE
    log, etc files: exist 644 perms - FAILURE
    log, etc files: don't exist     - FAILURE
    All fail in the same way:
      they appear to have succeeded (according to the email message),
      but the output files are not right. Condor
      initially sets 0.output and 0.error to empty,
      leaving permissions as before, and
      eventually creates:
        ---xr--r-- 1 jmk27 dlarcg 0 Feb 18 14:25 #MpInOdE#.error
        -rw---x--- 1 jmk27 dlarcg 51 Feb 18 14:25 #MpInOdE#.output
        --w---x--- 1 jmk27 dlarcg 51 Feb 18 14:25 #MpInOdE#.output
        -r----x--- 1 jmk27 dlarcg 51 Feb 18 14:25 #MpInOdE#.output
        -r----x--T 1 jmk27 dlarcg 51 Feb 18 14:25 #MpInOdE#.output
        --w---x--T 1 jmk27 dlarcg 51 Feb 18 14:25 #MpInOdE#.output
      and then completes. The above output file is consistent with
      one of the jobs having completed successfully (presumably all
      write to the same file).
  File transfer not requested
    log, etc files: exist 666 perms - SUCCESS
    log, etc files: exist 644 perms - FAILURE
      as analogous vanilla test above
    log, etc files: don't exist     - FAILURE
      as analogous vanilla test above
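To compare the odd-looking permission strings in the MPI listing with the 644 and 666 modes used elsewhere in these tests, they can be converted to octal. The helper below is a hypothetical convenience written for this purpose, not part of Condor; it handles the usual ls notation, including the s/S and t/T special-bit characters.

```python
def mode_from_ls(perm):
    """Convert a 10-character ls permission string to numeric mode bits."""
    if len(perm) != 10:
        raise ValueError("expected 10 characters, e.g. '-rw-r--r--'")
    rwx = perm[1:]          # drop the file-type character
    bits = 0
    # Nine ordinary permission bits, most significant (owner read) first.
    # Lowercase s/t mean the execute bit is set together with a special bit.
    for i, ch in enumerate(rwx):
        if ch in "rwxst":
            bits |= 1 << (8 - i)
    # Special bits are displayed in the three execute columns.
    if rwx[2] in "sS":
        bits |= 0o4000      # setuid
    if rwx[5] in "sS":
        bits |= 0o2000      # setgid
    if rwx[8] in "tT":
        bits |= 0o1000      # sticky
    return bits

listing = ["---xr--r--", "-rw---x---", "--w---x---",
           "-r----x---", "-r----x--T", "--w---x--T"]
for perm in listing:
    print(perm, oct(mode_from_ls(perm)))
```

The six strings come out as 0o144, 0o610, 0o210, 0o410, 0o1410 and 0o1210; none of them is 644 or 666.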
=============================================
submit files
-----------------------------
UNIVERSE = vanilla
EXECUTABLE = vanillatest
REQUIREMENTS = ( OpSys == "LINUX" )
LOG = $(UNIVERSE)/log
ERROR = $(UNIVERSE)/$(PROCESS).error
INPUT = $(UNIVERSE)/$(PROCESS).input
OUTPUT = $(UNIVERSE)/$(PROCESS).output
# Following 2 lines are not needed for a shared filesystem
# as long as either:
# a) output and error files already exist with 666 permissions, or
# b) (presumably) same uid_domain
#SHOULD_TRANSFER_FILES = YES
#WHEN_TO_TRANSFER_OUTPUT = ON_EXIT
QUEUE 4
-----------------------------
UNIVERSE = mpi
EXECUTABLE = mpitest
REQUIREMENTS = ( OpSys == "LINUX" )
LOG = $(UNIVERSE)/log
ERROR = $(UNIVERSE)/$(NODE).error
INPUT = $(UNIVERSE)/$(NODE).input
OUTPUT = $(UNIVERSE)/$(NODE).output
MACHINE_COUNT = 4
SHOULD_TRANSFER_FILES = YES
WHEN_TO_TRANSFER_OUTPUT = ON_EXIT
QUEUE
-----------------------------
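Condition (a) in the vanilla submit file comment (output and error files already existing with 666 permissions) can be arranged before submitting. The following is a minimal Python sketch of one way to do that; the function names are hypothetical, and the file names simply follow the log, N.output and N.error pattern of the submit files above.

```python
import os
import stat

def job_files(count):
    """Names used by the submit files above: log, N.output, N.error."""
    names = ["log"]
    for n in range(count):
        names += [f"{n}.output", f"{n}.error"]
    return names

def precreate(directory, names, mode=0o666):
    """Create each file if missing and force its permission bits to `mode`."""
    os.makedirs(directory, exist_ok=True)
    paths = []
    for name in names:
        path = os.path.join(directory, name)
        # Append mode creates the file without truncating existing content.
        with open(path, "a"):
            pass
        # chmod explicitly: a mode given at creation time is filtered by umask.
        os.chmod(path, mode)
        paths.append(path)
    return paths

def show_modes(paths):
    """Print each path with its permission bits in octal."""
    for p in paths:
        print(p, oct(stat.S_IMODE(os.stat(p).st_mode)))
```

For the vanilla test, calling `show_modes(precreate("vanilla", job_files(4)))` from the submit directory before submitting would create the log and the four output/error pairs with rw permissions for owner, group and world.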
So, finally (sorry!):
* Is the above behaviour what people would expect for the semantics
of file transfer and non file transfer modes?
* Should there be a difference in this between mpi and vanilla universes?
* Why are the 0.output and 0.error files created, but not the others,
and why aren't they written to?
Cheers
JK
John Kewley