
Re: [Condor-users] MPI Job not transfering output back.



Hey everyone,

I ran into a bit of a hitch and wanted to see if anyone has a solution to this
problem. I'm trying to run an MPI job, and it does run, but the output files
it generates are not the ones that get copied back. As an example, I wrote a
simple submit file that looks like the following:

[condor@panndaa execute]$ more john_submit
universe = MPI
executable = john
arguments= -test
error = meow$(NODE).err
output = meow$(NODE).txt
log = meow.log
machine_count = 2
should_transfer_files = yes
when_to_transfer_output = on_exit
transfer_input_files = john, john.conf
queue
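
To spell out which files I expect back: with machine_count = 2, the $(NODE)
macro in the output and error lines should give one file per node, numbered
from 0 (that matches the procgroup file and directory listings in this post).
A minimal Python sketch of that expansion follows; expand_node_macro is a
hypothetical helper written for illustration, not part of Condor.

```python
def expand_node_macro(template, machine_count):
    """Return one filename per node, replacing $(NODE) with 0..machine_count-1.

    Illustration only: this mimics the naming seen in the directory listing,
    it is not how Condor itself performs the substitution.
    """
    return [template.replace("$(NODE)", str(node)) for node in range(machine_count)]


# Values taken from the submit file above.
expected_files = (
    expand_node_macro("meow$(NODE).txt", 2)
    + expand_node_macro("meow$(NODE).err", 2)
)
print(expected_files)
```

So four files in total should come back: an output and an error file for each
of the two nodes.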

Everything looks normal. When I run the job, it creates a subdirectory in my
execute directory and starts populating my output file.

[condor@panndaa execute]$ ls
dir_20260  john.conf    meow0.err  meow1.err  meow.log   procgroup.125.0
john       john_submit  meow0.txt  meow1.txt  mp1script

[condor@panndaa execute]$ cd dir_20260/

[condor@panndaa dir_20260]$ ls
condor_exec.exe  john  john.conf  meow0.err  meow0.txt  procgroup.125.0

[condor@panndaa dir_20260]$ more procgroup.125.0
panndaa.NMSU.Edu 0 condor_exec condor
tango.NMSU.Edu 1 condor_exec condor

[condor@panndaa dir_20260]$ more meow0.txt
Benchmarking: Traditional DES [64/64 BS MMX]... DONE
Many salts:     nan c/s real, 1156210.00 c/s virtual

Only one salt:  962521.00 c/s real, 980710.00 c/s virtual

Benchmarking: BSDI DES (x725) [64/64 BS MMX]... DONE
Many salts:     nan c/s real, 39262.00 c/s virtual

Only one salt:  37928.00 c/s real, 38636.00 c/s virtual

Benchmarking: FreeBSD MD5 [32/32]... DONE
Raw:    8769.00 c/s real, 8895.00 c/s virtual


Benchmarking: OpenBSD Blowfish (x32) [32/32]... DONE
Raw:    580.00 c/s real, 587.00 c/s virtual


Benchmarking: Kerberos AFS DES [48/64 4K MMX]...


Now here is where it gets weird: when the job is over, my output files are
completely empty.

[condor@panndaa execute]$ dir
john  john.conf  john_submit  meow0.err  meow0.txt  meow1.err  meow1.txt  meow.log  mp1script
[condor@panndaa execute]$ ls -la
total 512
drwxrwxrwt   2 condor condor   4096 Mar 13 21:30 .
drwx------  21 condor condor   4096 Mar 13 21:24 ..
-rwxrwxr-x   1 condor condor 432800 Mar 13 01:26 john
-rw-r--r--   1 condor condor  11755 Mar 13 01:26 john.conf
-rw-r--r--   1 condor condor    240 Mar 13 21:24 john_submit
-rw-rw-r--   1 condor condor      0 Mar 13 21:29 meow0.err
-rw-rw-r--   1 condor condor      0 Mar 13 21:29 meow0.txt
-rw-r--r--   1 condor condor      0 Mar 13 21:30 meow1.err
-rw-r--r--   1 condor condor      0 Mar 13 21:30 meow1.txt
-rw-rw-r--   1 condor condor   3382 Mar 13 21:30 meow.log
-rwxr-xr-x   1 condor condor   1048 Mar 13 01:26 mp1script
-rw-r--r--   1 condor condor    230 Mar 13 03:10 .ssh_host_rsa_key.
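
For anyone trying to reproduce this, a quick way to list which result files
came back with zero bytes is sketched below in Python. The function name
find_empty_outputs and the "meow*" pattern are assumptions based on the file
names in my submit file; adjust them for your own job.

```python
import glob
import os


def find_empty_outputs(directory, pattern="meow*"):
    """Return the sorted names of zero-byte regular files matching pattern."""
    empty = []
    for path in glob.glob(os.path.join(directory, pattern)):
        # Skip directories and anything else that is not a regular file.
        if os.path.isfile(path) and os.path.getsize(path) == 0:
            empty.append(os.path.basename(path))
    return sorted(empty)
```

Calling find_empty_outputs(".") from the submit directory after the job
finishes should report the same files that show a size of 0 in the listing
above.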

Does anyone know what's going on?

Thanks,

Danny N
NMSU