[Condor-users] New to Condor, Need to RUN MPI
- Date: Fri, 30 Jan 2009 13:14:24 -0500
- From: Samir Khanal <skhanal@xxxxxxxx>
- Subject: [Condor-users] New to Condor, Need to RUN MPI
Hi,
I am using and configuring Condor for the first time and am trying to get a sample MPI application to work on my cluster
(Rocks 5.1 with Condor).
I was able to get the application to work under PBS/Torque, but I am having a hard time configuring Condor for MPI.
I have changed condor_config.local on compute-0-0 to make it the MPI machine with a Dedicated Scheduler.
condor_status shows:
---------------------
Name OpSys Arch State Activity LoadAv Mem ActvtyTime
slot1@compute-0-0. LINUX X86_64 Owner Idle 0.000 954 0+00:02:27
slot2@compute-0-0. LINUX X86_64 Owner Idle 0.000 954 0+00:02:28
slot3@compute-0-0. LINUX X86_64 Owner Idle 0.000 954 0+00:02:29
slot4@compute-0-0. LINUX X86_64 Owner Idle 0.000 954 0+00:02:30
slot1@compute-0-1. LINUX X86_64 Unclaimed Idle 0.000 954 1+12:29:53
slot2@compute-0-1. LINUX X86_64 Unclaimed Idle 0.000 954 0+00:10:05
slot3@compute-0-1. LINUX X86_64 Unclaimed Idle 0.000 954 1+23:56:07
slot4@compute-0-1. LINUX X86_64 Unclaimed Idle 0.000 954 1+23:56:08
slot1@compute-0-2. LINUX X86_64 Unclaimed Idle 0.000 954 0+00:05:04
slot2@compute-0-2. LINUX X86_64 Unclaimed Idle 0.000 954 1+23:51:08
slot3@compute-0-2. LINUX X86_64 Unclaimed Idle 0.000 954 1+23:51:09
slot4@compute-0-2. LINUX X86_64 Unclaimed Idle 0.000 954 1+23:51:10
slot1@compute-0-3. LINUX X86_64 Unclaimed Idle 0.000 954 1+12:30:23
slot2@compute-0-3. LINUX X86_64 Unclaimed Idle 0.010 954 0+00:05:05
slot3@compute-0-3. LINUX X86_64 Unclaimed Idle 0.000 954 1+23:51:10
slot4@compute-0-3. LINUX X86_64 Unclaimed Idle 0.000 954 1+23:51:11
slot1@compute-0-4. LINUX X86_64 Unclaimed Idle 0.000 954 1+12:27:23
slot2@compute-0-4. LINUX X86_64 Unclaimed Idle 0.000 954 0+00:00:00
slot3@compute-0-4. LINUX X86_64 Unclaimed Idle 0.000 954 1+23:46:11
slot4@compute-0-4. LINUX X86_64 Unclaimed Idle 0.000 954 1+23:46:12
slot1@compute-0-5. LINUX X86_64 Unclaimed Idle 0.010 954 0+00:00:00
slot2@compute-0-5. LINUX X86_64 Unclaimed Idle 0.000 954 1+23:46:06
slot3@compute-0-5. LINUX X86_64 Unclaimed Idle 0.000 954 1+23:46:07
slot4@compute-0-5. LINUX X86_64 Unclaimed Idle 0.000 954 1+23:46:08
My job file:
----------------
universe = MPI
executable = /home/skhanal/condor/bones
log = logfile
output = outfile.$(NODE)
error = errfile.$(NODE)
machine_count = 2
should_transfer_files = yes
when_to_transfer_output = on_exit
queue
When submitted, the job goes into the "R" (running) state and ends with the following messages in the output and log files.
Output.0 says
------------------------
p0_4788: p4_error: Child process exited while making connection to remote process on compute-0-0.local: 0
p0_4788: (6.007812) net_send: could not write to fd=4, errno = 32
and output.1 says
-------------------------
rm_4794: (-) net_recv failed for fd = 3
rm_4794: p4_error: net_recv read, errno = : 104
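
For reference, the two errno values in these messages can be looked up. The sketch below is only a small hand-written table, assuming Linux errno numbering (the nodes report OpSys LINUX in condor_status); the helper name describe_errno is hypothetical and not part of Condor or MPICH.

```python
# Minimal lookup for the errno values seen in the p4 messages.
# Assumption: Linux errno numbering; other operating systems use
# different numbers for the same error names.
LINUX_ERRNO = {
    32: ("EPIPE", "Broken pipe"),
    104: ("ECONNRESET", "Connection reset by peer"),
}


def describe_errno(code):
    """Return a readable description of a Linux errno value (hypothetical helper)."""
    name, text = LINUX_ERRNO.get(code, ("UNKNOWN", "not in this small table"))
    return f"{name} ({code}): {text}"


for code in (32, 104):
    print(describe_errno(code))
```

Both values are consistent with one side of the connection going away while the other was still reading or writing; they do not by themselves say why the remote process exited.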
the logfile says
----------------------------
000 (029.000.000) 01/30 12:53:45 Job submitted from host: <129.1.64.81:39320>
...
014 (029.000.000) 01/30 12:53:48 Node 0 executing on host: <10.1.255.254:54415>
...
014 (029.000.001) 01/30 12:53:49 Node 1 executing on host: <10.1.255.254:54415>
...
015 (029.000.000) 01/30 12:53:54 Node 0 terminated.
(1) Normal termination (return value 1)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
173 - Run Bytes Sent By Node
1919489 - Run Bytes Received By Node
173 - Total Bytes Sent By Node
1919489 - Total Bytes Received By Node
...
015 (029.000.001) 01/30 12:53:54 Node 1 terminated.
(1) Normal termination (return value 139)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
89 - Run Bytes Sent By Node
1919329 - Run Bytes Received By Node
89 - Total Bytes Sent By Node
1919329 - Total Bytes Received By Node
...
005 (029.000.000) 01/30 12:53:54 Job terminated.
(1) Normal termination (return value 1)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
262 - Run Bytes Sent By Job
3838818 - Run Bytes Received By Job
262 - Total Bytes Sent By Job
3838818 - Total Bytes Received By Job
--------------------------------------------------------------------
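
To make the log easier to read, here is a rough Python sketch that pulls the host and return value for each node out of an excerpt of the log above. The regular expressions and function names are illustrative assumptions based only on the lines shown here, not a general parser for Condor user logs. Reading a return value above 128 as "128 + signal number" is the common shell convention; whether that convention applies to how this job's return value was reported is an assumption.

```python
import re
import signal

# Excerpt copied from the log above (usage lines omitted).
LOG_EXCERPT = """\
014 (029.000.000) 01/30 12:53:48 Node 0 executing on host: <10.1.255.254:54415>
014 (029.000.001) 01/30 12:53:49 Node 1 executing on host: <10.1.255.254:54415>
015 (029.000.000) 01/30 12:53:54 Node 0 terminated.
(1) Normal termination (return value 1)
015 (029.000.001) 01/30 12:53:54 Node 1 terminated.
(1) Normal termination (return value 139)
"""

EXEC_RE = re.compile(r"Node (\d+) executing on host: <([^>]+)>")
TERM_RE = re.compile(r"Node (\d+) terminated\.")
RV_RE = re.compile(r"return value (\d+)")


def summarize(log_text):
    """Return ({node: host}, {node: return value}) for the lines shown."""
    hosts, values, current = {}, {}, None
    for line in log_text.splitlines():
        match = EXEC_RE.search(line)
        if match:
            hosts[int(match.group(1))] = match.group(2)
            continue
        match = TERM_RE.search(line)
        if match:
            # Remember which node the next "return value" line belongs to.
            current = int(match.group(1))
            continue
        match = RV_RE.search(line)
        if match and current is not None:
            values[current] = int(match.group(1))
            current = None
    return hosts, values


def describe_return_value(value):
    """Interpret a return value using the 128 + signal convention (assumed)."""
    if value > 128:
        number = value - 128
        try:
            return f"{value} = 128 + {number} ({signal.Signals(number).name})"
        except ValueError:
            return f"{value} = 128 + {number} (unknown signal)"
    return f"{value} (ordinary exit code)"


hosts, values = summarize(LOG_EXCERPT)
for node in sorted(values):
    print(node, hosts.get(node), describe_return_value(values[node]))
```

In this excerpt both nodes report the same host address, and under the assumed convention the return value 139 on node 1 would correspond to signal 11 (SIGSEGV).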
Is there anything else I need to change for MPI to work?
I read something about the shadow, but could not quite work out whether it is needed to get Condor working for MPI.
Please help.
Samir Khanal
Networking Lab
Bowling Green State University
Bowling Green, OH 43402
skhanal@xxxxxxxx