Hi,
I'm trying to run MPICH under HTCondor. I've configured the dedicated scheduler, and it works. I then tried to run the pi_montecarlo.x application. From the log file, the job looks like it ran well, but in the end I do not get any result.
#### Submission file:
######################################
## Example submit description file
## for MPICH 1 MPI
## works with MPICH 1.2.4, 1.2.5 and 1.2.6
######################################
universe = parallel
executable = mp1script
arguments = pi_montecarlo.x
machine_count = 1
output = loop.out
error = loop.error
log = loop.log
should_transfer_files = yes
when_to_transfer_output = on_exit
transfer_input_files = pi_montecarlo.x
queue
#### Log file:
000 (018.000.000) 01/15 11:38:41 Job submitted from host: <10.3.16.144:55930>
...
014 (018.000.000) 01/15 11:38:56 Node 0 executing on host: <10.3.16.112:39838>
...
001 (018.000.000) 01/15 11:38:56 Job executing on host: MPI_job
...
015 (018.000.000) 01/15 11:39:04 Node 0 terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
479 - Run Bytes Sent By Node
1317684 - Run Bytes Received By Node
479 - Total Bytes Sent By Node
1317684 - Total Bytes Received By Node
...
005 (018.000.000) 01/15 11:39:05 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
479 - Run Bytes Sent By Job
1317684 - Run Bytes Received By Job
479 - Total Bytes Sent By Job
1317684 - Total Bytes Received By Job
Partitionable Resources :    Usage  Request Allocated
   Cpus                 :                 1         1
   Disk (KB)            :     1500     1500  13561016
   Memory (MB)          :        3        1      1995
...
#############
From the error file, I got this:
/etc/condor/var/execute/dir_15035/condor_exec.exe: 125: [: Illegal number: pi_montecarlo.x
/etc/condor/var/execute/dir_15035/condor_exec.exe: 35: [: Illegal number: pi_montecarlo.x
/etc/condor/var/execute/dir_15035/condor_exec.exe: 61: /etc/condor/var/execute/dir_15035/condor_exec.exe: cannot open /etc/condor/var/execute/dir_15035/contact: No such file
/etc/condor/var/execute/dir_15035/condor_exec.exe: 64: /etc/condor/var/execute/dir_15035/condor_exec.exe: mpirun: not found
##############
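The "mpirun: not found" line above suggests that the shell running the job on the worker cannot find mpirun on its PATH. A minimal sketch of a check that could be run on the execute node, assuming a POSIX shell and the same environment the job gets (the StarterLog below shows the job running as user nobody); the helper name check_tool is hypothetical and not part of HTCondor or MPICH:

```shell
#!/bin/sh
# Hypothetical helper: report whether a command is visible on the current PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: not found"
    fi
}

# mpirun is the command the error file reports as missing.
check_tool mpirun
```

If this reports "not found" in the job's environment, the MPICH bin directory is probably not on the PATH that the job sees.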
This is the StarterLog from the worker (10.3.16.112):
01/15/14 11:38:50 ** condor_starter (CONDOR_STARTER) STARTING UP
01/15/14 11:38:50 ** /etc/condor/sbin/condor_starter
01/15/14 11:38:50 ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
01/15/14 11:38:50 ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
01/15/14 11:38:50 ** $CondorVersion: 8.0.4 Oct 19 2013 BuildID: 189770 $
01/15/14 11:38:50 ** $CondorPlatform: x86_64_Ubuntu12 $
01/15/14 11:38:50 ** PID = 15035
01/15/14 11:38:50 ** Log last touched 1/15 10:48:16
01/15/14 11:38:50 ******************************************************
01/15/14 11:38:50 Using config source: /etc/condor/etc/condor_config
01/15/14 11:38:50 Using local config sources:
01/15/14 11:38:50 /etc/condor/var/condor_config.local
01/15/14 11:38:50 DaemonCore: command socket at <10.3.16.112:33481>
01/15/14 11:38:50 DaemonCore: private command socket at <10.3.16.112:33481>
01/15/14 11:38:50 Communicating with shadow <10.3.16.144:41776?noUDP>
01/15/14 11:38:50 Submitting machine is "hpclab.abcd.efg.hi"
01/15/14 11:38:50 setting the orig job name in starter
01/15/14 11:38:50 setting the orig job iwd in starter
01/15/14 11:38:50 Job has WantIOProxy=true
01/15/14 11:38:50 Initialized IO Proxy.
01/15/14 11:38:50 Done setting resource limits
01/15/14 11:38:50 File transfer completed successfully.
01/15/14 11:38:51 Job 18.0 set to execute immediately
01/15/14 11:38:51 Starting a PARALLEL universe job with ID: 18.0
01/15/14 11:38:51 IWD: /etc/condor/var/execute/dir_15035
01/15/14 11:38:51 Output file: /etc/condor/var/execute/dir_15035/_condor_stdout
01/15/14 11:38:51 Error file: /etc/condor/var/execute/dir_15035/_condor_stderr
01/15/14 11:38:56 About to exec /etc/condor/var/execute/dir_15035/condor_exec.exe pi_montecarlo.x
01/15/14 11:38:56 Setting job's virtual memory rlimit to 0 megabytes
01/15/14 11:38:56 Running job as user nobody
01/15/14 11:38:56 Create_Process succeeded, pid=15039
01/15/14 11:38:57 condor_write() failed: send() 1 bytes to <10.3.16.112:39661> returned -1, timeout=0, errno=32 Broken pipe.
01/15/14 11:38:59 Process exited, pid=15039, status=0
01/15/14 11:39:00 Got SIGQUIT. Performing fast shutdown.
01/15/14 11:39:00 ShutdownFast all jobs.
01/15/14 11:39:00 **** condor_starter (condor_STARTER) pid 15035 EXITING WITH STATUS 0
####
I suspect the problem is related to the message "condor_write() failed: send() 1 bytes to <10.3.16.112:47961> returned -1, timeout=0, errno=32 Broken pipe", but I do not know how to solve it. I am also wondering why there is a condor_exec.exe in a Linux environment (see the error file).
I hope someone can help me.
Thank you very much in advance.