Dear friends,
Issue:
I have 3 clusters with Open MPI already installed. The compiled MPI code works fine when run locally on each cluster. However, when I submit it with condor_submit, I get the following error:
--------------------------------------------------------------------------
The value of the MCA parameter "plm_rsh_agent" was set to a path
that could not be found:

  plm_rsh_agent: ssh : rsh

Please either unset the parameter, or check that the path is correct
--------------------------------------------------------------------------
Submit file:
universe = vanilla
executable = /usr/bin/mpirun
requestMemory = 1024
request_GPUs = 4
request_cpus = 4
arguments = -np 4 ./cmake_tmp/bin/main 0 1 500
log = logs/job_$(Cluster).$(Process).log
output = logs/job_$(Cluster).$(Process).out
error = logs/job_$(Cluster).$(Process).error
should_transfer_files = yes
when_to_transfer_output = on_exit
transfer_input_files = ./cmake_tmp/bin/main
queue
Attempts:
I tried setting "arguments = -mca plm_rsh_agent /usr/lib/condor/libexec/condor_ssh -np 4 ./cmake_tmp/bin/main 0 1 500", but then I got these errors:
[huashan:596121] [[50260,0],0] ORTE_ERROR_LOG: Not found in file plm_rsh_module.c at line 231
[huashan:596121] [[50260,0],0] ORTE_ERROR_LOG: Not found in file ess_hnp_module.c at line 528
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):
orte_plm_init failed
--> Returned value Not found (-13) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
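A diagnostic that might help narrow this down: the first error says that neither "ssh" nor "rsh" could be resolved, and the environment a Condor job sees on the execute node may have a different or minimal PATH compared with a login shell. The sketch below is only an illustration, not something taken from Open MPI or HTCondor; the function name find_launch_agents is hypothetical. It reports which of the two agent names resolve on the PATH of whatever process runs it.

```python
import os
import shutil


def find_launch_agents(agents=("ssh", "rsh"), path=None):
    """Map each candidate launch agent name to its resolved path, or None.

    `agents` mirrors the "ssh : rsh" list from the error message;
    `path=None` means "use the PATH of the current process".
    """
    return {name: shutil.which(name, path=path) for name in agents}


if __name__ == "__main__":
    # Show the PATH this process actually sees, then each lookup result.
    print("PATH =", os.environ.get("PATH", "<unset>"))
    for name, location in find_launch_agents().items():
        print(f"{name}: {location or 'not found'}")
```

Running it once from a login shell and once as the job's executable would show whether the two environments differ.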
Could anyone please help me sort it out?
Many thanks!
Best,
Max