Subject: Re: [Condor-users] HYPHY parallel MPI implementation on Condor
Hi Javier,
You set:
export P4_RSHCOMMAND=$CONDOR_SSH
So when mpirun is used, condor_ssh is executed to connect to the remote
processes.
Since your job runs fine when launched directly, i.e. without using Condor,
one possibility is that the error occurs when the condor_ssh script is
executed.
It could be that the options HYPHYMPI passes to the command set in
P4_RSHCOMMAND are not supported by condor_ssh. I guess you could back up
your current condor_ssh script and modify it for debugging purposes if
necessary.
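As a rough sketch of that kind of debugging change, the helper below
appends every argument it receives to a log file, so you can see exactly
which options reach condor_ssh. The function name log_rsh_args and the log
path are my own placeholders, not part of Condor, and I am assuming your
condor_ssh is a plain shell script that you can edit after saving a copy.

```shell
# Hypothetical helper (not part of Condor): record the arguments that the
# MPI launcher hands to the command named in P4_RSHCOMMAND.
log_rsh_args() {
    # $1 is the log file; everything after it is logged, one argument per line.
    logfile=$1
    shift
    {
        printf 'argc=%d\n' "$#"
        for arg in "$@"; do
            # Brackets make empty or space-containing arguments visible.
            printf 'arg: [%s]\n' "$arg"
        done
    } >> "$logfile"
}

# Assumed usage near the top of your backed-up-and-edited condor_ssh
# (the log path is only an example):
#   log_rsh_args /tmp/condor_ssh_debug.log "$@"
```

Comparing the logged arguments with the options condor_ssh actually parses
should show whether an unsupported option is the cause.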
I hope this helps,
Best regards,
--
Cecile GARROS
IT Service Group
Advanced Technologies Division
--------------------------------
ARGO GRAPHICS INC.
www.argo-graph.co.jp
tel: 03(5641)2025
fax: 03(5641)2011
--------------------------------
Javier Forment Millet <jforment@xxxxxxxxxxxx> Sent by: condor-users-bounces@xxxxxxxxxxx
##################
# StarterLog.vm2 (at virtual machine 2 of node executing one of the MPI
# parallel processes)
##################
10/4 18:59:01 **** condor_starter (condor_STARTER) EXITING WITH STATUS 0
10/6 13:18:05 ******************************************************
10/6 13:18:05 ** condor_starter (CONDOR_STARTER) STARTING UP
10/6 13:18:05 ** /usr/local/condor-6.8.6/sbin/condor_starter
10/6 13:18:05 ** $CondorVersion: 6.8.6 Sep 13 2007 $
10/6 13:18:05 ** $CondorPlatform: X86_64-LINUX_RHEL3 $
10/6 13:18:05 ** PID = 25176
10/6 13:18:05 ** Log last touched 10/4 18:59:01
10/6 13:18:05 ******************************************************
10/6 13:18:05 Using config source: /home/condor/condor_config
10/6 13:18:05 Using local config sources:
10/6 13:18:05 /home/condor/condor_config.local
10/6 13:18:05 DaemonCore: Command Socket at <192.168.1.22:42909>
10/6 13:18:05 Done setting resource limits
10/6 13:18:05 Communicating with shadow <192.168.1.21:45720>
10/6 13:18:05 Submitting machine is "bioxeon.ibmcp-cluster.upv.es"
10/6 13:18:05 Job has WantIOProxy=true
10/6 13:18:05 Initialized IO Proxy.
10/6 13:18:05 File transfer completed successfully.
10/6 13:18:06 Starting a PARALLEL universe job with ID: 2221.0
10/6 13:18:06 IWD: /home/condor/execute/dir_25176
10/6 13:18:06 Output file: /home/condor/execute/dir_25176/out.2.txt
10/6 13:18:06 Error file: /home/condor/execute/dir_25176/out.2.txt
10/6 13:18:06 About to exec /home/condor/execute/dir_25176/condor_exec.exe HYPHYMPI MPITest.bf
10/6 13:18:06 Create_Process succeeded, pid=25180
10/6 13:18:08 IOProxy: accepting connection from 192.168.1.22
10/6 13:18:08 IOProxyHandler: closing connection to 192.168.1.22
10/6 13:18:12 IOProxy: accepting connection from 192.168.1.22
10/6 13:18:12 IOProxyHandler: closing connection to 192.168.1.22
10/6 13:18:32 Got SIGQUIT. Performing fast shutdown.
10/6 13:18:32 ShutdownFast all jobs.
10/6 13:18:32 Process exited, pid=25180, status=0
10/6 13:18:32 condor_write(): Socket closed when trying to write 260 bytes to <192.168.1.21:52511>, fd is 5
10/6 13:18:32 Buf::write(): condor_write() failed
10/6 13:18:32 Failed to send job exit status to shadow
10/6 13:18:32 JobExit() failed, waiting for job lease to expire or for a reconnect attempt
10/6 13:18:32 Last process exited, now Starter is exiting
10/6 13:18:32 **** condor_starter (condor_STARTER) EXITING WITH STATUS 0
--
Javier Forment Millet
Instituto de Biología Celular y Molecular de Plantas (IBMCP) CSIC-UPV
Ciudad Politécnica de la Innovación (CPI) Edificio 8 E, Escalera 7 Puerta E
Calle Ing. Fausto Elio s/n. 46022 Valencia, Spain
Tlf.:+34-96-3877858
FAX: +34-96-3877859
jforment@xxxxxxxxxxxx
_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting https://lists.cs.wisc.edu/mailman/listinfo/condor-users