Hello,

I have a problem running MPI (MPICH2) jobs with Condor on Windows.
I followed the instructions in the Condor manual to configure parallel jobs. The following link also provides a lot of useful information: http://www.itk.org/Wiki/Proposals:Condor

Most things work, and the MPI job runs successfully with Condor on a single execute machine. However, there is a problem when the job should run on two machines.

==== My Condor configuration =====
Condor version: 8.4.1
submit: submit machine, central manager
execute-1: execute machine 1, with 20 CPUs
execute-2: execute machine 2, with 20 CPUs
MPI: mpich2-1.4.1p1-x86-64
MPI application: app.exe
===============================

For instance, when I set machine_count = 30 in the parallel submit file, the 30 CPUs are claimed correctly, e.g. 20 on execute-1 and 10 on execute-2. But the job is executed only on execute-1: all 30 app.exe processes are started there, and none on execute-2.

Since execute-1 has only 20 CPUs, the job finishes like this: 20 app.exe processes run first, and as soon as resources free up, the remaining 10 start. On execute-2 there are only 10 condor_starter daemons and no app.exe process.

I would appreciate it very much if someone could help with this. I have been digging into this problem for a few days without success. If further information is needed, let me know.

Thanks.

Best regards,
Linlin