Hi,

Given that there has been no response these past days, let me provide more information you may want to look at.

------------------------------------ condor.config ------------------------------
CONDOR_HOST = sctmc.ctmc.com
COLLECTOR_NAME = CTMC
UID_DOMAIN = $(CONDOR_HOST)
CONDOR_ADMIN =
SMTP_SERVER =
ALLOW_READ = *
ALLOW_WRITE = *
ALLOW_ADMINISTRATOR = $(CONDOR_HOST), $(IP_ADDRESS)
CREDD_HOST = sctmc.ctmc.com
STARTER_ALLOW_RUNAS_OWNER = True
CREDD_CACHE_LOCALLY = True
SEC_CLIENT_AUTHENTICATION_METHODS = NTSSPI, PASSWORD
ALLOW_CONFIG = zhanglinlin1@ctmc
START = FALSE
WANT_VACATE = FALSE
WANT_SUSPEND = TRUE
DAEMON_LIST = MASTER, SCHEDD, COLLECTOR, NEGOTIATOR
BIND_ALL_INTERFACES = FALSE
--------------------------------------------------------------------------------------

This is for the central manager. The configuration for the worker nodes is similar, except for the daemon-related lines.

--------------------- condor.config.local (only the parallel settings) ---------------
#SMPD_SERVER = C:\Program Files\MPICH2\bin\smpd.exe
#SMPD_SERVER_ARGS = -p 6666 -d
#SMPD_SERVER_LOG = $(LOG)\SmpdLog
DedicatedScheduler = "DedicatedScheduler@xxxxxxxxxxxxxx"
STARTD_ATTRS = $(STARTD_ATTRS), DedicatedScheduler
Scheduler = "DedicatedScheduler@xxxxxxxxxxxxxx"
MPI_CONDOR_RSH_PATH = $(LIBEXEC)
START = True
SUSPEND = False
CONTINUE = True
PREEMPT = False
KILL = False
WANT_SUSPEND = False
WANT_VACATE = False
RANK = Scheduler =?= $(DedicatedScheduler)
#DAEMON_LIST = $(DAEMON_LIST), SMPD_SERVER
--------------------------------------------------------------------------------------------

I also tried uncommenting the SMPD-related lines in condor.config.local, but that did not solve the problem.
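For completeness, the daemon-related lines on the execute machines differ roughly as sketched here (this is only a sketch of the difference, not a copy of the full file; the START policy itself comes from the condor.config.local settings above):

-------------------- condor.config on an execute machine (sketch) --------------------
# Execute machines run a STARTD instead of the SCHEDD/COLLECTOR/NEGOTIATOR
DAEMON_LIST = MASTER, STARTD
--------------------------------------------------------------------------------------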
---------------------- submit file ------------------------------------------
universe = parallel
executable = mp2script.bat
arguments = \\sctmc\d\condor\myapp.exe
machine_count = 30
output = parallel_out.$(NODE).log
error = parallel_error.$(NODE).log
log = parallel_log.$(NODE).log
should_transfer_files = yes
when_to_transfer_output = on_exit
run_as_owner = True
queue
------------------------------------------------------------------------------

There is another problem: the produced log file is named parallel_log.#pArAlLeLnOdE#, which is not correct. I did not find any errors about this in the Condor log files.

Any suggestions?

Thanks,
Linlin

From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx]
On behalf of 张琳琳1

Hello,

I have a problem with MPI (MPICH2) under Condor on Windows.
I followed the instructions in the Condor manual to set up the configuration for parallel jobs. The following link also provides a lot of useful information: http://www.itk.org/Wiki/Proposals:Condor

Most things are fine, and the MPI job executes successfully with Condor on a single execute machine. But there is a problem when the job runs on two machines.

==== My Condor configuration =====
Condor version: 8.4.1
submit: submit machine, central manager
execute-1: execute machine 1, with 20 cpus
execute-2: execute machine 2, with 20 cpus
MPI: mpich2-1.4.1p1-x86-64
MPI application: app.exe
===============================

For instance, when I set machine_count = 30 in the parallel submit file, the 30 cpus are correctly claimed, e.g. 20 on execute-1 and 10 on execute-2. But the job only executes on execute-1: there are 30 app.exe processes on execute-1 and none on execute-2, even though execute-1 has only 20 cpus.
The job finishes like this: 20 app.exe processes run first, and once resources become free, the remaining 10 start. On execute-2 there are only 10 condor_starter processes and no app.exe process.

I would appreciate it very much if someone could help with this; I have dug into the problem for a few days without success. If further information is needed, let me know.

Thanks.

Best regards,
Linlin
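P.S. One more thought on the parallel_log.#pArAlLeLnOdE# file name mentioned above: if I read the manual correctly, the $(NODE) macro is only substituted in the output and error entries of a parallel-universe submit file, while log names a single user log shared by the whole cluster, which would explain the unexpanded placeholder. Under that assumption, the relevant submit lines would look like:

------------------------------------------------------------------------------
output = parallel_out.$(NODE).log
error  = parallel_error.$(NODE).log
log    = parallel.log
------------------------------------------------------------------------------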