I am running into an issue with my parallel universe jobs as well. I have just installed Condor as these instructions describe, with a few other alterations to customize my environment. I am running RHEL 6.3 Server 64-bit on my master and compute nodes (about 50), with Condor 7.8.7. I have tried a few parallel MPI scripts, including the one mentioned earlier in this thread. I get the following error for that script in /var/log/condor/ShadowLog:

04/08/13 14:06:26 Initializing a PARALLEL shadow for job 50.0
04/08/13 14:06:26 (50.0) (24348): condor_write(): Socket closed when trying to write 37 bytes to daemon at <Server.IP.Cleaned:46094>, fd is 5
04/08/13 14:06:26 (50.0) (24348): Buf::write(): condor_write() failed
04/08/13 14:06:26 (50.0) (24348): ChildAliveMsg: failed to send DC_CHILDALIVE to parent daemon at <Server.IP.Cleaned:46094> (try 1 of 3): CEDAR:6002:failed to send EOM
04/08/13 14:06:27 (50.0) (24348): ERROR "Failed to get number of procs" at line 241 in file /slots/01/dir_65060/userdir/src/condor_shadow.V6.1/parallelshadow.cpp

Dedicated server's /etc/condor/condor_config.local:

## What machine is your central manager?
CONDOR_HOST = Server.Name.Cleaned
## Pool's short description
COLLECTOR_NAME = "Server.Name.Cleaned"
# For MPI and other parallel universe runs
Scheduler = "DedicatedScheduler@xxxxxxxxxxxxxxxxxxx"
## When is this machine willing to start a job?
START = TRUE
## When to suspend a job?
SUSPEND = FALSE
## When to nicely stop a job?
## (as opposed to killing it instantaneously)
PREEMPT = FALSE
## When to instantaneously kill a preempting job
## (e.g. if a job is in the pre-empting stage for too long)
KILL = FALSE
## This macro determines what daemons the condor_master will start and keep its watchful eyes on.
## The list is a comma or space separated list of subsystem names
DAEMON_LIST = COLLECTOR, MASTER, NEGOTIATOR, SCHEDD

- - - - - - - - - - - - - - - - - - - - -

All compute nodes' /etc/condor/condor_config.local:

## What machine is your central manager?
CONDOR_HOST = Server.Name.Cleaned
## Pool's short description
COLLECTOR_NAME = "Server.Name.Cleaned"
# For parallel MPI jobs to run
DedicatedScheduler = "DedicatedScheduler@xxxxxxxxxxxxxxxxxxx"
STARTD_ATTRS = $(STARTD_ATTRS), DedicatedScheduler
CONTINUE = True
WANT_SUSPEND = False
WANT_VACATE = False
RANK = Scheduler =?= $(DedicatedScheduler)
## When is this machine willing to start a job?
START = TRUE
## When to suspend a job?
SUSPEND = FALSE
## When to nicely stop a job?
## (as opposed to killing it instantaneously)
PREEMPT = FALSE
## When to instantaneously kill a preempting job
## (e.g. if a job is in the pre-empting stage for too long)
KILL = FALSE
## This macro determines what daemons the condor_master will start and keep its watchful eyes on.
## The list is a comma or space separated list of subsystem names
DAEMON_LIST = MASTER, STARTD
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler

----------

Note both STARTD_EXPRS and STARTD_ATTRS: the online manual I referenced has STARTD_ATTRS, but the example file shipped with Condor 7.8.7 uses STARTD_EXPRS, so I put both in. When I run a script, it just hangs, with the following in the job's log:

007 (049.000.000) 04/08 13:39:56 Shadow exception!
	Failed to get number of procs
	0 - Run Bytes Sent By Job
	0 - Run Bytes Received By Job
...
007 (049.000.000) 04/08 13:40:00 Shadow exception!
	Failed to get number of procs
	0 - Run Bytes Sent By Job
	0 - Run Bytes Received By Job
...
009 (049.000.000) 04/08 13:40:00 Job was aborted by the user.
	via condor_rm (by user condor)

I also tried submitting the example script as-is, with only the server name changed, and got the same results. condor_status lists all the compute nodes in an Unclaimed/Idle state. When I specify machine_count = X,
X of the nodes become claimed, but the job just sits idle. Does anyone have any thoughts on this?

From: htcondor-users-bounces@xxxxxxxxxxx [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of David Hentchel

I just went through this as a beginner. The key is to set up every host running a startd to reference a "dedicated scheduler" in the pool, according to the instructions in manual section 3.12.8, "HTCondor's Dedicated Scheduling". You can also merge in the examples from the Condor install (see $INSTALL_DIR/etc/examples/condor_config.local.dedicated.resource). All I had to change from that example was the hostname of the machine I wanted to use as the dedicated scheduler. Hope this helps.

On Mon, Apr 8, 2013 at 8:45 AM, leconte <jerome.leconte@xxxxxxxxxxxxxxx> wrote:

On 08/04/2013 09:53, Muak rules wrote:
I'm submitting my first parallel universe job, which is basically a simple hello world problem.

hello Muak,

--
David Hentchel
Performance Engineer
(617) 803 - 1193
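For anyone comparing notes: alongside the dedicated-scheduler config above, the submit description file matters too. A minimal parallel-universe submit file, modeled on the /bin/sleep example from the manual's dedicated-scheduling section (file names and machine_count here are placeholders, not the poster's actual job), looks roughly like this:

```
######################################################
## Hypothetical minimal parallel-universe submit file.
## Assumes the DedicatedScheduler config shown above is
## already in place on the startd (execute) machines.
universe      = parallel
executable    = /bin/sleep     # stand-in for a real MPI wrapper script
arguments     = 30
machine_count = 4              # number of dedicated slots to claim
log           = sleep.log
output        = sleep.out.$(NODE)   # $(NODE) expands per claimed slot
error         = sleep.err.$(NODE)
should_transfer_files   = IF_NEEDED
when_to_transfer_output = ON_EXIT
queue
```

Submit it with condor_submit on the machine running the dedicated schedd. If this sleep job also dies with "Failed to get number of procs", the problem is in the pool configuration (schedd/startd handshake) rather than in the MPI script itself.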