I am running into an issue with my parallel universe jobs as well. I have just installed Condor as these instructions describe, with a few other alterations to customize my environment. I am running RHEL 6.3 Server 64-bit on my master and compute nodes (about 50), with Condor 7.8.7. I have tried a few parallel MPI scripts, including the one mentioned earlier in this thread. I get the following error for that script in /var/log/condor/ShadowLog:

04/08/13 14:06:26 Initializing a PARALLEL shadow for job 50.0
04/08/13 14:06:26 (50.0) (24348): condor_write(): Socket closed when trying to write 37 bytes to daemon at <Server.IP.Cleaned:46094>, fd is 5
04/08/13 14:06:26 (50.0) (24348): Buf::write(): condor_write() failed
04/08/13 14:06:26 (50.0) (24348): ChildAliveMsg: failed to send DC_CHILDALIVE to parent daemon at <Server.IP.Cleaned:46094> (try 1 of 3): CEDAR:6002:failed to send EOM
04/08/13 14:06:27 (50.0) (24348): ERROR "Failed to get number of procs" at line 241 in file /slots/01/dir_65060/userdir/src/condor_shadow.V6.1/parallelshadow.cpp

Dedicated server's /etc/condor/condor_config.local:

## What machine is your central manager?
CONDOR_HOST = Server.Name.Cleaned
## Pool's short description
COLLECTOR_NAME = "Server.Name.Cleaned"
# For MPI and other parallel universe runs
Scheduler = "DedicatedScheduler@xxxxxxxxxxxxxxxxxxx"
## When is this machine willing to start a job?
START = TRUE
## When to suspend a job?
SUSPEND = FALSE
## When to nicely stop a job?
## (as opposed to killing it instantaneously)
PREEMPT = FALSE
## When to instantaneously kill a preempting job
## (e.g. if a job is in the pre-empting stage for too long)
KILL = FALSE
## This macro determines what daemons the condor_master will start and keep its watchful eyes on.
## The list is a comma or space separated list of subsystem names
DAEMON_LIST = COLLECTOR, MASTER, NEGOTIATOR, SCHEDD

- - - - - - - - - - - - - - - - - - - - -

All compute nodes' /etc/condor/condor_config.local:

## What machine is your central manager?
CONDOR_HOST = Server.Name.Cleaned
## Pool's short description
COLLECTOR_NAME = "Server.Name.Cleaned"
# For parallel MPI jobs to run
DedicatedScheduler = "DedicatedScheduler@xxxxxxxxxxxxxxxxxxx"
STARTD_ATTRS = $(STARTD_ATTRS), DedicatedScheduler
CONTINUE = True
WANT_SUSPEND = False
WANT_VACATE = False
RANK = Scheduler =?= $(DedicatedScheduler)
## When is this machine willing to start a job?
START = TRUE
## When to suspend a job?
SUSPEND = FALSE
## When to nicely stop a job?
## (as opposed to killing it instantaneously)
PREEMPT = FALSE
## When to instantaneously kill a preempting job
## (e.g. if a job is in the pre-empting stage for too long)
KILL = FALSE
## This macro determines what daemons the condor_master will start and keep its watchful eyes on.
## The list is a comma or space separated list of subsystem names
DAEMON_LIST = MASTER, STARTD
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler

----------

Note both STARTD_EXPRS and STARTD_ATTRS: the online manual I referenced has STARTD_ATTRS, but the example file shipped with Condor 7.8.7 uses STARTD_EXPRS, so I put both in. When I run a script, it just hangs, with the following in the job's log:

007 (049.000.000) 04/08 13:39:56 Shadow exception!
	Failed to get number of procs
	0 - Run Bytes Sent By Job
	0 - Run Bytes Received By Job
...
007 (049.000.000) 04/08 13:40:00 Shadow exception!
	Failed to get number of procs
	0 - Run Bytes Sent By Job
	0 - Run Bytes Received By Job
...
009 (049.000.000) 04/08 13:40:00 Job was aborted by the user.
	via condor_rm (by user condor)

I also tried submitting the example script as-is, with only the server name changed, and got the same results. condor_status lists all the compute nodes in an Unclaimed/Idle state. When I specify machine_count = X,
X of the nodes become claimed, but the job just sits idle. Does anyone have any thoughts on this?

From: htcondor-users-bounces@xxxxxxxxxxx [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of David Hentchel

I just went through this as a beginner. The key is to set up every host running a startd to reference a "dedicated scheduler" in the pool, according to the instructions in manual section 3.12.8, "HTCondor's Dedicated Scheduling". You can also merge in the examples from the Condor install (see $INSTALL_DIR/etc/examples/condor_config.local.dedicated.resource). All I had to change from that example was the hostname of the machine I wanted to use as the dedicated scheduler. Hope this helps.

On Mon, Apr 8, 2013 at 8:45 AM, leconte <jerome.leconte@xxxxxxxxxxxxxxx> wrote:

On 08/04/2013 09:53, Muak rules wrote:
I'm submitting my first parallel universe job, which is basically a simple hello world problem.

hello Muak,

--
David Hentchel
Performance Engineer
(617) 803 - 1193
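For anyone comparing notes: alongside the dedicated-scheduler config above, the submit description file matters too. A minimal parallel-universe submit file, modeled on the /bin/sleep example from the manual's dedicated-scheduling section (file names and machine_count here are placeholders, not the poster's actual job), looks roughly like this:

```
######################################################
## Hypothetical minimal parallel-universe submit file.
## Assumes the DedicatedScheduler config shown above is
## already in place on the startd (execute) machines.
universe      = parallel
executable    = /bin/sleep     # stand-in for a real MPI wrapper script
arguments     = 30
machine_count = 4              # number of dedicated slots to claim
log           = sleep.log
output        = sleep.out.$(NODE)   # $(NODE) expands per claimed slot
error         = sleep.err.$(NODE)
should_transfer_files   = IF_NEEDED
when_to_transfer_output = ON_EXIT
queue
```

Submit it with condor_submit on the machine running the dedicated schedd. If this sleep job also dies with "Failed to get number of procs", the problem is in the pool configuration (schedd/startd handshake) rather than in the MPI script itself.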