Scheduler = "DedicatedScheduler@Server.Name.Cleaned"
DedicatedScheduler = "DedicatedScheduler@xxxxxxxxxxxxxxxxxxx"
STARTD_ATTRS = $(STARTD_ATTRS), DedicatedScheduler
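If you are unsure which configuration files your installation actually reads, condor_config_val can list them (a quick check, assuming the condor tools are on your PATH):

    # print every configuration file this installation reads, in order
    condor_config_val -config
    # print the file(s) intended for local overrides
    condor_config_val LOCAL_CONFIG_FILE

As in the configs quoted further down this thread, the Scheduler line goes in the dedicated server's condor_config.local and the DedicatedScheduler/STARTD_ATTRS pair goes in each compute node's condor_config.local; run condor_reconfig after editing.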
Hello
Will you please tell me in which file I should add this line:
Scheduler = "DedicatedScheduler@MY_USERNAME@Server.Name.Cleaned"
as I'm not finding this entry in "condor_config" or "condor_config.local".
Date: Mon, 8 Apr 2013 16:29:05 -0400
From: dhentchel@xxxxxxxxx
To: htcondor-users@xxxxxxxxxxx
Subject: Re: [HTCondor-users] Job Submission in Parallel Universe
I don't know whether there may be other limitations that could get in the way, but I do know that if using Personal Condor you need to qualify the hostname strings with your user name, e.g.:
Scheduler = "DedicatedScheduler@MY_USERNAME@Server.Name.Cleaned"
On Mon, Apr 8, 2013 at 3:49 PM, Usman Khan <muakrules@xxxxxxxx> wrote:
Hello all, thanks for your great help.
I think I forgot to mention that I'm running jobs on a personal condor.
So far I haven't built my own cluster, so I just want to know whether this job can run on a personal condor.
Greetings.
On 04/09/2013 12:13 AM, Andrew Kuelbs wrote:
I am running into an issue with my parallel universe jobs as well. I have just installed condor as these instructions mention, with a few other alterations to customize my environment.
I am running RHEL 6.3 Server 64-bit on my master and compute nodes (about 50), with Condor 7.8.7. I have tried a few parallel MPI scripts, including the one mentioned earlier in this thread.
I get the following error for that script in /var/log/condor/ShadowLog:
04/08/13 14:06:26 Initializing a PARALLEL shadow for job 50.0
04/08/13 14:06:26 (50.0) (24348): condor_write(): Socket closed when trying to write 37 bytes to daemon at <Server.IP.Cleaned:46094>, fd is 5
04/08/13 14:06:26 (50.0) (24348): Buf::write(): condor_write() failed
04/08/13 14:06:26 (50.0) (24348): ChildAliveMsg: failed to send DC_CHILDALIVE to parent daemon at <Server.IP.Cleaned:46094> (try 1 of 3): CEDAR:6002:failed to send EOM
04/08/13 14:06:27 (50.0) (24348): ERROR "Failed to get number of procs" at line 241 in file /slots/01/dir_65060/userdir/src/condor_shadow.V6.1/parallelshadow.cpp
Dedicated Server’s /etc/condor/condor_config.local
## What machine is your central manager?
CONDOR_HOST = Server.Name.Cleaned
## Pool's short description
COLLECTOR_NAME = "Server.Name.Cleaned"
#FOR MPI and other Parallel Universe runs
Scheduler = "DedicatedScheduler@xxxxxxxxxxxxxxxxxxx"
## When is this machine willing to start a job?
START = TRUE
## When to suspend a job?
SUSPEND = FALSE
## When to nicely stop a job?
## (as opposed to killing it instantaneously)
PREEMPT = FALSE
## When to instantaneously kill a preempting job
## (e.g. if a job is in the pre-empting stage for too long)
KILL = FALSE
## This macro determines what daemons the condor_master will start and keep its watchful eyes on.
## The list is a comma or space separated list of subsystem names
DAEMON_LIST = COLLECTOR, MASTER, NEGOTIATOR, SCHEDD
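Note that this DAEMON_LIST has no STARTD, so the central manager will not execute jobs itself. For a single-machine personal condor, where one host must play every role, the list would look something like this (a sketch, not from the original config):

    ## One-machine (personal condor) pool: the same host submits,
    ## matches, and executes jobs, so it runs all five daemons.
    DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD, STARTD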
- - - - - - - - - - - - - - - - - - - - -
All Compute Nodes' /etc/condor/condor_config.local
## What machine is your central manager?
CONDOR_HOST = Server.Name.Cleaned
## Pool's short description
COLLECTOR_NAME = "Server.Name.Cleaned"
#FOR Parallel MPI files to run
DedicatedScheduler = "DedicatedScheduler@xxxxxxxxxxxxxxxxxxx"
STARTD_ATTRS = $(STARTD_ATTRS), DedicatedScheduler
CONTINUE = True
WANT_SUSPEND = False
WANT_VACATE = False
RANK = Scheduler =?= $(DedicatedScheduler)
## When is this machine willing to start a job?
START = TRUE
## When to suspend a job?
SUSPEND = FALSE
## When to nicely stop a job?
## (as opposed to killing it instantaneously)
PREEMPT = FALSE
## When to instantaneously kill a preempting job
## (e.g. if a job is in the pre-empting stage for too long)
KILL = FALSE
## This macro determines what daemons the condor_master will start and keep its watchful eyes on.
## The list is a comma or space separated list of subsystem names
DAEMON_LIST = MASTER, STARTD
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler
----------
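One quick way to confirm a startd actually picked the attribute up (a sketch; Compute.Node.Name stands in for a real hostname):

    # dump the machine ClassAd and look for the dedicated-scheduler attribute;
    # it should print something like:
    #   DedicatedScheduler = "DedicatedScheduler@xxxxxxxxxxxxxxxxxxx"
    condor_status -l Compute.Node.Name | grep DedicatedScheduler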
Note the STARTD_EXPRS and the STARTD_ATTRS: the online manual referenced here has STARTD_ATTRS, but the example file shipped with Condor 7.8.7 says STARTD_EXPRS, so I put both in. When I run a script it just hangs, with the following in the job's log:
007 (049.000.000) 04/08 13:39:56 Shadow exception!
Failed to get number of procs
0 - Run Bytes Sent By Job
0 - Run Bytes Received By Job
...
007 (049.000.000) 04/08 13:40:00 Shadow exception!
Failed to get number of procs
0 - Run Bytes Sent By Job
0 - Run Bytes Received By Job
...
009 (049.000.000) 04/08 13:40:00 Job was aborted by the user.
via condor_rm (by user condor)
I tried just tossing in the example script as it was, with only the server name changed, and simply got the same results. A condor_status lists all the compute nodes in an unclaimed Idle state. When I specify machine_count=X, X nodes become claimed but the job just sits idle. Does anyone have any thoughts on this?
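When claimed nodes just sit there, condor_q's analysis mode is a reasonable first check (a sketch; 49.0 stands in for the cluster.proc id from the log above):

    # ask the schedd why job 49.0 has not started running
    condor_q -analyze 49.0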
From: htcondor-users-bounces@xxxxxxxxxxx [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of David Hentchel
Sent: Monday, April 08, 2013 11:39 AM
To: jerome.leconte@xxxxxxxxxxxxxxx; HTCondor-Users Mail List
Subject: Re: [HTCondor-users] Job Submission in Parallel Universe
I just went through this as a beginner. The key is to set up every host running a Start daemon to reference a "dedicated scheduler" in the pool, according to the instructions in the manual section labelled "3.12.8 HTCondor's Dedicated Scheduling". You can also merge in the example from the condor install (see $INSTALL_DIR/etc/examples/condor_config.local.dedicated.resource). All I had to change from that example was the hostname of the machine I wanted to use as the dedicated scheduler.
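In practice that merge can be as simple as appending the example file and reconfiguring (a sketch; the exact paths vary by install):

    # append the dedicated-resource example to the local config, then
    # edit the DedicatedScheduler line in it to name your schedd host
    cat $INSTALL_DIR/etc/examples/condor_config.local.dedicated.resource >> /etc/condor/condor_config.local
    condor_reconfig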
Hope this helps.
On Mon, Apr 8, 2013 at 8:45 AM, leconte <jerome.leconte@xxxxxxxxxxxxxxx> wrote:
On 08/04/2013 09:53, Muak rules wrote:
I'm submitting my first parallel universe job, which is basically a simple hello world program.
I'm using CentOS 6.3 and installed HTCondor using "yum install condor".
I'm not sure about the version of HTCondor. When I submit, the job goes into the idle state.
Please help me out with this. Following is my submit description file.
universe=parallel
executable=mpi_hello_world
machine_count=1
log=hello.log
out=hello.out
queue
The following attachment contains my job.
Hello Muak,
I have tested your program and submit file on my own test cluster; it works fine.
I suspect it is not your program or your submit file but your configuration that causes the problem.
I've corrected only the line
out=hello.out
by
output=hello.out
since my condor complained about it.
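For reference, the full corrected submit description file (only that one line changed):

    universe      = parallel
    executable    = mpi_hello_world
    machine_count = 1
    log           = hello.log
    output        = hello.out
    queue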
I don't know if I can fix your problem, but can you post your cluster config?
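(Since you mentioned not being sure of your version: condor_version prints the exact version and platform of the installed binaries.)

    # print the installed HTCondor version
    condor_version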
Greetings
--
David Hentchel
Performance Engineer
(617) 803 - 1193