Scheduler = "DedicatedScheduler@Server.Name.Cleaned"
DedicatedScheduler = "DedicatedScheduler@xxxxxxxxxxxxxxxxxxx"
STARTD_ATTRS = $(STARTD_ATTRS), DedicatedScheduler
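If you are unsure which configuration files your installation actually reads, condor_config_val can list them (a quick check, assuming the condor tools are on your PATH):

    # print every configuration file this installation reads, in order
    condor_config_val -config
    # print the file(s) intended for local overrides
    condor_config_val LOCAL_CONFIG_FILE

As in the configs quoted further down this thread, the Scheduler line goes in the dedicated server's condor_config.local and the DedicatedScheduler/STARTD_ATTRS pair goes in each compute node's condor_config.local; run condor_reconfig after editing.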
Hello
Will you please tell me in which file I should add this line:
Scheduler = "DedicatedScheduler@MY_USERNAME@Server.Name.Cleaned"
as I'm not finding this entry in "condor_config" or "condor_config.local".
Date: Mon, 8 Apr 2013 16:29:05 -0400
From: dhentchel@xxxxxxxxx
To: htcondor-users@xxxxxxxxxxx
Subject: Re: [HTCondor-users] Job Submission in Parallel Universe
I don't know whether there may be other limitations that could get in the way, but I do know that if using Personal Condor you need to qualify the hostname strings with your user name, e.g.:
Scheduler = "DedicatedScheduler@MY_USERNAME@Server.Name.Cleaned"
On Mon, Apr 8, 2013 at 3:49 PM, Usman Khan <muakrules@xxxxxxxx> wrote:
Hello all, thanks for your great help.
I think I forgot to mention that I'm running jobs on a personal condor.
So far I haven't built my own cluster, so I just want to know whether this job can run on a personal condor.
Greetings.
On 04/09/2013 12:13 AM, Andrew Kuelbs wrote:
I am running into an issue with my parallel universe jobs as well. I have just installed condor as these instructions mention, with a few other alterations to customize my environment.
I am running RHEL 6.3 Server 64-bit on my master and compute nodes (about 50), with Condor 7.8.7. I have tried a few parallel MPI scripts, including the one mentioned earlier in this thread.
I get the following error for that script in /var/log/condor/ShadowLog:
04/08/13 14:06:26 Initializing a PARALLEL shadow for job 50.0
04/08/13 14:06:26 (50.0) (24348): condor_write(): Socket closed when trying to write 37 bytes to daemon at <Server.IP.Cleaned:46094>, fd is 5
04/08/13 14:06:26 (50.0) (24348): Buf::write(): condor_write() failed
04/08/13 14:06:26 (50.0) (24348): ChildAliveMsg: failed to send DC_CHILDALIVE to parent daemon at <Server.IP.Cleaned:46094> (try 1 of 3): CEDAR:6002:failed to send EOM
04/08/13 14:06:27 (50.0) (24348): ERROR "Failed to get number of procs" at line 241 in file /slots/01/dir_65060/userdir/src/condor_shadow.V6.1/parallelshadow.cpp
Dedicated Server’s /etc/condor/condor_config.local
## What machine is your central manager?
CONDOR_HOST = Server.Name.Cleaned
## Pool's short description
COLLECTOR_NAME = "Server.Name.Cleaned"
#FOR MPI and other Parallel Universe runs
Scheduler = "DedicatedScheduler@xxxxxxxxxxxxxxxxxxx"
## When is this machine willing to start a job?
START = TRUE
## When to suspend a job?
SUSPEND = FALSE
## When to nicely stop a job?
## (as opposed to killing it instantaneously)
PREEMPT = FALSE
## When to instantaneously kill a preempting job
## (e.g. if a job is in the pre-empting stage for too long)
KILL = FALSE
## This macro determines what daemons the condor_master will start and keep its watchful eyes on.
## The list is a comma or space separated list of subsystem names
DAEMON_LIST = COLLECTOR, MASTER, NEGOTIATOR, SCHEDD
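Note that this DAEMON_LIST has no STARTD, so the central manager will not execute jobs itself. For a single-machine personal condor, where one host must play every role, the list would look something like this (a sketch, not from the original config):

    ## One-machine (personal condor) pool: the same host submits,
    ## matches, and executes jobs, so it runs all five daemons.
    DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD, STARTD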
- - - - - - - - - - - - - - - - - - - - -
All Compute Nodes' /etc/condor/condor_config.local
## What machine is your central manager?
CONDOR_HOST = Server.Name.Cleaned
## Pool's short description
COLLECTOR_NAME = "Server.Name.Cleaned"
#FOR Parallel MPI files to run
DedicatedScheduler = "DedicatedScheduler@xxxxxxxxxxxxxxxxxxx"
STARTD_ATTRS = $(STARTD_ATTRS), DedicatedScheduler
CONTINUE = True
WANT_SUSPEND = False
WANT_VACATE = False
RANK = Scheduler =?= $(DedicatedScheduler)
## When is this machine willing to start a job?
START = TRUE
## When to suspend a job?
SUSPEND = FALSE
## When to nicely stop a job?
## (as opposed to killing it instantaneously)
PREEMPT = FALSE
## When to instantaneously kill a preempting job
## (e.g. if a job is in the pre-empting stage for too long)
KILL = FALSE
## This macro determines what daemons the condor_master will start and keep its watchful eyes on.
## The list is a comma or space separated list of subsystem names
DAEMON_LIST = MASTER, STARTD
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler
----------
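One quick way to confirm a startd actually picked the attribute up (a sketch; Compute.Node.Name stands in for a real hostname):

    # dump the machine ClassAd and look for the dedicated-scheduler attribute;
    # it should print something like:
    #   DedicatedScheduler = "DedicatedScheduler@xxxxxxxxxxxxxxxxxxx"
    condor_status -l Compute.Node.Name | grep DedicatedScheduler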
Note the STARTD_EXPRS and the STARTD_ATTRS: the online manual referenced here has STARTD_ATTRS, but the example file shipped with Condor 7.8.7 says STARTD_EXPRS, so I put both in. When I run a script it just hangs, with the following in the job's log:
007 (049.000.000) 04/08 13:39:56 Shadow exception!
Failed to get number of procs
0 - Run Bytes Sent By Job
0 - Run Bytes Received By Job
...
007 (049.000.000) 04/08 13:40:00 Shadow exception!
Failed to get number of procs
0 - Run Bytes Sent By Job
0 - Run Bytes Received By Job
...
009 (049.000.000) 04/08 13:40:00 Job was aborted by the user.
via condor_rm (by user condor)
I tried just tossing in the example script as it was, with only the server name changed, and simply got the same results. A condor_status lists all the compute nodes in an unclaimed Idle state. When I specify machine_count=X, X nodes become claimed but the job just sits idle. Does anyone have any thoughts on this?
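When claimed nodes just sit there, condor_q's analysis mode is a reasonable first check (a sketch; 49.0 stands in for the cluster.proc id from the log above):

    # ask the schedd why job 49.0 has not started running
    condor_q -analyze 49.0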
From: htcondor-users-bounces@xxxxxxxxxxx [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of David Hentchel
Sent: Monday, April 08, 2013 11:39 AM
To: jerome.leconte@xxxxxxxxxxxxxxx; HTCondor-Users Mail List
Subject: Re: [HTCondor-users] Job Submission in Parallel Universe
I just went through this as a beginner. The key is to set up every host running a Start daemon to reference a "dedicated scheduler" in the pool, according to the instructions in the manual section labelled "3.12.8 HTCondor's Dedicated Scheduling". You can also merge in the example from the condor install (see $INSTALL_DIR/etc/examples/condor_config.local.dedicated.resource). All I had to change from that example was the hostname of the machine I wanted to use as the dedicated scheduler.
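In practice that merge can be as simple as appending the example file and reconfiguring (a sketch; the exact paths vary by install):

    # append the dedicated-resource example to the local config, then
    # edit the DedicatedScheduler line in it to name your schedd host
    cat $INSTALL_DIR/etc/examples/condor_config.local.dedicated.resource >> /etc/condor/condor_config.local
    condor_reconfig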
Hope this helps.
On Mon, Apr 8, 2013 at 8:45 AM, leconte <jerome.leconte@xxxxxxxxxxxxxxx> wrote:
On 08/04/2013 09:53, Muak rules wrote:
I'm submitting my first parallel universe job, which is basically a simple hello world program.
I'm using CentOS 6.3 and installed HTCondor using "yum install condor".
I'm not sure about the version of HTCondor. When I submit, the job goes into the idle state.
Please help me out with this. Following is my submit description file.
universe=parallel
executable=mpi_hello_world
machine_count=1
log=hello.log
out=hello.out
queue
The following attachment contains my job.
Hello Muak,
I have tested your program and submit file on my own test cluster; it works fine.
I suspect it is not your program or your submit file but your configuration that causes the problem.
I've corrected only the line
out=hello.out
by
output=hello.out
since my condor complained about it.
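For reference, the full corrected submit description file (only that one line changed):

    universe      = parallel
    executable    = mpi_hello_world
    machine_count = 1
    log           = hello.log
    output        = hello.out
    queue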
I don't know if I can fix your problem, but can you post your cluster config?
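(Since you mentioned not being sure of your version: condor_version prints the exact version and platform of the installed binaries.)

    # print the installed HTCondor version
    condor_version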
Greetings
--
David Hentchel
Performance Engineer
(617) 803 - 1193