[Condor-users] 'parallel' universe job submission crashes SCHEDD
- Date: Tue, 28 Dec 2010 21:01:25 -0500
- From: Michael Hanke <michael.hanke@xxxxxxxxx>
- Subject: [Condor-users] 'parallel' universe job submission crashes SCHEDD
Hi,
I'm trying to figure out a suitable configuration for our condor pool. I
have managed to get vanilla jobs working nicely and I am now working on
the configuration for the parallel universe. I was basically following
the manual line by line. However, upon submission of a job, the schedd
crashes and the job stays idle forever.
The submit file looks like this:
Executable = /bin/sleep
Arguments = 30
machine_count = 2
Universe = parallel
output = runner.out
error = runner.error
Log = runner.log
Queue
The last traces from SchedLog:
12/28/10 20:32:21 (pid:15569) Sent ad to central manager for XXX@xxxxxxxxxxxxxxxxxxx
12/28/10 20:32:21 (pid:15569) Sent ad to 1 collectors for XXX@xxxxxxxxxxxxxxxxxxx
12/28/10 20:32:21 (pid:15569) Can't find CondorPlatform in classad for negotiator head1.xxxxx.xxxxxxxxx.xxx
12/28/10 20:32:21 (pid:15569) Can't find CondorPlatform in classad for negotiator head1.xxxxx.xxxxxxxxx.xxx
12/28/10 20:32:21 (pid:15569) Inserting new attribute Scheduler into non-active cluster cid=60 acid=-1
12/28/10 20:32:26 (pid:15569) Sent ad to central manager for XXX@xxxxxxxxxxxxxxxxxxx
12/28/10 20:32:26 (pid:15569) Sent ad to 1 collectors for XXX@xxxxxxxxxxxxxxxxxxx
12/28/10 20:32:26 (pid:15569) Can't find CondorPlatform in classad for negotiator head1.xxxxx.xxxxxxxxx.xxx
12/28/10 20:32:26 (pid:15569) Can't find CondorPlatform in classad for negotiator head1.xxxxx.xxxxxxxxx.xxx
12/28/10 20:32:26 (pid:15569) Inserting new attribute Scheduler into non-active cluster cid=60 acid=-1
12/28/10 20:32:27 (pid:15569) Activity on stashed negotiator socket: <10.0.0.1:50077>
12/28/10 20:32:27 (pid:15569) Negotiating for owner: DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxxx
12/28/10 20:32:27 (pid:15569) Can't find CondorPlatform in classad for startd slot1@xxxxxxxxxxxxxxxxxxxxxxxxx
12/28/10 20:32:27 (pid:15569) Request was NOT accepted for claim slot1@xxxxxxxxxxxxxxxxxxxxxxxxx <10.0.0.5:48869> for DedicatedScheduler 60.0
12/28/10 20:32:27 (pid:15569) Completed REQUEST_CLAIM to startd slot1@xxxxxxxxxxxxxxxxxxxxxxxxx <10.0.0.5:48869> for DedicatedScheduler
12/28/10 20:32:27 (pid:15569) ERROR "Assertion ERROR on (rec->cluster == -1)" at line 3093 in file /tmp/buildd/condor-7.5.4+git530-gcc99723/src/condor_schedd.V6/dedicated_scheduler.cpp
As you can see from the commit hash in the last line, this is using
fairly recent code. The comment just before the failing assertion says:
// Cool, this match wasn't allocated to anyone, so we
// don't have to worry about it. If the match isn't
// allocated to anyone, the cluster better be -1.
It looks like I am facing two problems: 1. the job is not successfully
scheduled in the first place, and 2. the schedd crashes.
In the execute node StartLog this time frame looks like this:
12/28/10 20:32:27 slot1: match_info called
12/28/10 20:32:27 slot1: Received match <10.0.0.5:48869>#1293584652#6#...
12/28/10 20:32:27 slot1: Started match timer (120) for 120 seconds.
12/28/10 20:32:27 slot1: State change: match notification protocol successful
12/28/10 20:32:27 slot1: Changing state: Unclaimed -> Matched
12/28/10 20:32:27 slot1: Canceled match timer (120)
12/28/10 20:32:27 slot1: Schedd addr = <10.0.0.1:57946>
12/28/10 20:32:27 slot1: Alive interval = 300
12/28/10 20:32:27 slot1: Received ClaimId from schedd (<10.0.0.5:48869>#1293584652#6#...)
12/28/10 20:32:27 slot1: Machine requirements not satisfied.
12/28/10 20:32:27 slot1: State change: claiming protocol failed
12/28/10 20:32:27 slot1: Changing state: Matched -> Owner
12/28/10 20:32:27 slot1: State change: IS_OWNER is false
12/28/10 20:32:27 slot1: Changing state: Owner -> Unclaimed
So, although initially matched, the job gets rejected in the end. I
cannot figure out which 'machine requirements' aren't satisfied.
Submitting the same job as vanilla works like a charm. I suspect it has
something to do with the slot configuration (see below).
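To make the slot's path through the failed claim easier to follow, here is a minimal Python sketch that pulls only the "Changing state" lines out of a StartLog excerpt. The names `STATE_CHANGE` and `state_transitions` are made up for illustration, and the regular expression assumes the timestamp and slot-name layout seen in the excerpt above; other Condor versions or debug levels may format these lines differently.

```python
import re

# Matches StartLog lines such as:
#   12/28/10 20:32:27 slot1: Changing state: Unclaimed -> Matched
# The layout is assumed from the excerpt in this post, not from a format spec.
STATE_CHANGE = re.compile(
    r"^(?P<date>\d{2}/\d{2}/\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<slot>\S+): Changing state: (?P<old>\w+) -> (?P<new>\w+)$"
)


def state_transitions(lines):
    """Return (time, slot, old_state, new_state) for each state-change line."""
    transitions = []
    for line in lines:
        match = STATE_CHANGE.match(line.strip())
        if match:
            transitions.append(
                (match["time"], match["slot"], match["old"], match["new"])
            )
    return transitions


# Lines copied from the StartLog excerpt in this post.
START_LOG = """\
12/28/10 20:32:27 slot1: Changing state: Unclaimed -> Matched
12/28/10 20:32:27 slot1: Machine requirements not satisfied.
12/28/10 20:32:27 slot1: State change: claiming protocol failed
12/28/10 20:32:27 slot1: Changing state: Matched -> Owner
12/28/10 20:32:27 slot1: State change: IS_OWNER is false
12/28/10 20:32:27 slot1: Changing state: Owner -> Unclaimed
""".splitlines()

for time, slot, old, new in state_transitions(START_LOG):
    print(f"{time} {slot}: {old} -> {new}")
```

On this excerpt the slot goes Unclaimed, Matched, Owner and back to Unclaimed within the same second, so it never reaches a claimed state.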
Here is the relevant config of the execute node:
SLOT_TYPE_1 = cpus=100%, ram=98%, swap=10%, disk=50%
SLOT_TYPE_1_PARTITIONABLE = TRUE
NUM_SLOTS_TYPE_1 = 1
START = TRUE
SUSPEND = FALSE
PREEMPT = FALSE
KILL = FALSE
ALLOW_WRITE = *.xxxxx.xxxxxxxxx.xxx
DedicatedScheduler = "DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxxx"
STARTD_ATTRS = $(STARTD_ATTRS), DedicatedScheduler
STARTD_DEBUG = D_FULLDEBUG
Both before the crash and after the schedd is restarted, submission of
vanilla jobs continues to work just fine.
Thanks for any advice,
Michael
--
Michael Hanke
http://XXX.voxindeserto.de