Mailing List Archives Authenticated access	UW Madison Computer Sciences Department Computer Systems Lab


[Condor-users] 'parallel' universe job submission crashes SCHEDD

Date: Tue, 28 Dec 2010 21:01:25 -0500
From: Michael Hanke <michael.hanke@xxxxxxxxx>
Subject: [Condor-users] 'parallel' universe job submission crashes SCHEDD

Hi,

I'm trying to figure out a suitable configuration for our Condor pool. I
have managed to get vanilla jobs working nicely and I am now working on
the configuration for the parallel universe. I was basically following
the manual line by line. However, upon submission of a job the schedd
crashes and the job stays idle forever.

The submit file looks like this:

  Executable = /bin/sleep
  Arguments  = 30
  machine_count = 2
  Universe   = parallel
  output     = runner.out
  error      = runner.error
  Log        = runner.log
  Queue

The last traces from SchedLog:

12/28/10 20:32:21 (pid:15569) Sent ad to central manager for XXX@xxxxxxxxxxxxxxxxxxx
12/28/10 20:32:21 (pid:15569) Sent ad to 1 collectors for XXX@xxxxxxxxxxxxxxxxxxx
12/28/10 20:32:21 (pid:15569) Can't find CondorPlatform in classad for negotiator head1.xxxxx.xxxxxxxxx.xxx
12/28/10 20:32:21 (pid:15569) Can't find CondorPlatform in classad for negotiator head1.xxxxx.xxxxxxxxx.xxx
12/28/10 20:32:21 (pid:15569) Inserting new attribute Scheduler into non-active cluster cid=60 acid=-1
12/28/10 20:32:26 (pid:15569) Sent ad to central manager for XXX@xxxxxxxxxxxxxxxxxxx
12/28/10 20:32:26 (pid:15569) Sent ad to 1 collectors for XXX@xxxxxxxxxxxxxxxxxxx
12/28/10 20:32:26 (pid:15569) Can't find CondorPlatform in classad for negotiator head1.xxxxx.xxxxxxxxx.xxx
12/28/10 20:32:26 (pid:15569) Can't find CondorPlatform in classad for negotiator head1.xxxxx.xxxxxxxxx.xxx
12/28/10 20:32:26 (pid:15569) Inserting new attribute Scheduler into non-active cluster cid=60 acid=-1
12/28/10 20:32:27 (pid:15569) Activity on stashed negotiator socket: <10.0.0.1:50077>
12/28/10 20:32:27 (pid:15569) Negotiating for owner: DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxxx
12/28/10 20:32:27 (pid:15569) Can't find CondorPlatform in classad for startd slot1@xxxxxxxxxxxxxxxxxxxxxxxxx
12/28/10 20:32:27 (pid:15569) Request was NOT accepted for claim slot1@xxxxxxxxxxxxxxxxxxxxxxxxx <10.0.0.5:48869> for DedicatedScheduler 60.0
12/28/10 20:32:27 (pid:15569) Completed REQUEST_CLAIM to startd slot1@xxxxxxxxxxxxxxxxxxxxxxxxx <10.0.0.5:48869> for DedicatedScheduler
12/28/10 20:32:27 (pid:15569) ERROR "Assertion ERROR on (rec->cluster == -1)" at line 3093 in file /tmp/buildd/condor-7.5.4+git530-gcc99723/src/condor_schedd.V6/dedicated_scheduler.cpp

As you can see from the commit hash in the last line, this is using
fairly recent code. The comment just before the failing assertion says:

// Cool, this match wasn't allocated to anyone, so we
// don't have to worry about it.  If the match isn't
// allocated to anyone, the cluster better be -1.

Looks like I am facing two problems: 1. the job is not successfully
scheduled in the first place, and 2. the schedd crashes.
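
To make the two problems easier to spot in a longer SchedLog, here is a
minimal sketch that pulls out the rejected claim and the fatal error. It
assumes every line has the "date time (pid:N) message" layout shown
above; the names LINE and notable_events are made up for illustration
and are not part of Condor.

```python
import re

# Assumed SchedLog layout: "<date> <time> (pid:<n>) <message>"
LINE = re.compile(r"^(\S+ \S+) \(pid:(\d+)\) (.*)$")


def notable_events(lines):
    """Return (timestamp, pid, message) for rejected claims and ERROR lines."""
    events = []
    for line in lines:
        m = LINE.match(line.strip())
        if not m:
            continue  # skip anything that does not follow the assumed layout
        stamp, pid, message = m.groups()
        if message.startswith("ERROR") or "NOT accepted" in message:
            events.append((stamp, int(pid), message))
    return events


# Three lines copied from the SchedLog excerpt above.
excerpt = """\
12/28/10 20:32:27 (pid:15569) Activity on stashed negotiator socket: <10.0.0.1:50077>
12/28/10 20:32:27 (pid:15569) Request was NOT accepted for claim slot1@xxxxxxxxxxxxxxxxxxxxxxxxx <10.0.0.5:48869> for DedicatedScheduler 60.0
12/28/10 20:32:27 (pid:15569) ERROR "Assertion ERROR on (rec->cluster == -1)" at line 3093 in file /tmp/buildd/condor-7.5.4+git530-gcc99723/src/condor_schedd.V6/dedicated_scheduler.cpp
""".splitlines()

events = notable_events(excerpt)
for stamp, pid, message in events:
    print(stamp, pid, message)
```

On this excerpt it keeps the "Request was NOT accepted" line and the
assertion failure, in that order, which matches the sequence described
above: the claim is refused first and the schedd aborts right after.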

In the execute node StartLog this time frame looks like this:


12/28/10 20:32:27 slot1: match_info called
12/28/10 20:32:27 slot1: Received match <10.0.0.5:48869>#1293584652#6#...
12/28/10 20:32:27 slot1: Started match timer (120) for 120 seconds.
12/28/10 20:32:27 slot1: State change: match notification protocol successful
12/28/10 20:32:27 slot1: Changing state: Unclaimed -> Matched
12/28/10 20:32:27 slot1: Canceled match timer (120)
12/28/10 20:32:27 slot1: Schedd addr = <10.0.0.1:57946>
12/28/10 20:32:27 slot1: Alive interval = 300
12/28/10 20:32:27 slot1: Received ClaimId from schedd (<10.0.0.5:48869>#1293584652#6#...)
12/28/10 20:32:27 slot1: Machine requirements not satisfied.
12/28/10 20:32:27 slot1: State change: claiming protocol failed
12/28/10 20:32:27 slot1: Changing state: Matched -> Owner
12/28/10 20:32:27 slot1: State change: IS_OWNER is false
12/28/10 20:32:27 slot1: Changing state: Owner -> Unclaimed


So, although initially matched, the job gets rejected in the end. I
cannot figure out which 'machine requirements' aren't satisfied.
Submitting the same job as vanilla works like a charm. I suspect it has
something to do with the slot configuration (see below).
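
For reference, the slot's path through the claiming protocol can be
extracted from the StartLog mechanically. This is only a sketch,
assuming the "Changing state: A -> B" line format seen in the excerpt
above; STATE_CHANGE and slot_transitions are hypothetical names, not
anything shipped with Condor.

```python
import re

# Matches StartLog lines such as:
#   12/28/10 20:32:27 slot1: Changing state: Unclaimed -> Matched
STATE_CHANGE = re.compile(
    r"^(?P<date>\S+) (?P<time>\S+) (?P<slot>\S+): "
    r"Changing state: (?P<old>\w+) -> (?P<new>\w+)$"
)


def slot_transitions(lines):
    """Return (time, slot, old_state, new_state) for each state change."""
    result = []
    for line in lines:
        m = STATE_CHANGE.match(line.strip())
        if m:
            result.append((m["time"], m["slot"], m["old"], m["new"]))
    return result


# Lines copied from the StartLog excerpt above.
excerpt = """\
12/28/10 20:32:27 slot1: Changing state: Unclaimed -> Matched
12/28/10 20:32:27 slot1: Machine requirements not satisfied.
12/28/10 20:32:27 slot1: State change: claiming protocol failed
12/28/10 20:32:27 slot1: Changing state: Matched -> Owner
12/28/10 20:32:27 slot1: Changing state: Owner -> Unclaimed
""".splitlines()

transitions = slot_transitions(excerpt)
for time, slot, old, new in transitions:
    print(f"{time} {slot}: {old} -> {new}")
```

For the excerpt this yields Unclaimed -> Matched -> Owner -> Unclaimed,
i.e. the slot never reaches a claimed state.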

Here is the relevant config of the execute node:


SLOT_TYPE_1 = cpus=100%, ram=98%, swap=10%, disk=50%
SLOT_TYPE_1_PARTITIONABLE = TRUE
NUM_SLOTS_TYPE_1 = 1
START = TRUE
SUSPEND = FALSE
PREEMPT = FALSE
KILL = FALSE
ALLOW_WRITE = *.xxxxx.xxxxxxxxx.xxx
DedicatedScheduler = "DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxxx"
STARTD_ATTRS = $(STARTD_ATTRS), DedicatedScheduler
STARTD_DEBUG = D_FULLDEBUG


Both before the crash and after the schedd is restarted, submission of
vanilla jobs continues to work just fine.

Thanks for any advice,

Michael

-- 
Michael Hanke
http://XXX.voxindeserto.de

Follow-Ups:
- Re: [Condor-users] 'parallel' universe job submission crashes SCHEDD
  - From: Michael Hanke
