[HTCondor-users] Debugging config setup for Parallel Universe machine pool

I am a new Condor user and am stalled attempting to execute a Parallel Universe job on a setup with 2 host machines.

Our software is designed to scale to thousands of host machines, with the aggregate behaving as if it were a single logical database running on a huge central machine. As such, HTCondor seems like the ideal tool to manage our multi-host testing requirements.

Condor installation and initial setup were easy. Specifics:

- version 7.6.10

- non-root install

- Manager, Scheduler and 1 Execute node on host p1

- Pooled execute node on host p2

- after initial install a simple Vanilla Universe job ran fine

- have a shared main condor_config, plus a localized condor_config.local for each host

I made the following config file changes (note: in all these samples I'm replacing IP addresses and full hostnames with short hostnames):

============ Main condor_config ================

## central pool manager

COLLECTOR_NAME = NuoDB-DHentchel-p1

## Map Scheduler to Parallel Universe group; this defines a pool for concurrent, parallel runs

SCHEDD_NAME = DedicatedScheduler

DedicatedScheduler = "DedicatedScheduler@p1"

ParallelSchedulingGroup = "P5"

## Security settings:

ALLOW_WRITE = $(ALLOW_WRITE), $(CONDOR_HOST)

============ Each condor_config.local ================

## Bind to parent scheduler group, to enable parallel universe dispatch

STARTD_ATTRS = $(STARTD_ATTRS), DedicatedScheduler

STARTD_ATTRS = $(STARTD_ATTRS), ParallelSchedulingGroup

## Tune STARTD for dedicated, parallel scheduling

START = True

RANK = Scheduler =?= $(DedicatedScheduler)

LOCAL_DIR = /var/local/condor/$(HOSTNAME)

(For p1 only):

DAEMON_LIST = COLLECTOR, MASTER, NEGOTIATOR, SCHEDD, STARTD

ALLOW_WRITE = $(ALLOW_WRITE), p2

(For p2 only):

DAEMON_LIST = MASTER, STARTD
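For comparison, the policy settings below are the ones I understand the manual's dedicated-resource example to recommend alongside START and RANK. They are not among the changes listed above, and I am only assuming they matter for this problem, so treat this as a sketch rather than a verified fix.

```
## Sketch only: dedicated-resource policy, as I understand the manual's example
DedicatedScheduler = "DedicatedScheduler@p1"
STARTD_ATTRS = $(STARTD_ATTRS), DedicatedScheduler
START = True
SUSPEND = False
CONTINUE = True
PREEMPT = False
KILL = False
WANT_SUSPEND = False
WANT_VACATE = False
RANK = Scheduler =?= $(DedicatedScheduler)
```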

I validated that the critical config variables are set on both machines:

condor_config_val COLLECTOR_NAME ==> NuoDB-DHentchel-p1

condor_config_val SCHEDD_NAME ==> DedicatedScheduler

condor_config_val DedicatedScheduler ==> "DedicatedScheduler@p1"

condor_config_val ParallelSchedulingGroup ==> "P1"

condor_config_val STARTD_ATTRS ==> COLLECTOR_HOST_STRING, DedicatedScheduler, ParallelSchedulingGroup

condor_config_val RANK ==> Scheduler =?= "DedicatedScheduler@p1"

Now I submit the following job:

universe = parallel

Scheduler = "DedicatedScheduler@p1"

executable = /bin/sleep

arguments = 30

machine_count = 2

+WantParallelSchedulingGroups = True

error = log/simple-sleep.$(PID).err

output = log/simple-sleep.$(PID).out

log = log/simple-sleep.$(PID).log

queue
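To check whether the scheduling-group request is the part that fails, one experiment is the same job without +WantParallelSchedulingGroups. This is an untested sketch: I am assuming $(Cluster) and $(Node) are the appropriate macros for per-job and per-node file names in the parallel universe, since I am not certain $(PID) expands in a submit description file.

```
## Sketch: same sleep job, no scheduling-group requirement
universe = parallel
executable = /bin/sleep
arguments = 30
machine_count = 2
error = log/simple-sleep.$(Cluster).$(Node).err
output = log/simple-sleep.$(Cluster).$(Node).out
log = log/simple-sleep.$(Cluster).log
queue
```

If this variant runs, the problem is presumably in how the slots advertise ParallelSchedulingGroup rather than in the dedicated scheduler setup itself.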

The job sits in the queue and is never dispatched:

-- Submitter: DedicatedScheduler@p1: <p1:51203> : p1

 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
   1.0   dhentchel       3/21 16:12   0+00:00:00 I  0   0.0  sleep 30
   2.0   dhentchel       3/22 09:57   0+00:00:00 I  0   0.0  sleep 30

From the log files, I see some interesting messages.

CollectorLog:

03/21/13 16:12:34 SubmittorAd : Inserting ** "< dhentchel@xxxxxxxxx DedicatedScheduler@p1 , p1 >"

03/21/13 16:12:34 stats: Inserting new hashent for 'Submittor':'dhentchel@xxxxxxxxx':'p1'

MatchLog:

03/22/13 09:57:46 Matched 2.0 dhentchel@xxxxxxxxx <p1:51203> preempting none <p2:33543> slot1@p2

03/22/13 09:58:46 Matched 2.0 dhentchel@xxxxxxxxx <p1:51203> preempting none <p2:33543> slot2@p2

(and so on; all expected slots on both hosts show up)

SchedLog:

03/21/13 16:09:08 (pid:10111) ** SubsystemInfo: name=SCHEDD type=SCHEDD(5) class=DAEMON(1)

03/21/13 16:12:34 (pid:10111) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s

03/21/13 16:12:34 (pid:10111) Sent ad to central manager for dhentchel@xxxxxxxxx

03/21/13 16:12:34 (pid:10111) Sent ad to 1 collectors for dhentchel@xxxxxxxxx

03/21/13 16:12:34 (pid:10111) Inserting new attribute Scheduler into non-active cluster cid=1 acid=-1

03/21/13 16:12:34 (pid:10111) Trying to satisfy job with group scheduling

03/21/13 16:12:34 (pid:10111) Job requested parallel scheduling groups, but no groups found

So all Condor processes seem to be communicating correctly, but the submit request with "universe = parallel" is failing with the message "Job requested parallel scheduling groups, but no groups found".

I'm hoping that someone can identify what is missing in the configuration, or perhaps give me advice on how to dig deeper to find where the machine pool setup is going astray.
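In the meantime, one way I could dig deeper is to raise the log verbosity of the daemons involved. A minimal sketch, assuming the standard <SUBSYS>_DEBUG settings and the D_FULLDEBUG level behave in 7.6.10 as documented; the lines would go in condor_config.local on p1 (only the STARTD line is relevant on p2), followed by a condor_reconfig:

```
## Sketch: more verbose logging for the daemons involved in matching
SCHEDD_DEBUG = D_FULLDEBUG
NEGOTIATOR_DEBUG = D_FULLDEBUG
STARTD_DEBUG = D_FULLDEBUG
```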

thanks,

dave

David Hentchel

Performance Engineer

www.nuodb.com

(617) 803 - 1193